Learn and shine

Saturday, March 26, 2016

Getting started with HBase, HBase Compactions, Load data into HBase Using Sqoop

This post explains HBase compactions, how to install and start HBase, and the basic HBase operations.
It also shows how to load data into HBase using Sqoop.
HBase Compactions

1. HBase writes out immutable files as data is added.
a). Each store consists of rowkey-ordered files.
b). Because the files are immutable, more files accumulate over time.
2. A compaction rewrites several files into one.
a). Fewer files mean faster reads.
3. A major compaction rewrites all files in a store into one.
a). It can drop deleted records and older versions.
4. In a minor compaction, the files to compact are selected based on a heuristic.
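
The points above can be sketched in a few lines of Python. This is only a toy model, not HBase code: a store file is assumed to be a sorted list of (row_key, timestamp, value) tuples, a value of None stands in for a delete marker that hides the whole row, and the name major_compact is made up for this illustration.

```python
import heapq

# Hypothetical stand-in for a store file: a list of (row_key, timestamp, value)
# cells sorted by row key, newest first within a row. value None marks a delete.
def major_compact(store_files, max_versions=1):
    """Merge all sorted store files into one, dropping deleted and old cells."""
    # Merge by row key, newest timestamp first within a row.
    merged = heapq.merge(*store_files, key=lambda c: (c[0], -c[1]))
    compacted = []
    current_row, kept, deleted = None, 0, False
    for row, ts, value in merged:
        if row != current_row:
            current_row, kept, deleted = row, 0, False
        if value is None:
            deleted = True  # the marker hides every older cell of this row
            continue
        if deleted or kept >= max_versions:
            continue
        compacted.append((row, ts, value))
        kept += 1
    return compacted

files = [
    [("row1", 1, "a"), ("row2", 1, "b")],
    [("row1", 2, "a2"), ("row3", 2, "c")],
    [("row2", 3, None)],
]
print(major_compact(files))
```

Three small files go in and one file comes out: row2 disappears because of its delete marker, and only the newest version of row1 survives.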

How to install and start HBase
1. Download the latest version of HBase from http://www.apache.org/dyn/closer.cgi/hbase/ or https://hbase.apache.org/
2. Once it is downloaded, untar the archive:
tar -xvzf hbase-1.0.1.1-hadoop1-bin.tar.gz
3. Go to the HBase directory:
cd /usr/local/hbase/hbase-1.0.1.1/
4. Start HBase:
./bin/start-hbase.sh
5. Once it has started, open the HBase shell:
./bin/hbase shell

The shell prompt appears. Enter list to see the existing tables.
If this command runs without errors, HBase has started successfully.

hbase> list

Now we will see the SQL-like operations in the HBase shell. The shell commands are case-sensitive and use plain single quotes.
HBase Basic operations
Create a table syntax
create 'table_name', 'column_family'

hbase> create 'htest','cf'

Insert data

put 'table_name','row_key1','column_family:columnname','v1'

Update data

put 'table_name','row_key1','column_family:columnname','v2'

Select a single row

get 'table_name','row_key1'

Select whole table

scan 'table_name'

Delete a particular cell value

delete 'table_name','row_key1','column_family:columnname'

Alter existing table
Before altering a table, first disable it. Enable it again afterwards.

disable 'table_name'
alter 'table_name', {NAME => 'column_family'}
enable 'table_name'

Drop the table
First disable the table that is going to be dropped.

hbase> disable 'testdrop1'
hbase> drop 'testdrop1'

How to insert data into an HBase table from Java and read it back?
First open Eclipse -> create a new project -> class -> HBaseTest.java
Copy and paste the code below. If there are compilation errors, add the respective HBase jars to the project's build path. The table testHBaseTable with the column family littleFamily must already exist, for example created from the HBase shell.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseTest {
    public static void main(String[] args) throws IOException {
        // The Configuration object tells the client where to connect.
        // HBaseConfiguration.create() reads whatever is set in hbase-site.xml and
        // hbase-default.xml, as long as these can be found on the classpath.
        Configuration config = HBaseConfiguration.create();
        // Instantiate an HTable object that connects to the existing table testHBaseTable.
        HTable table = new HTable(config, "testHBaseTable");
        // To add a row use Put. The Put constructor takes the row key as a byte array;
        // the Bytes class has utilities for converting all kinds of Java types to byte arrays.
        Put p = new Put(Bytes.toBytes("testRow"));

        // Specify the column family, column qualifier and value of the cell to update.
        // The column family must already exist in the table schema; the qualifier can be anything.
        // All must be specified as byte arrays, as HBase is all about byte arrays.
        p.add(Bytes.toBytes("littleFamily"), Bytes.toBytes("littleQualifier"), Bytes.toBytes("little Value"));
        // Once all the values are set on the Put instance, HTable#put pushes the change into HBase.
        table.put(p);
        // Now retrieve the data we have just written to the table.
        Get g = new Get(Bytes.toBytes("testRow"));
        Result r = table.get(g);
        byte[] value = r.getValue(Bytes.toBytes("littleFamily"), Bytes.toBytes("littleQualifier"));
        String valueString = Bytes.toString(value);
        System.out.println("GET:" + valueString);
        // Sometimes we don't know the row key; then we can use a Scan to retrieve all the data from the table.
        Scan s = new Scan();
        s.addColumn(Bytes.toBytes("littleFamily"), Bytes.toBytes("littleQualifier"));
        ResultScanner scanner = table.getScanner(s);
        try {
            for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
                System.out.println("Found Row record:" + rr);
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}

Different ways to load the data into HBase
1. HBase Shell
2. Using Client API
3. Using PIG
4. Using SQOOP

How to load data into HBase using Sqoop?
Sqoop can be used to import data directly from an RDBMS into HBase.
First we need to install Sqoop.
1. Download Sqoop from http://www.apache.org/dyn/closer.lua/sqoop/1.4.6
2. Untar Sqoop:

tar -xvzf sqoop-1.4.6.bin__hadoop-0.23.tar.gz

3. Go to the bin directory and run the command below, supplying your own value after each option (only --hbase-create-table takes no value).

sqoop import \
    --connect jdbc:mysql:// \
    --username  --password  \
    --table  \
    --hbase-table  \
    --column-family  \
    --hbase-row-key  \
    --hbase-create-table

This is how we will work with HBase.
Thank you very much for viewing this post.

Friday, March 25, 2016

HBase Basics, HBase Architecture, Getting started with No SQL Database HBase, HBase Components

This post explains the history of HBase, the HBase architecture, basic details about HBase, and the different types of NoSQL databases.
History of HBase
The ideas started at Google and were later reimplemented as open source projects:
GFS -> HDFS
MapReduce -> MapReduce
Big Table -> Apache HBase

Any SQL system – RDBMS
When user data keeps growing, an RDBMS is usually tuned in the following ways:
1. Add a cache mechanism to improve performance.
2. The cache mechanism also has certain limits.
3. Remove indexes.
4. Avoid joins.
5. Use materialized views.
Once we do all of the above, the advantages of an RDBMS are gone.
Google faced the same problem and built Big Table to solve it.
We use HBase for fast access to large data sets.
Features that Hive does not support, such as CRUD operations, can be done with HBase.
HBase is very useful when data must be read or updated in real time.
Real-time ad targeting is one example where this fast access matters.
What are the common requirements that existing data processing with Hadoop or Hive does not handle well?
1. Huge data
2. Fast random access
3. Structured data
4. Variable schema: columns can be added at runtime, which an RDBMS does not support.
5. Need for compression
6. Need for distribution (sharding)
How will a traditional system (RDBMS) solve this?
Case: suppose we want to design a LinkedIn-style database to maintain connections.
There are 2 tables:
1. Users – id, Name, Sex, age
2. Connections – User_id, Connection_id, type
In HBase, we can save all the details about users and their connections in the same column family.
Characteristics of Big Table
1. Distributed Database
2. Sorted Data
3. Sparse Data Store
4. Automatic Sharding.

Sorted Data
Example: how is data stored in a sorted way?
1. www.abc.com
2. www.ghf.com
3. mail.abc.com
With normal storage these keys are not adjacent, so when a user accesses abc.com, mail.abc.com is not returned together with www.abc.com.

If we reverse the domain parts and store the keys sorted, the data is stored like below.
com.abc.mail
com.abc.www
com.ghf.www

Stored like this, all entries for abc.com sit next to each other and are easy to access together.
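
A small Python sketch of this idea, using the three host names from the example. The helper name to_row_key is hypothetical and only illustrates how a reversed-domain row key could be built.

```python
def to_row_key(hostname):
    """Reverse the dot-separated parts: 'mail.abc.com' -> 'com.abc.mail'."""
    return ".".join(reversed(hostname.lower().split(".")))

hosts = ["www.abc.com", "www.ghf.com", "Mail.abc.com"]
row_keys = sorted(to_row_key(h) for h in hosts)

# Everything belonging to abc.com now shares the prefix "com.abc." and is
# stored in neighbouring positions, so one short range scan finds it all.
abc_keys = [k for k in row_keys if k.startswith("com.abc.")]
print(row_keys)
print(abc_keys)
```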
Sparse Data store
"Sparse" is a mathematical term. If a particular column has a null value in a row, nothing is stored for it.
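
As a rough picture of what sparse means, the sketch below models a table as a Python dictionary in which each row holds only the columns that were actually written. The row keys, column names and values are made up for the illustration.

```python
# Hypothetical rows: each row stores only the columns that have a value.
table = {
    "user1": {"info:name": "siva", "info:age": "30"},
    "user2": {"info:name": "raju"},  # no age: nothing is stored for it
}

def get_cell(table, row_key, column):
    """Return the stored value, or None when the cell was never written."""
    return table.get(row_key, {}).get(column)

# Two rows and two possible columns, but only three cells take up space.
stored_cells = sum(len(columns) for columns in table.values())
print(stored_cells, get_cell(table, "user2", "info:age"))
```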

No SQL Landscape

1. The NoSQL databases are not all the same; each was developed for its own purpose.
2. Dynamo was developed by Amazon and is available in the cloud, where we can access it.
3. Cassandra was developed at Facebook. It combines ideas from Dynamo and Big Table.

No NoSQL database has all three characteristics (consistency, availability, partition tolerance) at once.
Each satisfies only two of these properties at the same time.

HBase Definition
HBase is a non-relational (NoSQL) database that stores data as key-value pairs; it is also called the Hadoop database.
1. Sparse
2. Distributed
3. Multi-dimensional (row key, column name, timestamp), etc.
4. Sorted map
5. Consistent

Difference between HBase and RDBMS

When to use HBase

When not to use HBase?
1. When you have only a few thousand or a few million records, HBase is not advisable.
2. HBase lacks RDBMS commands. If the application requires SQL commands, do not go for HBase.
3. When we have fewer than 5 DataNodes or a replication factor below 3, there is no need for HBase. It will only be an overhead for the system.

HBase can run on a local system, but this should be considered a development configuration only.
How Facebook uses HBase for its message system

1. Facebook monitored its usage and figured out what it really needed.
What it needed was a system that could handle two types of data pattern:
1. A short set of temporal data that tends to be volatile
2. An ever-growing set of data that is rarely accessed

1. Real Time
2. Key Value
3. Linearly
4. Big Data
5. Column oriented
6. Distributed
7. Robust
8. Scalable
9. Open source
These are the characteristics of HBase.
HBase is used not only by Facebook but also by Twitter, Yahoo and others to process their large volumes of data.

Major components of HBase
1. The HBase Master
It coordinates the cluster and keeps track of all the HBase tables and their regions. Its role is similar to the NameNode in HDFS.
2. The HRegion server
The actual data is stored and served by this server.
3. The HBase Client
We interact through the client to do CRUD operations and process the data.
How does data distribution happen in HBase?

We have data rows from A to Z:
    Rows                 Region                  Server
    A1, A2               Region Null - A3        Region server1
    B2, B3, B23, B43     Region B2 - B43         Region server2
    K1, K2, Z30          Region K1 - Z30         Region server3
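
Finding the server for a row key can be sketched as a lookup in a sorted list of region start keys. This is a simplified model under two assumptions that are not from the post: each region is taken to cover everything from its start key up to the next region's start key, and the function name locate is made up. Keys are compared lexicographically, the way HBase compares row keys.

```python
import bisect

# Start keys of the three regions in the table above ("" = open start),
# with the server hosting each region.
region_start_keys = ["", "B2", "K1"]
region_servers = ["Region server1", "Region server2", "Region server3"]

def locate(row_key):
    """Return the server whose region range contains row_key."""
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_servers[index]

for key in ["A1", "B23", "Z30"]:
    print(key, "->", locate(key))
```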

How does HBase write data to files?

1. Every HBase write requires confirmation from both the Write Ahead Log (WAL) and the MemStore.
2. These two steps ensure that every write to HBase happens as fast as possible while maintaining durability.
3. The MemStore is flushed to a new HFile when it fills up.
4. The MemStore flush size is configurable (128 MB by default in HBase 1.x). Once the MemStore fills up, its contents are written to an HFile, whose default block size is 64 KB.
5. Once written, an HFile is immutable.
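
The write steps above can be modelled with a toy Python class. It is only a sketch under simplifying assumptions: the WAL is a plain list, the flush threshold counts cells instead of bytes, and the class and attribute names are invented for the illustration.

```python
class Store:
    """Toy write path: append to a log, buffer in memory, flush when full."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # stand-in for the write-ahead log
        self.memstore = {}   # row_key -> value, held in memory
        self.hfiles = []     # flushed files, each an immutable sorted tuple
        self.flush_threshold = flush_threshold  # cell count, not bytes

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # 1. log the write for durability
        self.memstore[row_key] = value     # 2. then update the in-memory store
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffered cells out sorted by row key, then start afresh.
        self.hfiles.append(tuple(sorted(self.memstore.items())))
        self.memstore = {}

store = Store(flush_threshold=2)
store.put("b", "1")
store.put("a", "2")   # second cell reaches the threshold and triggers a flush
store.put("c", "3")
print(store.hfiles, store.memstore)
```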
HBase Read File
1. Data is reconciled from the block cache, the MemStore and the HFiles to give the client an up-to-date view of the rows it requested.
2. HFiles contain a snapshot of the MemStore at the point when it was flushed. Data for a complete row can be stored across multiple HFiles.
3. In order to read a complete row, HBase must read across all HFiles that might contain information for that row and compose the complete record.
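
A minimal sketch of this reconciliation in Python. It leaves out the block cache and assumes each source is a dictionary of the form row_key -> {column: (timestamp, value)}; that layout and the function name read_row are assumptions made for the example.

```python
def read_row(row_key, memstore, hfiles):
    """Combine the newest value of each column from the MemStore and all HFiles."""
    newest = {}
    for source in [memstore] + list(hfiles):
        for column, (ts, value) in source.get(row_key, {}).items():
            # Keep the cell with the highest timestamp for every column.
            if column not in newest or ts > newest[column][0]:
                newest[column] = (ts, value)
    return {column: value for column, (ts, value) in newest.items()}

hfile1 = {"r1": {"cf:a": (1, "old"), "cf:b": (1, "b1")}}
hfile2 = {"r1": {"cf:a": (2, "new")}}
memstore = {"r1": {"cf:c": (3, "c3")}}
print(read_row("r1", memstore, [hfile1, hfile2]))
```

The row r1 is spread over two HFiles and the MemStore, and the read has to visit all three to return the complete, current record.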

HFile Compaction

During compaction, the HFiles are merged and written out as a single compacted HFile.
HBase Components
1. Region – a range of rows stored together
2. Region servers – each serves one or more regions
a. A region is served by only one region server
3. Master Server – daemon responsible for managing the HBase cluster.
4. HBase stores its data in HDFS and relies on HDFS's high availability and fault tolerance.
HBase Architecture

This architecture explains how HBase works.

These are the basics of HBase. My next post shows how to install and work with HBase.
Thank you very much for viewing this post.

Thursday, March 24, 2016

Hive Dynamic , Static Partitions,User defined functions(UDF) with Java

This post covers more advanced concepts in Hive: dynamic partitions, static partitions, custom map reduce scripts, and Hive UDFs using Java and Python.

Configuring Hive to allow partitions
A query across all partitions can trigger an enormous MapReduce job if the table data and number of partitions are large. A highly suggested safety measure is putting Hive into strict mode, which prohibits queries on a partitioned table without a WHERE clause that filters the partitions.
Dynamic partitioning has its own mode setting. In strict mode it requires at least one static partition column, so we set it to nonstrict, as in the following session.

Dynamic Partitioning –configuration

Hive> set hive.exec.dynamic.partition.mode=nonstrict;
Hive> set hive.exec.dynamic.partition=true;
Hive> set hive.enforce.bucketing=true;

Once this is configured, we can create a dynamically partitioned table.

Example:
Source table:
1. Hive> create table transaction_records(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, City STRING, State STRING, Spendby STRING) row format delimited fields terminated by ',' stored as textfile;
Create partitioned table:
1. Hive> create table transaction_recordsByCat(txnno INT, txndate STRING, custno INT, amount DOUBLE, product STRING, City STRING, State STRING, Spendby STRING)
partitioned by (category STRING)
clustered by (State) into 10 buckets
row format delimited fields terminated by ',' stored as textfile;

In the above statement we are partitioning the table by category and bucketing it into 10 buckets. Hive creates buckets 0-9 and assigns each row to one of them based on the hash value of the State column.
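
The bucket assignment can be pictured as "hash the column value, then take it modulo the number of buckets". The Python sketch below assumes a Java-style 31-based string hash; whether Hive uses exactly this hash for a given column type is an assumption here, and both function names are invented for the illustration.

```python
def java_style_hash(text):
    """31-based rolling hash over UTF-8 bytes, wrapped to a signed 32-bit int."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (31 * h + byte) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h

def bucket_for(value, num_buckets=10):
    """Map a column value to a bucket number in 0..num_buckets-1."""
    # Clear the sign bit first so the modulo result is never negative.
    return (java_style_hash(value) & 0x7FFFFFFF) % num_buckets

for state in ["Texas", "Ohio", "Texas"]:
    print(state, "-> bucket", bucket_for(state))
```

The same value always lands in the same bucket, which is what makes bucketed sampling and bucketed joins possible.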

The category column must not be listed in the table's column list, since it is declared as the partition column.

Insert the existing table data into the newly created partitioned table.

Hive> from transaction_records txn INSERT OVERWRITE TABLE transaction_recordsByCat PARTITION(category) select txn.txnno, txn.txndate, txn.custno, txn.amount,
txn.product, txn.City, txn.State, txn.Spendby, txn.category DISTRIBUTE BY category;
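
What the dynamic partition insert does with the rows can be sketched in Python: every distinct value of the partition column gets its own directory, named column=value, and the partition column itself is not stored inside the rows. The sample rows and the function name dynamic_partition are made up for the illustration.

```python
from collections import defaultdict

def dynamic_partition(rows, partition_column):
    """Group rows by the partition column and drop it from the stored rows."""
    partitions = defaultdict(list)
    for row in rows:
        data = {k: v for k, v in row.items() if k != partition_column}
        directory = "{}={}".format(partition_column, row[partition_column])
        partitions[directory].append(data)
    return dict(partitions)

# Hypothetical rows, not taken from the real transaction data.
rows = [
    {"txnno": 1, "category": "Games"},
    {"txnno": 2, "category": "Puzzles"},
    {"txnno": 3, "category": "Games"},
]
for directory, stored in sorted(dynamic_partition(rows, "category").items()):
    print(directory, stored)
```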

Static partition
If we receive data every month, we can use static partitions and name each partition explicitly.

Hive> create table logmessage(name string, id int) partitioned by (year int, month int) row format delimited fields terminated by '\t';

How to add a partition to a static partition table?

Hive> alter table logmessage add partition(year=2014, month=2);

Custom Map Reduce script using Hive

Hive QL allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Sample data scenario
We have movie rating data: different users give different ratings to the same movie or to different movies.

The user_movie_data.txt file has tab-separated data like below, in the order userid, rating, unixtime:

1      1       134564324567
2      3       134564324567
3      1       134564324567
4      2       134564324567
5      2       134564324567
6      1       134564324567

With the above data, we create a table called u_movie_data and load the data into it.

Hive> CREATE TABLE u_movie_data(userid INT, rating INT, unixtime STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
Hive> LOAD DATA LOCAL INPATH '/usr/local/hive_demo/user_movie_data.txt' OVERWRITE INTO TABLE u_movie_data;

Any custom logic can be plugged in to convert the unix time into a weekday. Here we use a Python script, saved as weekday_mapper.py.

import sys
import datetime

# Read tab-separated rows from Hive on stdin and write them back with the
# unix time replaced by the ISO weekday (1 = Monday ... 7 = Sunday).
for line in sys.stdin:
    line = line.strip()
    userid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, rating, str(weekday)]))

To execute the Python script in Hive, first add the file in the Hive shell.

Hive> add FILE /usr/local/hive_demo/weekday_mapper.py;

Now load the data into the new table through TRANSFORM. The columns passed to the script match the three columns of u_movie_data, and the target table u_movie_data_new must already exist with the columns userid, rating and weekday.

INSERT OVERWRITE TABLE u_movie_data_new
       SELECT TRANSFORM(userid, rating, unixtime)
       USING 'python weekday_mapper.py'
       AS (userid, rating, weekday) from u_movie_data;
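
TRANSFORM hands each selected row to the script as one tab-separated line and reads the output lines back as rows. The sketch below mimics that flow in plain Python so the mapper logic can be tried without Hive. It assumes the three-column rows of u_movie_data, uses UTC so the result does not depend on the machine's time zone, and the function names map_line and transform are made up.

```python
from datetime import datetime, timezone

def map_line(line):
    """Turn 'userid<TAB>rating<TAB>unixtime' into 'userid<TAB>rating<TAB>weekday'."""
    userid, rating, unixtime = line.strip().split("\t")
    # UTC is used here; the script run by Hive uses the server's local time.
    moment = datetime.fromtimestamp(float(unixtime), tz=timezone.utc)
    return "\t".join([userid, rating, str(moment.isoweekday())])

def transform(lines):
    """Apply map_line to every input row, as Hive does through stdin/stdout."""
    return [map_line(line) for line in lines]

# Timestamp 0 is Thursday, 1 January 1970, so the weekday column is 4.
print(transform(["1\t1\t0\n", "2\t3\t259200\n"]))
```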

Hive QL – User-defined function
1. Suppose we have 2 columns: id of type string and unixtimestamp of type string.
Create a data set with these 2 columns (udf_input.txt) and place it inside /usr/local/hive_demo/

one,1456432145676
two,1456432145676
three,1456432145676
four,1456432145676
five,1456432145676
six,1456432145676

Now we can create a table and load the data into it.

Hive> create table udf_testing (id string, unixtimestamp string)
      row format delimited fields terminated by ',';
Hive> load data local inpath '/usr/local/hive_demo/udf_input.txt' into table udf_testing;
Hive> select * from udf_testing;

Now we will write a user-defined function in Java to get a more meaningful date and time format.

Open Eclipse -> create a new Java project and a new class, then add the code below inside the Java class.
Add the jars from the Hive installation to the build path.

import java.text.DateFormat;
import java.util.Date;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class UnixTimeToDate extends UDF {
    // Called by Hive once per row with the unixtimestamp column value.
    public Text evaluate(Text text) {
        if (text == null) return null;
        long timestamp = Long.parseLong(text.toString().trim());
        return new Text(toDate(timestamp));
    }

    private String toDate(long timestamp) {
        // The sample values have 13 digits, i.e. milliseconds since the epoch,
        // which is what java.util.Date expects. Multiply by 1000 only if the
        // input is in seconds.
        Date date = new Date(timestamp);
        return DateFormat.getInstance().format(date);
    }
}

Once created, export the class as a jar file named unixtime_to_java_date.jar.
Now we need to use the jar file from Hive.
1. Add the jar file in the Hive shell, then register and call the function:

Hive> add JAR /usr/local/hive_demo/unixtime_to_java_date.jar;
Hive> create temporary FUNCTION userdate AS 'UnixTimeToDate';
Hive> select id, userdate(unixtimestamp) from udf_testing;
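
One thing worth checking before trusting the output is the unit of the timestamp. The sample values in udf_input.txt have 13 digits, which looks like milliseconds rather than seconds. The Python sketch below shows the conversion for both units; the function name to_date and its unit parameter are hypothetical and only serve as a quick cross-check of what the UDF should return.

```python
from datetime import datetime, timezone

def to_date(unix_timestamp, unit="ms"):
    """Convert a Unix timestamp given in 's' or 'ms' to a UTC datetime."""
    seconds = unix_timestamp / 1000.0 if unit == "ms" else float(unix_timestamp)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Read as milliseconds, the sample value is a date in February 2016.
sample = to_date(1456432145676, unit="ms")
print(sample.strftime("%Y-%m-%d %H:%M:%S"))
```

Read as seconds, the same number would point tens of thousands of years into the future, which is a good hint that the unit is wrong.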

This is how we will work with hive. Hope you like this post.
Thank you for viewing this post.

Monday, March 21, 2016

Apache Hive Advanced topics

This post describes more concepts in Hive.
Partitions:
1. Determine how data is stored in HDFS
2. Group the data based on some column
3. Can be defined on one or more columns
How does partitioning work?
Usually table data is stored in HDFS like below
/user/hive/warehouse//
/user/hive/warehouse//
/user/hive/warehouse//
/user/hive/warehouse//

If we know how the data arrives from the source and we usually filter it with a WHERE condition on some column,
then we can partition the data on that column, like below

/user/hive/warehouse///month-jan/
/user/hive/warehouse///month-feb/
/user/hive/warehouse///month-march/
/user/hive/warehouse///month-april/

Bucketing is used to improve the performance.

What do we mean by Partitions?
1. Partitioning means dividing a table into coarse-grained parts based on the value of a particular column, such as date.
2. This makes it faster to run queries on slices of the data.

Buckets or Clusters
1. Partitions are divided further into buckets based on some other column.
2. Used for data sampling.

Buckets:
1. Buckets give extra structure to the data that may be used for more efficient queries.
2. A join of two tables that are bucketed on the same columns, including the join column, can be implemented as a map-side join (depending on the hash value).
3. Bucketing by user id means we can easily and quickly evaluate a user-based query by running it on a randomized sample of the total set of users.

Now we will see how to work with partitioning and bucketing.
1. First create a table called transaction_records.
2. For that, first create a database called retail.

Command: to create database

Hive> create database retail;

Command: to use database

Hive> use retail;

Now we need to create a table.

Hive> create table transaction_records(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, City STRING, State STRING, Spendby STRING)
row format delimited fields terminated by ',' stored as textfile;

How to load data into the table?

Hive> LOAD DATA LOCAL INPATH '/usr/local/hive_demo/transaction/' INTO TABLE transaction_records;
Hive> select count(*) from transaction_records;

We can try different queries, as in SQL. Ex:

Aggregation:
1. select category, sum(amount) from transaction_records group by category;

Distinct values:
2. select distinct category from transaction_records;

How to copy table data into another table, a file or HDFS?

1. Insert output into another table (use the first form when the table results already exists, the second form to create it)

insert overwrite table results select * from transaction_records;
create table results as select * from transaction_records;

2. Insert output into a local directory.

insert overwrite local directory 'results' select * from transaction_records;

3. Insert output into HDFS

insert overwrite directory '/results' select * from transaction_records;

How to write all queries in a single script file and execute it?
Hive scripts are used to execute a set of Hive commands collectively. This helps in reducing the time and effort invested in writing and executing each command manually. Hive supports scripting from version 0.10.0 onwards.
Name the file hive_script.hql and place it wherever you like (here I am keeping it inside /usr/local/hive_demo/).

use retail;
create table transaction_records_script(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, City STRING, State STRING, Spendby STRING)
row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH '/usr/local/hive_demo/transaction/' INTO TABLE transaction_records_script;
select count(*) from transaction_records_script;
select category, sum(amount) from transaction_records group by category;

How to run the Hive script file:

hive -f hive_script.hql
OR
hive -f hive_script.sql (if we named our script file with a .sql extension)

Hive Joins (table joining)
Create a script to create tables called employee and email.
Before creating the script we need to create 2 files (emp.txt, email.txt) and fill them with data.

/usr/local/hive_demo/emp.txt

siva,56000,bangalore
raju,67000,chennai
arjun,25000,mumbai
sweety,54000,pune

/usr/local/hive_demo/email.txt

siva,siva@gmail.com
raju,raju@yahoo.com
arjun,arjun@aol.com
sweety,sweety@rediff.com
jatin,jatin@gmail.com
sneha,sneha@hotmail.com

Create a script, hive_join_demo.hql, to set up the tables for the join demo

use retail;
create table employee(name string, salary float, city string) row format delimited fields terminated by ',';
load data local inpath '/usr/local/hive_demo/emp.txt' into table employee;
create table email(name string, email_id string) row format delimited fields terminated by ',';
load data local inpath '/usr/local/hive_demo/email.txt' into table email;

After creating the script, run the hive_join_demo.hql file:

hive -f hive_join_demo.hql

Now we will work with joins.

Inner join

Hive> select a.name, a.city, a.salary, b.email_id from employee a join email b on a.name=b.name;

It displays name, city, salary and email id for the rows that match between the two tables.

Left outer join

Hive> select a.name, a.city, a.salary, b.email_id from employee a LEFT OUTER join email b on a.name=b.name;

It displays all the records from the first table and the matching records from the second table.

Right outer join

Hive> select a.name, a.city, a.salary, b.email_id from employee a RIGHT OUTER join email b on a.name=b.name;

It displays all the records from the second table and the matching records from the first table.
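
To see what each join returns for the sample files, the three queries can be imitated in Python with the data from emp.txt and email.txt. This is only a sketch: it assumes names are unique in both files (so a dictionary per table is enough), uses None where Hive would show NULL, and the function name join is made up.

```python
employees = {"siva": ("bangalore", 56000), "raju": ("chennai", 67000),
             "arjun": ("mumbai", 25000), "sweety": ("pune", 54000)}
emails = {"siva": "siva@gmail.com", "raju": "raju@yahoo.com",
          "arjun": "arjun@aol.com", "sweety": "sweety@rediff.com",
          "jatin": "jatin@gmail.com", "sneha": "sneha@hotmail.com"}

def join(kind):
    """Return (name, city, salary, email_id) rows for 'inner', 'left' or 'right'."""
    if kind == "inner":
        names = [n for n in employees if n in emails]
    elif kind == "left":
        names = list(employees)
    else:  # right
        names = list(emails)
    rows = []
    for n in names:
        city, salary = employees.get(n, (None, None))
        # a.name comes from the employee table, so it is None when unmatched.
        rows.append((n if n in employees else None, city, salary, emails.get(n)))
    return rows

for kind in ["inner", "left", "right"]:
    print(kind, len(join(kind)))
```

Every employee has an email address here, so the inner and left outer joins return the same four rows, while the right outer join adds two rows for jatin and sneha with empty employee columns.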

This is how we will work with hive sql joins.
Thank you very much for viewing this.