Learn and shine

Monday, March 21, 2016

Apache Hive Advanced topics

This post describes some more advanced concepts in Hive.
Partitions:
1. Determine how data is stored in HDFS
2. Group the data on the value of some column
3. Can be defined on one or more columns.
How does partitioning work?
Usually table data is stored in HDFS like below
/user/hive/warehouse//
/user/hive/warehouse//
/user/hive/warehouse//
/user/hive/warehouse//

If we know how the data arrives from the source, and which column our WHERE conditions will filter on,
then we can partition the data on that column, like below

/user/hive/warehouse///month=jan/
/user/hive/warehouse///month=feb/
/user/hive/warehouse///month=march/
/user/hive/warehouse///month=april/

Bucketing is also used to improve performance.

What do we mean by Partitions?
1. Partitioning means dividing a table into coarse-grained parts based on the value of a particular column, such as a date.
2. This makes it faster to run queries on slices of the data.
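To make the directory idea concrete, here is a minimal Java sketch of how rows could be grouped into one directory per partition value. It is only an illustration of the layout, not Hive code: the class name, the `sales` table and the `retail.db` path are assumptions made up for this example. Hive names each partition directory in the form column=value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: groups rows by a partition column, mimicking the
// one-directory-per-value layout (column=value) of a partitioned table.
class PartitionLayoutSketch {

    // Hypothetical table location, assumed just for this sketch.
    static final String TABLE_DIR = "/user/hive/warehouse/retail.db/sales";

    // Directory that would hold every row with the given month.
    static String partitionDir(String month) {
        return TABLE_DIR + "/month=" + month;
    }

    // Groups rows of the form {month, amount} by their partition directory.
    static Map<String, List<String>> layout(List<String[]> rows) {
        Map<String, List<String>> dirs = new LinkedHashMap<>();
        for (String[] row : rows) {
            dirs.computeIfAbsent(partitionDir(row[0]), k -> new ArrayList<>()).add(row[1]);
        }
        return dirs;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"jan", "120.0"},
            new String[] {"feb", "75.5"},
            new String[] {"jan", "40.0"});
        // A query filtered on month = 'jan' only has to read one directory.
        layout(rows).forEach((dir, amounts) -> System.out.println(dir + " -> " + amounts));
    }
}
```

The point of the sketch is the lookup: a filter on the partition column maps straight to one directory, so the other directories never need to be scanned.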

Buckets or Clusters
1. Partitions are divided further into buckets based on some other column
2. Used for data sampling.

Buckets:
1. Buckets give extra structure to the data, which may be used for more efficient queries.
2. A join of two tables that are bucketed on the same columns, including the join column, can be implemented as a map-side join (rows are assigned to buckets by hash value).
3. Bucketing by user id means we can quickly evaluate a user-based query by running it on a randomized sample of the total set of users.

Now we will see how to work with partitioning and bucketing
1. First create a table called transaction_records
2. For that, first create a database called retail

Command: to create database

Hive> create database retail;

Command: to use database

Hive> use retail;

Now we need to create a table.

Hive> create table transaction_records(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, City STRING, State STRING, Spendby STRING)
row format delimited fields terminated by ',' stored as textfile;

How to load data into table?

Hive> LOAD DATA LOCAL INPATH '/usr/local/hive_demo/transaction/' INTO TABLE transaction_records;
Hive> select count(*) from transaction_records;

We can try different queries, just like in SQL. Ex:
Aggregation:
1. select category, sum(amount) from transaction_records group by category;
Distinct values:
2. select distinct category from transaction_records;

How to copy table data into another table, a file or HDFS?
1. Insert output into another table

-- if the results table already exists:
Insert overwrite table results select * from transaction_records;
-- or, to create the results table from the query:
Create table results as select * from transaction_records;

2. Insert Output into local file.

Insert overwrite local directory 'results' select * from transaction_records;

3. Inserting output into HDFS

Insert overwrite directory '/results' select * from transaction_records;

How to write all queries in a single script file and execute them?
Hive scripts are used to execute a set of Hive commands collectively. This reduces the time and effort of writing and executing each command manually. Hive supports scripting from version 0.10.0 onwards.
Name the file hive_script.hql and place it wherever you like (here I am keeping it inside /usr/local/hive_demo/).

use retail;
create table transaction_records_script(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, City STRING, State STRING, Spendby STRING)
row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH '/usr/local/hive_demo/transaction/' INTO TABLE transaction_records_script;
select count(*) from transaction_records_script;
select category, sum(amount) from transaction_records group by category;

How to run the Hive script file:
hive -f hive_script.hql
OR
hive -f hive_script.sql (if we named our script file with a .sql extension)

Hive Joins (table joining)
Create a script that creates two tables called employee and email.
Before creating the script we need to create 2 files (emp.txt, email.txt) and fill them with data.
/usr/local/hive_demo/emp.txt

siva,56000,bangalore
raju,67000,chennai
arjun,25000,mumbai
sweety,54000,pune

/usr/local/hive_demo/email.txt

siva,siva@gmail.com
raju,raju@yahoo.com
arjun,arjun@aol.com
sweety,sweety@rediff.com
jatin,jatin@gmail.com
sneha,sneha@hotmail.com

Create a script called hive_join_demo.hql that sets up the tables for the join demo

Use retail;
Create table employee(name string, salary float, city string) row format delimited fields terminated by ',';
Load data local inpath '/usr/local/hive_demo/emp.txt' into table employee;
Create table email(name string, email string) row format delimited fields terminated by ',';
Load data local inpath '/usr/local/hive_demo/email.txt' into table email;

After creating the script, we need to run the hive_join_demo.hql file.
hive -f hive_join_demo.hql
Now we will work with joins:
Inner join

Hive> select a.name, a.city, a.salary, b.email from employee a join email b on a.name = b.name;

It displays the name, city, salary and email for the rows that match in both tables.
Left outer join

Hive> select a.name, a.city, a.salary, b.email from employee a LEFT OUTER join email b on a.name = b.name;

It displays all the records from the first table and the matching records from the second table.
Right outer join

Hive> select a.name, a.city, a.salary, b.email from employee a RIGHT OUTER join email b on a.name = b.name;

It displays all the records from the second table and the matching records from the first table.
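For readers who think more easily in Java than in SQL, the three joins can be sketched in plain Java over the same sample data. This is only an illustration of which rows each join keeps, not how Hive runs a join. The class and method names are made up for this example, the salary column is left out to keep it short, and names are assumed to be unique.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of join semantics; not how Hive executes joins.
class JoinSketch {

    // name -> city, taken from emp.txt
    static Map<String, String> employees() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("siva", "bangalore");
        m.put("raju", "chennai");
        m.put("arjun", "mumbai");
        m.put("sweety", "pune");
        return m;
    }

    // name -> email, taken from email.txt
    static Map<String, String> emails() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("siva", "siva@gmail.com");
        m.put("raju", "raju@yahoo.com");
        m.put("arjun", "arjun@aol.com");
        m.put("sweety", "sweety@rediff.com");
        m.put("jatin", "jatin@gmail.com");
        m.put("sneha", "sneha@hotmail.com");
        return m;
    }

    // Inner join: only names present on both sides.
    static List<String> innerJoin(Map<String, String> emp, Map<String, String> mail) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> e : emp.entrySet()) {
            if (mail.containsKey(e.getKey())) {
                rows.add(e.getKey() + "," + e.getValue() + "," + mail.get(e.getKey()));
            }
        }
        return rows;
    }

    // Left outer join: every employee, NULL where no email matches.
    static List<String> leftOuterJoin(Map<String, String> emp, Map<String, String> mail) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> e : emp.entrySet()) {
            String email = mail.get(e.getKey());
            rows.add(e.getKey() + "," + e.getValue() + "," + (email == null ? "NULL" : email));
        }
        return rows;
    }

    // Right outer join: every email row, NULL for the employee columns when unmatched.
    static List<String> rightOuterJoin(Map<String, String> emp, Map<String, String> mail) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> m : mail.entrySet()) {
            String city = emp.get(m.getKey());
            String left = (city == null) ? "NULL,NULL" : m.getKey() + "," + city;
            rows.add(left + "," + m.getValue());
        }
        return rows;
    }

    public static void main(String[] args) {
        rightOuterJoin(employees(), emails()).forEach(System.out::println);
    }
}
```

With this sample data the inner and left outer joins both give four rows, because every employee has an email. The right outer join gives six rows, and the rows for jatin and sneha carry NULL in the employee columns.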

This is how we will work with hive sql joins.
Thank you very much for viewing this.

Monday, March 7, 2016

Getting started with Apache Hive

This post explains the points below.
1. How to install and configure Hive on Ubuntu.
2. How to create a table using HIVE.
3. How to load local data and HDFS external data.
4. Basic SQL commands usage in Hive

Step 1: Download the latest Hive tar file from the link below
https://hive.apache.org/downloads.html
Command: untar the file using the command below

/usr/local> tar -xvzf  /usr/local/

Step 2: Once the tar extraction has completed, we need to do some configuration before starting Hive.

Command:to edit the bashrc file

sudo gedit  ~/.bashrc

Step 3: Add the below configuration details to the bashrc file

       export HIVE_HOME="/usr/local/apache-hive-1.2.1-bin"
       export PATH=$PATH:$HIVE_HOME/bin
       export HADOOP_USER_CLASSPATH_FIRST=true

Step 4: The HADOOP_USER_CLASSPATH_FIRST line avoids the error "[ERROR] Terminal initialization failed; falling back to unsupported java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected". If you hit this error, make sure the below line of configuration is present.

export HADOOP_USER_CLASSPATH_FIRST=true

Step 5: We need to add configuration in hive-config.sh file.
Command : To add the hadoop home configuration in hive-config.sh

       cd  /usr/local/apache-hive-1.2.1-bin/bin
       sudo gedit hive-config.sh

Add the below configuration in hive-config.sh

       export HADOOP_HOME=/usr/local/hadoop

Step 6: Once the above configuration is complete, we need to start Hive

Type hive in the terminal; it will open the Hive shell for you.

Step 7: This is how we will install and configure HIVE.
Now we are ready to work with HIVE.

Step 8: How to list the databases available in Hive?
Hive> show databases;
Step 9: How to list the tables available in Hive?
Hive> show tables;
Step 10: How to create database in Hive?
Hive> create database cricket;
Step 11: How to use created database?
Hive> use cricket;
Step 12: How to create a table inside cricket database

       Hive> create table matchscore(
                                          match_name string,
                                          match_score int,
                                         match_location string
                                      ) row format delimited fields terminated by ',';

Now we have created the database and the table. We need to verify that they were created.

Open another terminal and go to /usr/local>

Step 13: How to check whether the database was created?
$usr/local> hadoop fs -ls /user/hive/warehouse

Step 14: How to check whether the table was created?

$usr/local> hadoop fs -ls /user/hive/warehouse/cricket.db
Now we have created the database and table successfully and verified the same.
Next we need to insert data into the table.
Let us see how to load data into Hive tables.

First create a file in the local directory /usr/local/hive_demo. If the hive_demo directory does not exist, create it.
Step 15: How to create file?
$usr/local/hive_demo> sudo gedit matchinfo.txt

Once we have created this file, we need to load it into the Hive table. Go to the Hive shell.

Step 16: How to load the data from local system to Hive table

    Hive> LOAD DATA LOCAL INPATH '/usr/local/hive_demo/matchinfo.txt' INTO TABLE matchscore;

Once we have loaded the file, we can check whether it has been copied into the table's directory.
Go to the terminal at /usr/local
Step 17: How to check whether the data file was loaded into the table's directory?
$usr/local> hadoop fs -ls /user/hive/warehouse/cricket.db/matchscore

Step 18: How to verify the data has been loaded into Hive table or not
Hive>select * from matchscore;

This is how we will load the local data into Hive tables.
Now we need to see how to load HDFS data into Hive tables.
We can take the existing file, add more details to it, and save it as the matchinfo_details.txt file

Step 19: Create HDFS directory
$usr/local> hadoop fs -mkdir -p /usr/local/hive_demo/input

Step 20 :How to put a file in HDFS?

$usr/local> hadoop fs -put /usr/local/hive_demo/matchinfo_details.txt /usr/local/hive_demo/input/

Now we have created hdfs directory and added the file into HDFS directory.
Step 21: How we will load data into Hive tables?

    Hive> create EXTERNAL table matchscore_result(
                                                      match_name string,
                                                       match_score int,
                                                        match_location string,
                                                       match_result    string)
                              row format delimited fields terminated by ','
                              LOCATION '/usr/local/hive_demo/input';

We have now defined a Hive table over the external file data.
To check the table data, run select * from matchscore_result from the Hive shell.
The advantage of an external table is that if we modify the file and put the updated file into HDFS again,
there is no need to load the data into Hive again; we can simply run select * from matchscore_result and we will get the updated results.

Step 22: How to describe the table structure?
Hive> describe formatted matchscore;

Step 23: How to rename the existing table?
Hive> alter table matchscore rename to matchscore_altered;

Step 24: How to show the updated table list?
Hive> show tables;

This is how we can install and work with Hive basics.
Thank you for viewing this post.

Sunday, February 28, 2016

Apache Hive Basics

Hive Background

1. Hive started at Facebook.
2. Data was collected by cron jobs every night into an Oracle DB.
3. ETL was done via hand-coded Python
4. Grew from 10s of GBs (2006) to 1 TB/day of new data in 2007, now 10x that

Facebook use case
1. Facebook has more than 1000 million users
2. Data is more than 500 TB per day
3. More than 80k queries per day
4. More than 500 million photos per day.

5. A traditional RDBMS is not the right solution for handling the above.
6. Hadoop MapReduce is one way to solve this.
7. But Facebook developers lacked the Java knowledge needed to code MapReduce jobs.
8. They knew SQL well.
So they introduced Hive.
Hive
1. Tables can be partitioned and bucketed.
Partitioning and bucketing are used for performance
2. Schema flexibility and evolution
3. Easy to plug in custom mapper and reducer code
4. JDBC/ODBC Drivers are available.
5. Hive tables can be directly defined on HDFS
6. Extensible : Types , formats, Functions and scripts.
What do we mean by Hive
1. Data warehousing package built on top of hadoop.
2. Used for Data Analytics
3. Targeted for users comfortable with SQL.
4. It is similar to SQL, and its query language is called HiveQL.
5. It is used for managing and querying structured data.
6. It hides the complexity of Hadoop
7. No need to learn Java and the Hadoop APIs
8. Developed by Facebook and contributed to the community.
9. Facebook analyses terabytes of data using Hive.

Hive can be defined as below
• Hive defines a SQL-like query language called QL
• Data warehouse infrastructure
• Allows programmers to plug in custom mappers and reducers.
• Provides tools to enable easy data ETL
Where to use Hive or Hive Applications?
1. Log processing
2. Data Mining
3. Document Indexing
4. Customer facing business intelligence
5. Predictive modeling and hypothesis testing
Why we go for Hive
1. It offers SQL-like types, and we can provide an explicit schema and types.
2. By using Hive we can partition the data
3. It has its own Thrift server, so we can access data from other places.
4. Hive supports serialization and deserialization
5. DFS is accessed implicitly.
6. It supports joining, ordering and sorting
7. It supports its own shell and Hive scripts
8. It has a web interface
Hive Architecture

1. Hive data is stored in the Hadoop file system.
2. All Hive metadata, such as schema names, table structures and view names, is stored in the Metastore
3. The Hive Driver takes the request, compiles it, converts it into a form Hadoop understands and executes it.
4. The Thrift server lets clients access Hive and fetch data from DFS.

Hive Components

Hive Limitations
1. Not designed for online transaction processing.
2. Does not offer real time queries and row level updates
3. Latency for Hive queries is high (it can take minutes to process)
4. Provides acceptable latency for interactive data browsing
5. It is not suitable for OLTP type applications.
Hive Query Language Abilities

What are the differences between a traditional RDBMS and Hive?
1. Hive does not verify the data when it is loaded; it does so when a query is issued (schema on read).
2. Schema on read makes the initial load very fast. The load operation is just a file copy or move.
3. No updates, transactions or indexes.
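The schema-on-read idea can be sketched in a few lines of Java. This is only an illustration under assumptions: the class name is made up, and the three-column layout (name STRING, score INT, location STRING) is borrowed from the matchscore table used elsewhere on this blog. The raw line is stored untouched, and the types are applied only when the line is read; a field that does not fit its type comes back as null, in the spirit of Hive returning NULL for such values.

```java
// Illustrative sketch of schema on read: nothing is validated at load time,
// the schema is applied to each raw line only when it is read.
class SchemaOnReadSketch {

    // Applies the assumed schema (STRING, INT, STRING) to one comma-delimited line.
    static Object[] readRow(String line) {
        String[] fields = line.split(",", -1);
        Object[] row = new Object[3];
        row[0] = fields.length > 0 ? fields[0] : null;
        row[1] = fields.length > 1 ? parseIntOrNull(fields[1]) : null;
        row[2] = fields.length > 2 ? fields[2] : null;
        return row;
    }

    // Returns null instead of failing when the text is not a valid integer.
    static Integer parseIntOrNull(String text) {
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Both lines "load" fine; the bad score only shows up as null at read time.
        for (String line : new String[] {"final,250,mumbai", "final,n/a,mumbai"}) {
            Object[] row = readRow(line);
            System.out.println(row[0] + " | " + row[1] + " | " + row[2]);
        }
    }
}
```

This is why the load itself is so cheap: it never has to parse a single field.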
Hive support data types

Hive Complex types:
Complex types can be built up from primitive types and other composite types using the below operators.

Operators
1. Structs: It can be accessed using DOT(.) notation
2. Maps: (key-value tuples) Elements can be accessed using the [element-name] notation
3. Arrays: (indexable lists) Elements can be accessed using the [n] notation, where n is a zero-based index into the array.
Hive Data Models
1. Databases
Namespaces, e.g. the finance and inventory databases can each have their own Employee table
2. Tables
Schemas in namespaces
3. Partitions
Determine how data is stored in HDFS
Group the data on the value of some columns
Can be defined on one or more columns
4. Buckets or Clusters
Partitions are divided further into buckets on some other column
Used for data sampling

Hive Data in the order of granularity

Buckets
Buckets give extra structure to the data that may be used for more efficient queries
A join of two tables that are bucketed on the same columns – including the join column can be implemented as Map Side Join
Bucketing by user ID means we can quickly evaluate a user based query by running it on a randomized sample of the total set of users.
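The bucket assignment can be sketched in Java as a hash of the bucketing column taken modulo the number of buckets. This is a simplified illustration, not Hive's implementation: the class and method names are invented for the example, and it assumes the hash of an integer id is the id itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of hash bucketing on a user id column.
class BucketingSketch {

    // Bucket number for a user id: non-negative hash modulo the bucket count.
    static int bucketFor(int userId, int numBuckets) {
        return (Integer.hashCode(userId) & Integer.MAX_VALUE) % numBuckets;
    }

    // Sampling: keep only the users that fall into one chosen bucket.
    static List<Integer> sampleBucket(List<Integer> userIds, int bucket, int numBuckets) {
        List<Integer> sample = new ArrayList<>();
        for (int id : userIds) {
            if (bucketFor(id, numBuckets) == bucket) {
                sample.add(id);
            }
        }
        return sample;
    }

    public static void main(String[] args) {
        List<Integer> users = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
        // Reading one bucket out of four touches roughly a quarter of the users.
        System.out.println(sampleBucket(users, 0, 4));
    }
}
```

Because the same id always lands in the same bucket number, two tables bucketed on the same column into the same number of buckets can be joined bucket by bucket, which is what makes the map-side join possible.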

These are the basics about Hive.

Thank you for viewing the post.

Thursday, February 25, 2016

Clickjacking prevention using X Frame Options and J2EE Filter

1. What is Clickjacking?
It is also known as a User Interface redress attack, UI redress attack or UI redressing.
It is a malicious technique of tricking a Web user into clicking on something different from what the user perceives they are clicking on, thus potentially revealing confidential information or taking control of their computer while clicking on seemingly innocuous web pages. It is a browser security issue that is a vulnerability across a variety of browsers and platforms.
2. How to prevent Clickjacking using a Filter in Java
The below example shows how Clickjacking happens and how we can prevent it.

Here I have created a simple LoginServlet; after a successful login, the page is redirected to a success page.
Most readers know how to create and deploy a servlet, but I am writing out the steps here for those who have not done it before.
Step 1: Start Eclipse
Step 2: Create a Dynamic Web Project -> clickjacking_prevention
Step 3: First we need to create a login.jsp page, under WebContent of the project

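<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
    pageEncoding="ISO-8859-1"%>
<html>
<head>
<title>Login page</title>
</head>
<body>
<form action="loginServlet" method="post">
User Name <input type="text" name="username" />
Password <input type="password" name="password" />
<input type="submit" value="Login" />
</form>
</body>
</html>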

Step 4: Need to create a success page

<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
    pageEncoding="ISO-8859-1"%>
<html>
<head>
<title>Login Success</title>
</head>
<body>
Login Successful
</body>
</html>
You can construct page as you like

Step 5: Now we need to create a LoginServlet

package com.siva;

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class LoginServlet extends HttpServlet{

 /**
  * 
  */
 private static final long serialVersionUID = 1L;

 public void doPost(HttpServletRequest request, HttpServletResponse response)
   throws ServletException, IOException {

  String username = request.getParameter("username");
  String password = request.getParameter("password");
  if("siva".equalsIgnoreCase(username)&& "raju".equalsIgnoreCase(password)){
   System.out.println("inside if condition");
   response.sendRedirect("loginSuccess.jsp");
  }
 }
}

Step 6: Now we need to do Configuration in web.xml for LoginServlet



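<web-app>
  <display-name>clickjacking_prevention</display-name>
  <welcome-file-list>
    <welcome-file>login.jsp</welcome-file>
  </welcome-file-list>
  <servlet>
    <servlet-name>LoginServlet</servlet-name>
    <servlet-class>com.siva.LoginServlet</servlet-class>
  </servlet>
  <servlet-mapping>
    <servlet-name>LoginServlet</servlet-name>
    <url-pattern>/loginServlet</url-pattern>
  </servlet-mapping>
</web-app>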

Step 7: Once this configuration is done, we can run the project on any server such as Apache Tomcat or JBoss.
You can open http://localhost:8080/clickjacking_prevention/

It will open the login page, where you can enter the username siva and the password raju, then submit.
You will be redirected to the loginSuccess page.

Create an HTML file, name it as you like, and paste the below code.



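<html><head><title>click jaking</title></head>
<body><iframe src="http://localhost:8080/clickjacking_prevention/loginSuccess.jsp"></iframe></body></html>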

Once we open this HTML file, we can see the same content that is shown on the loginSuccess page.

Step 10: Comparing the two, the page opened by URL and the page constructed inside the iframe look the same.
So an attacker can use this to embed your actual site inside their own page and steal data.
Now, how do we prevent this?
We need to add this code in our filter or JSP page.
response.addHeader("X-FRAME-OPTIONS", "DENY");
Here I have written a Filter to prevent clickjacking

package com.siva;

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;



public class ClickjackingPreventionFilter implements Filter 
{
  private String mode = "DENY";
  
     // Add the X-FRAME-OPTIONS response header to tell browsers not to display this
     // content in a frame.
     public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
         HttpServletResponse res = (HttpServletResponse)response;
         res.addHeader("X-FRAME-OPTIONS", mode );   
         chain.doFilter(request, response);
     }
     public void destroy() {
     }
     
     public void init(FilterConfig filterConfig) {
         String configMode = filterConfig.getInitParameter("mode");
         if ( configMode != null ) {
             mode = configMode;
         }
     }
}
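The filter above sends whatever string the mode init parameter holds. As a small optional hardening, here is a sketch of a helper that only lets through the two widely supported values, DENY (never show the page in a frame) and SAMEORIGIN (allow framing only by pages from the same origin), and falls back to DENY for anything else. The class and method names are hypothetical and are not part of the servlet API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: normalises the configured X-Frame-Options mode.
class FrameOptionsMode {

    private static final Set<String> ALLOWED =
        new HashSet<>(Arrays.asList("DENY", "SAMEORIGIN"));

    // Returns the configured mode if it is a known value, otherwise DENY.
    static String resolve(String configured) {
        if (configured == null) {
            return "DENY";
        }
        String value = configured.trim().toUpperCase(Locale.ROOT);
        return ALLOWED.contains(value) ? value : "DENY";
    }

    public static void main(String[] args) {
        System.out.println(resolve("sameorigin"));
        System.out.println(resolve("typo-value"));
    }
}
```

One way to use it would be to write mode = FrameOptionsMode.resolve(filterConfig.getInitParameter("mode")); inside init(), so that a typo in web.xml fails safe instead of sending a header value the browser ignores.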

Step 11: Once the Filter is complete, we need to add the filter configuration to the web.xml file


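    <filter>
        <filter-name>ClickjackPreventionFilterDeny</filter-name>
        <filter-class>com.siva.ClickjackingPreventionFilter</filter-class>
        <init-param>
            <param-name>mode</param-name>
            <param-value>DENY</param-value>
        </init-param>
    </filter>
    <filter-mapping>
        <filter-name>ClickjackPreventionFilterDeny</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>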

Once we have done this configuration, run the same iframe example again. The frame is now shown without any content: IE shows a warning, and other browsers simply show nothing inside the frame.

This is how we can prevent the clickjacking attacks.
Thank you for viewing the post.