Step 1: Download Spark (Download Spark from here)
1. Choose a Spark release (whichever version you want to work with)
2. Choose a package type (pre-built for any version of Hadoop)
3. Choose a download type
4. Click on Download Spark.
Step 2: After a successful download, we need to run Spark.
For that, we need to follow a few steps.
1. Install Java 7 and set JAVA_HOME and the PATH in the environment variables.
2. Download a Hadoop version (here I have downloaded Hadoop 2.4).
3. Untar the tar file, set HADOOP_HOME, and update the PATH in the environment variables.
4. If Hadoop is not installed, then download the winutils.exe file and save it on your local system.
(This is needed to work in a Windows environment.)
5. After downloading, set HADOOP_HOME in the environment variables to the directory where our winutils.exe file resides (a sketch of the commands follows below).
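From a command prompt, the variables can be set persistently with setx. This is only a minimal sketch; the install paths below are examples, so substitute your own:
C:\> setx JAVA_HOME "C:\Program Files\Java\jdk1.7.0_79"
C:\> setx HADOOP_HOME "C:\hadoop"
C:\> setx PATH "%PATH%;%JAVA_HOME%\bin;%HADOOP_HOME%\bin"
Note that setx only affects new sessions, so open a fresh command prompt afterwards.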
Step 3: Once everything has been done, we need to check whether Spark is working.
1. Go to the command prompt and run:
C:\> spark-shell
Spark will start with a lot of logs; to avoid the INFO logs we need to change the log level.
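Once the shell is up, a quick sanity check (sc is the SparkContext the shell creates for you):
scala> sc.version // should print the Spark version, e.g. 1.5.0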
Step 4: Go to the conf directory inside the Spark home.
1. Copy log4j.properties.template, paste it in the same location, and rename it to log4j.properties.
2. Edit it and change the INFO level to the ERROR level:
log4j.rootCategory=INFO, console
change to
log4j.rootCategory=ERROR, console
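A minimal way to make the copy from the command prompt (assuming the current directory is the Spark home):
C:\> copy conf\log4j.properties.template conf\log4j.properties
Then open conf\log4j.properties in any text editor and apply the change above.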
Step 5: After changing the log level, if we run spark-shell again from the command prompt, you can see the difference.
1. This is how we can install Spark in a Windows environment.
2. If you are facing any issues while starting Spark:
3. First check the Hadoop home path using the following command:
C:\> echo %HADOOP_HOME%
4. It should print the Hadoop home path where our winutils.exe file is available.
5. Set the permissions for the Hadoop temp folder:
C:\> %HADOOP_HOME%\bin\winutils.exe ls \tmp\hive
C:\> %HADOOP_HOME%\bin\winutils.exe chmod 777 \tmp\hive
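If \tmp\hive does not exist yet, the commands above will fail; you can create it first (assuming Spark runs from the C: drive, so \tmp\hive resolves to C:\tmp\hive):
C:\> mkdir C:\tmp\hive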
Step 6: Now we will try a word count example using Spark, the same task we usually do in Hadoop MapReduce to count the words in a given file.
1. After spark-shell has started, we get two contexts: the SparkContext as sc and the SQLContext as sqlContext.
2. Using the SparkContext sc, we will read the file, do the manipulation, and write the output to a file.
val textFile = sc.textFile("file:///C:/spark/spark-1.5.0-bin-hadoop2.4/README.md")
// Read the first line of the file
textFile.first
// Split each line using space as the delimiter
val tokenizedFileData = textFile.flatMap(line => line.split(" "))
// Prepare counts using map
val countPrep = tokenizedFileData.map(word => (word, 1))
// Sum the counts using reduceByKey
val counts = countPrep.reduceByKey((accumValue, newValue) => accumValue + newValue)
// Sort the key-value pairs by count, descending
val sortedCounts = counts.sortBy(kvPair => kvPair._2, false)
// Save the sorted counts into an output directory called ReadMeWordCount
sortedCounts.saveAsTextFile("file:///C:/spark/ReadMeWordCount")
// If we want counts without the manual map/reduce, countByValue is built in
tokenizedFileData.countByValue
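To peek at the results without writing anything to disk, you can pull a few rows back to the driver; a small sketch:
// Show the five most frequent words and their counts
sortedCounts.take(5).foreach(println)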
Step 7: A few more commands to save the output file to the local system.
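saveAsTextFile writes one part file per partition; if you want a single output file, merge the partitions first. A minimal sketch (the ReadMeWordCountSingle path is just an example):
// coalesce(1) merges everything into one partition, so a single part file is written
sortedCounts.coalesce(1).saveAsTextFile("file:///C:/spark/ReadMeWordCountSingle")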
Step 8: The output will be stored as part files inside the output directory, as shown below.
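For example, listing the output directory should show a _SUCCESS marker plus one part file per partition (the exact number depends on how many partitions the RDD had):
C:\> dir C:\spark\ReadMeWordCount
_SUCCESS
part-00000
part-00001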
Thank you very much for viewing this post.