Apache Spark is an in-memory computation framework in the Hadoop ecosystem. Apache Spark allows developers to write application code in Scala, Python, R and Java. The main agenda of this post is to write a Spark application in Scala and deploy it using SBT (Scala Build Tool).
Prerequisite:- Apache Spark and Scala should be installed. Here I am using Spark 1.5.2 and Scala 2.10.6. First we will install SBT, then configure the assembly plugin required for the build, and then create a sample Spark application. An internet connection is mandatory when packaging the project for the first time.
How to check whether Spark and Scala are set up:
zytham@ubuntu:~$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

Note:- If SPARK_HOME/bin is not in the PATH variable, go to SPARK_HOME/bin and execute the spark-shell command there. If you do not get the prompt, first install Scala and Apache Spark, then follow this tutorial.
1. SBT installation:-
SBT is an open source build tool for Scala and Java projects, similar to Java's Maven or Ant. SBT is the de facto build tool for the Scala community. Execute the following command to download the SBT tarball and extract it.
zytham@ubuntu:~$ wget https://dl.bintray.com/sbt/native-packages/sbt/0.13.8/sbt-0.13.8.tgz
.....
Length: 1059183 (1.0M) [application/unknown]
Saving to: ‘sbt-0.13.8.tgz’
100%[======================================>] 10,59,183 17.0KB/s in 26s
2016-01-09 21:49:11 (39.5 KB/s) - ‘sbt-0.13.8.tgz’ saved [1059183/1059183]

Extract the tarball using the following command.
zytham@ubuntu:~$ tar -xvf sbt-0.13.8.tgz
sbt/
sbt/conf/
sbt/conf/sbtconfig.txt
sbt/conf/sbtopts
sbt/bin/
sbt/bin/sbt.bat
sbt/bin/sbt
sbt/bin/sbt-launch.jar
sbt/bin/sbt-launch-lib.bash

Move the extracted files to some location and verify that all SBT files are in place.
zytham@ubuntu:~$ sudo mv sbt /opt
zytham@ubuntu:~$ cd /opt/
zytham@ubuntu:/opt$ ls
data                drill    eclipse.desktop  sbt        spark        zookeeper
datastax-ddc-3.2.1  eclipse  gnuplot-5.0.1    scala2.10  spark-1.5.2

In order to create and build projects from any directory using sbt, we need to add the sbt executable to the PATH shell variable. Add the sbt bin directory to the PATH variable in the .bashrc file.
zytham@ubuntu:/opt/spark-1.5.2/bin$ gedit ~/.bashrc

Add these two lines at the end of the file.
export SBT_HOME=/opt/sbt/
export PATH=$SBT_HOME/bin:$PATH
2. Install sbt assembly plugin:-
sbt-assembly is an sbt plugin that creates a JAR out of our project with all of its dependencies, except the Hadoop and Spark dependencies (these are termed provided dependencies and are supplied by the cluster itself at runtime). SBT manages a plugin definition file, and we need to make an entry in that file for any new plugin (similar to pom.xml in Maven).
There are two ways to add sbt-assembly to the plugin definition file (using an existing file, or creating one if it doesn't exist). We can use either:
- the global file (for version 0.13 and up) at ~/.sbt/0.13/plugins/plugins.sbt
OR - the project-specific file at PROJECT_HOME/project/plugins.sbt
zytham@ubuntu:/opt$ mkdir -p ~/.sbt/0.13/plugins
zytham@ubuntu:/opt$ cat >> ~/.sbt/0.13/plugins/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
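As background on the provided scope mentioned above: in an sbt build definition, a dependency is kept out of the assembled fat JAR by appending the "provided" qualifier. A minimal build.sbt sketch for illustration only (the build file created later in this post deliberately omits the qualifier, since we package with sbt package rather than sbt assembly):

// build.sbt sketch: mark Spark as a provided dependency for sbt-assembly.
// The cluster supplies these classes at runtime, so they stay out of the fat JAR.
name := "WordCount Spark Application"

version := "1.0"

scalaVersion := "2.10.6"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"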
3. Creating sample Spark application:- Word count example
Load an input file and create an RDD, then count all words. The program below writes the (word, count) pairs to a text file; they can also be displayed on the console using the collect() method, as shown after the main listing.
- Create a project directory named "WordCountExample", followed by the directory structure src/main/scala/ inside it.
zytham@ubuntu:~$ mkdir WordCountExample
zytham@ubuntu:~$ cd WordCountExample/
zytham@ubuntu:~/WordCountExample$ mkdir -p src/main/scala
- Create a Scala file with the following code lines.
zytham@ubuntu:~/WordCountExample$ cd src/main/scala
zytham@ubuntu:~/WordCountExample/src/main/scala$ gedit Wordcount.scala
Copy the below sample code lines into Wordcount.scala.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions

object WordCount {
  def main(args: Array[String]) = {
    // Start the Spark context
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("local")
    val sc = new SparkContext(conf)

    // Read some example file to a test RDD
    val test = sc.textFile("input.txt")

    test.flatMap { line => // for each line
      line.split(" ") // split the line word by word
    }
      .map { word => // for each word
        (word, 1) // return a key/value tuple, with the word as key and 1 as value
      }
      .reduceByKey(_ + _) // sum all of the values with the same key
      .saveAsTextFile("output.txt") // save to a text file

    // Stop the Spark context
    sc.stop
  }
}
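The listing above saves the counts to a text file. As mentioned earlier, the result can instead be displayed on the console with collect(); a minimal sketch of that variant (replacing the saveAsTextFile call) is:

val counts = test.flatMap { line => line.split(" ") }
  .map { word => (word, 1) }
  .reduceByKey(_ + _)

// collect() brings the whole result to the driver, so use it only for small outputs.
counts.collect().foreach(println)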
- In the project home directory, create a .sbt configuration file with the following lines.
zytham@ubuntu:~/WordCountExample/src/main/scala$ cd ~/WordCountExample/
zytham@ubuntu:~/WordCountExample$ gedit WordcountExample.sbt
Configuration file lines
name := "WordCount Spark Application" version := "1.0" scalaVersion := "2.10.6" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"
- View the project directory structure and files.
zytham@ubuntu:~/WordCountExample$ find .
.
./WordcountExample.sbt
./src
./src/main
./src/main/scala
./src/main/scala/Wordcount.scala
4. Build project and create jar file:-
Execute "sbt package" from the project home directory.

zytham@ubuntu:~/WordCountExample$ sbt package
[info] Loading global plugins from /home/zytham/.sbt/0.13/plugins
.....
[info] Compiling 1 Scala source to /home/zytham/WordCountExample/target/scala-2.10/classes...
[info] Packaging /home/zytham/WordCountExample/target/scala-2.10/wordcount-spark-application_2.10-1.0.jar ...
[info] Done packaging.
[success] Total time: 101 s, completed Jan 31, 2016 11:42:25 AM

Note:- It may take some time, since it downloads some jar files, and an internet connection is mandatory. On a successful build, a jar file (wordcount-spark-application_2.10-1.0.jar) is created at the location "<Project_home>/target/scala-2.10". (The name of the directory and jar file might be different, depending on what we have configured in the configuration file WordcountExample.sbt.)
5. Deploy generated jar/Submit job to spark cluster:-
The spark-submit executable (present in <SPARK_HOME>/bin) is used to submit a job to the Spark cluster. Use the following command. Download the input file from here and place it in the project home directory.
zytham@ubuntu:~/WordCountExample$ spark-submit --class "WordCount" --master local[2] target/scala-2.10/wordcount-spark-application_2.10-1.0.jar

On successful execution, an output directory is created with the name "output.txt", and the file part-00000 contains the (word, count) pairs. Execute the following commands to see the output and verify the same.
zytham@ubuntu:~/WordCountExample$ cd output.txt/
zytham@ubuntu:~/WordCountExample/output.txt$ ls
part-00000  _SUCCESS
zytham@ubuntu:~/WordCountExample/output.txt$ head -10 part-00000
(spark,2)
(is,1)
(Learn,1)
(This,1)
(time,1)
Explanation of word count example:- On applying the flatMap function on the RDD test, each line is split on spaces and an array of strings is obtained. map then converts this array so that each word becomes a key with 1 as its value (a collection of tuples is produced). Finally, reduceByKey is applied to the tuples, and the aggregated output (each unique word and its corresponding count) is written to a file. Let's take an example and understand the flow of the methods used in the above program unit. Suppose input.txt has two lines:
This is spark time
Learn spark
[Figure: Flow of methods used in word count example]
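To make that flow concrete, here is a sketch of the same three steps on plain Scala collections, using the two input lines above (Spark applies them per partition of the RDD; the expected values are shown in comments):

val lines = Seq("This is spark time", "Learn spark")

// flatMap: split each line on spaces
val words = lines.flatMap { line => line.split(" ") }
// Seq(This, is, spark, time, Learn, spark)

// map: pair each word with 1
val pairs = words.map { word => (word, 1) }
// Seq((This,1), (is,1), (spark,1), (time,1), (Learn,1), (spark,1))

// reduceByKey in Spark sums the values per key; on plain collections,
// groupBy followed by summing the values gives the same result
val counts = pairs.groupBy(_._1).mapValues(_.map(_._2).sum)
// Map(spark -> 2, is -> 1, Learn -> 1, This -> 1, time -> 1)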
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")and using "sbt eclipse" command instead of "sbt package" eclipse project can be created.
zytham@ubuntu:~/WordCountExample$ sbt eclipse
[info] Loading global plugins from /home/zytham/.sbt/0.13/plugins
[info] Set current project to WordCount Spark Application (in build file:/home/zytham/WordCountExample/)
[info] About to create Eclipse project files for your project(s).
[info] Successfully created Eclipse project files for project(s):
[info] WordCount Spark Application

Now, in the Scala IDE, we can import this Spark application and execute it from there too.
Download Scala IDE:-
Execute the following commands to download and extract the tarball.
zytham@ubuntu:~/Downloads$ wget http://downloads.typesafe.com/scalaide-pack/4.1.1-vfinal-luna-211-20150728/scala-SDK-4.1.1-vfinal-2.11-linux.gtk.x86_64.tar.gz
zytham@ubuntu:~/Downloads$ tar -xvf scala-SDK-4.1.1-vfinal-2.11-linux.gtk.x86_64.tar.gz

For running the Eclipse IDE, execute the following command from the directory where it has been extracted.
zytham@ubuntu:~/Downloads$ ./eclipse/eclipse