Apache Spark is a general-purpose cluster computing system for processing big data workloads. Spark can be used with Hadoop HDFS, Amazon EC2, and other persistent storage systems, including the local file system. For learning Apache Spark, the easiest approach is to set it up in standalone mode and start executing Spark APIs in the Scala, Python, or R shell. In this post we will set up Spark and execute some Spark APIs.
Download Apache Spark:-
Download the pre-built version of Apache Spark and unzip it in some directory. I have placed it in the following location: E:\spark-1.5.2-bin-hadoop2.6.
Note:- It is also possible to download the source code and build it using Maven or SBT. Refer this for other download options.
Download and install Scala:-
Download the Scala executables and install them. Scala is a prerequisite for working with Apache Spark, since Spark is written in Scala. Scala is installed at "C:\Program Files (x86)\scala".
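To verify the installation, you can open a new command prompt and check the version:
scala -version
It should print the installed Scala version and exit.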
Set up SCALA_HOME and HADOOP_HOME:-
Once we are done with the installation of Spark and Scala, configure the SCALA_HOME and HADOOP_HOME environment variables.
SCALA_HOME = C:\Program Files (x86)\scala
As of now we do not want to get into Hadoop; we just want to learn Apache Spark. So we download winutils.exe, unzip it, and set HADOOP_HOME to the directory that contains its bin folder (the path up to, but not including, bin).
HADOOP_HOME = E:\dev\hadoop\hadoop-common-2.2.0-bin-master
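For reference, both variables can be set persistently from cmd with setx (a sketch using the paths from this post; adjust them to your own locations, and note that setx changes only take effect in newly opened command prompts):
setx SCALA_HOME "C:\Program Files (x86)\scala"
setx HADOOP_HOME "E:\dev\hadoop\hadoop-common-2.2.0-bin-master"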
Update PATH environment variable:-
Add the Spark bin directory to the PATH environment variable so that the Scala or Python shell can be started without navigating to the bin directory every time.
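This is most safely done through the Environment Variables dialog, but as a quick sketch (assuming the Spark location used above) it can also be appended from cmd; be aware that setx written this way folds the combined system and user PATH into the user PATH, so prefer the dialog on a shared machine:
setx PATH "%PATH%;E:\spark-1.5.2-bin-hadoop2.6\bin"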
Start Spark's shells (Scala or Python version):-
Python version: pyspark
Scala version: spark-shell
Start cmd, type pyspark, and press Enter. If we have followed the steps properly, it should open the Python version of the Spark shell, as shown below.
Similarly, we can start the Scala version of the Spark shell by typing spark-shell and pressing Enter in cmd.
Note:- Here we will get some error on the console regarding write permission on the Hive scratch directory; we can ignore it, start executing Spark APIs, and learn Apache Spark.
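If you would rather clear the error than ignore it, a commonly reported fix on Windows (assuming the message refers to the /tmp/hive scratch directory) is to grant that directory write permission using the winutils.exe downloaded earlier, creating the directory first if it does not exist:
E:\dev\hadoop\hadoop-common-2.2.0-bin-master\bin\winutils.exe chmod -R 777 \tmp\hive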
Sample API execution in the Python or Scala shell:-
Create an RDD, display the total number of lines in the file, followed by the first line of that file.
In the Python version of the Spark shell:
>>> lines = sc.textFile("README.md") # Create an RDD called lines
>>> lines.count() # Count the number of items in this RDD
98
>>> lines.first() # First item in this RDD, i.e. first line of README.md
u'# Apache Spark'
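The equivalent steps in the Scala version of the Spark shell look like the sketch below (the count and first line shown are the ones from the Python run above; yours will depend on your README.md):
scala> val lines = sc.textFile("README.md") // Create an RDD called lines
scala> lines.count() // Count the number of items in this RDD
res0: Long = 98
scala> lines.first() // First item in this RDD, i.e. first line of README.md
res1: String = # Apache Spark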
Note:- If you execute the same set of commands, the console will be flooded with log lines. I have suppressed them by changing the log level to WARN.
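One way to do this (assuming Spark 1.4 or later, where the Spark context exposes setLogLevel) is to run the following in either shell:
sc.setLogLevel("WARN")
Alternatively, copy conf/log4j.properties.template to conf/log4j.properties inside the Spark directory and change log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console.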