There are various Hadoop installation guides spread over the internet; I initially struggled to install Hadoop because no single post was sufficient for a successful installation. I referred to several of them and compiled this post with all the minute details required for installation, for my own reference and for others too.
Hadoop installation as a single node cluster can be divided into the following subtasks. The first three subtasks are prerequisites for the Hadoop installation.
1. Java installation - I struggled a bit here: many places instruct you to install Sun Java 6, however it is no longer possible to install Sun Java 6 or 7 because Oracle (Sun) Java 6 can no longer be distributed by Ubuntu due to license issues. We have two options now: install OpenJDK or Oracle JDK. I preferred Oracle JDK because it is less buggy than OpenJDK and it is free too; we just need to install it manually.
2. Create a dedicated Hadoop user:- In order to separate the Hadoop processes from normal users, it is recommended to create a new user with limited access and do the Hadoop installation under it. Unix is known for giving only as much access as required.
3. SSH configuration and setup:- Hadoop requires SSH for managing its nodes and communicating with them, so we need to configure SSH.
4. Apache Hadoop distribution setup:- Download the Hadoop package and set up the various configuration files that drive the cluster (core, HDFS, MapReduce and YARN).
Let's start with Java installation, followed by the other subtasks. By the end of this post I believe we will be in a position to run "the most famous word count sample program" on our single node Hadoop cluster.
Java installation in Ubuntu:-
Java installation in Ubuntu has been discussed separately in the post Oracle JDK installation in Ubuntu. Follow that post and install Java on your system. If Java is installed correctly, executing the command "java -version" should display output like the following (the JDK version and build may differ):
zytham@ubuntu:~$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
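Since the Oracle JDK was installed manually, it is also worth confirming where the JDK actually lives, because the same path is reused as JAVA_HOME later in this guide. A minimal check, assuming the JDK was unpacked to /usr/local/java/jdk1.8.0_60 (adjust the path if yours differs):
zytham@ubuntu:~$ ls -d /usr/local/java/jdk1.8.0_60
/usr/local/java/jdk1.8.0_60
zytham@ubuntu:~$ export JAVA_HOME=/usr/local/java/jdk1.8.0_60
The same value is added permanently to ~/.bashrc in the configuration section below.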
Create a dedicated Hadoop user:-
It is recommended to use a dedicated Hadoop user account for running Hadoop because the Hadoop installation can then be segregated from other software applications and user accounts running on the same machine. (Author Tom White in "Hadoop: The Definitive Guide" advocates that it is good practice to create a separate UNIX user for each set of Hadoop processes and services - HDFS, MapReduce, YARN.)
Here we will create a new user and run all services in that user's context. First we create a group and then add a new user to that group. Follow the sequence of commands below. Remember, in order to create a user and group we need to be a privileged user.
zytham@ubuntu:~$ sudo addgroup hadoop
[sudo] password for zytham:
Adding group `hadoop' (GID 1001) ...
Done.
zytham@ubuntu:~$ sudo adduser --ingroup hadoop hduser1
Adding user `hduser1' ...
Adding new user `hduser1' (1001) with group `hadoop' ...
Creating home directory `/home/hduser1' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser1
Enter the new value, or press ENTER for the default
Full Name []: HDUSER1
Room Number []: 2
Work Phone []: NA
Home Phone []: NA
Other []: NA
Is the information correct? [Y/n] Y
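As a quick sanity check (not part of the original transcript), confirm that the new user and group exist and that hduser1 belongs to the hadoop group; the exact uid/gid values may differ on your machine:
zytham@ubuntu:~$ id hduser1
uid=1001(hduser1) gid=1001(hadoop) groups=1001(hadoop)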
SSH configuration and setting up SSH certificates:-
Hadoop control scripts require SSH to manage nodes and perform cluster-wide operations. Cluster operations can be performed without SSH (using a dedicated shell or a dedicated Hadoop application); however, with SSH (by generating a public/private key pair and storing it in a file system that is shared across the cluster) password-less login for the hdfs and yarn users can be provided seamlessly. Since we have a single node setup, we have to configure SSH access to localhost for the hduser1 user (created in the previous section). SSH consists of two components - the SSH client and the SSH daemon:
ssh : The command we use to connect to remote machines - the client.
sshd : The daemon that is running on the server and allows clients to connect to the server.
SSH installation:-
In order for the sshd daemon to work, we need to install SSH. Generally ssh is already available in Ubuntu; we just need to install the packages. Execute the following commands to install ssh.
zytham@ubuntu:~$ sudo apt-get install ssh
Reading package lists... Done
Building dependency tree
......
Package 'ssh' has no installation candidate
zytham@ubuntu:~$ sudo apt-add-repository "deb http://archive.ubuntu.com/ubuntu precise main restricted"
zytham@ubuntu:~$ sudo apt-get update
zytham@ubuntu:~$ sudo apt-get install openssh-client=1:5.9p1-5ubuntu1
......
Setting up openssh-client (1:5.9p1-5ubuntu1) ...
Installing new version of config file /etc/ssh/moduli ...
zytham@ubuntu:~$ sudo apt-get install openssh-server
Reading package lists... Done
.....
ssh start/running, process 12267
Setting up ssh-import-id (2.10-0ubuntu1) ...
Processing triggers for ureadahead ...
Processing triggers for ufw ...
Verify that ssh and sshd are in place by executing the following commands:
zytham@ubuntu:~$ which sshd
/usr/sbin/sshd
zytham@ubuntu:~$ which ssh
/usr/bin/ssh
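Optionally, confirm the SSH daemon is actually running before continuing. On Ubuntu releases of this era the service is managed by Upstart, so the following should report the running process (the process id will differ):
zytham@ubuntu:~$ sudo service ssh status
ssh start/running, process 12267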
Certificate generation:-
An RSA key pair (public and private key) is generated and the public key is added to the authorized keys file (by doing so, we enable SSH access to our local machine with this newly created key). Execute the following commands to generate the public/private key pair using the RSA algorithm. First switch to hduser1 and then generate the keys with an empty passphrase (the "" at the end of the second command).
zytham@ubuntu:~$ su hduser1
password:
hduser1@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser1/.ssh/id_rsa):
Your identification has been saved in /home/hduser1/.ssh/id_rsa.
Your public key has been saved in /home/hduser1/.ssh/id_rsa.pub.
The key fingerprint is:
43:8b:a7:70:fc:d8:58:6c:56:a4:b6:58:4f:67:d1:3e hduser1@ubuntu
The key's randomart image is:
+--[ RSA 2048]----+
| . .. |
| o .. |
| = o o. |
| . B * o E |
| . = S . . |
| o @ . |
| + o |
| |
| |
+-----------------+
Now we need to add the newly created public key to the list of authorized keys so that Hadoop can use ssh without prompting for a password. Execute the following commands for that:
hduser1@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
hduser1@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is 56:28:c8:c1:22:af:05:75:df:25:3a:89:6c:e1:72:b4.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
hduser1@localhost's password:
Welcome to Ubuntu 13.04 (GNU/Linux 3.8.0-19-generic x86_64)
.......
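If ssh localhost still prompts for a password (as it does in the transcript above), the most common cause is overly permissive permissions on the .ssh directory or the authorized_keys file. A minimal fix, assuming the default key locations used above:
hduser1@ubuntu:~$ chmod 700 $HOME/.ssh
hduser1@ubuntu:~$ chmod 600 $HOME/.ssh/authorized_keys
After fixing the permissions, ssh localhost should log in without asking for a password.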
For deeper troubleshooting of the SSH connection, run the client in verbose mode:
zytham@ubuntu:~$ ssh -vvv localhost
Hadoop distribution set-up/Installation :-
We have fulfilled all the prerequisites required for Hadoop installation and are in a position to install Hadoop. Hadoop installation is all about:
1. Downloading the Hadoop distribution package (for example, hadoop-2.6.1.tar.gz), and
2. Configuring the various xml files which drive the Hadoop cluster's functioning.
Download the Hadoop distribution package:- Download the Hadoop distribution from download hadoop core dist; I have downloaded hadoop-2.6.1.tar.gz for this installation. Another way to download the Hadoop distribution is by executing the following command:
zytham@ubuntu:~$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.1/hadoop-2.6.1.tar.gz
zytham@ubuntu:~$ cd Downloads
zytham@ubuntu:~/Downloads$ sudo tar -xvzf hadoop-2.6.1.tar.gz
zytham@ubuntu:~/Downloads$ sudo mv ./hadoop-2.6.1 /usr/local/hadoop2.6.1
zytham@ubuntu:~/Downloads$ cd /usr/local/
zytham@ubuntu:/usr/local$ ls
zytham@ubuntu:/usr/local$ sudo chown -R hduser1:hadoop /usr/local/hadoop2.6.1/
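A quick listing (not part of the original transcript) confirms the ownership change took effect; the owner and group in the output should read hduser1 and hadoop:
zytham@ubuntu:/usr/local$ ls -ld /usr/local/hadoop2.6.1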
Set up configuration files:- In order to complete the Hadoop installation, we need to update ~/.bashrc (of Ubuntu) and four Hadoop configuration files. Let's start with bashrc, followed by the Hadoop xml files.
Open bashrc and append the following entries at the end of the file. Execute the following command to open it.
zytham@ubuntu:/usr/local$ gedit ~/.bashrc
#added for hadoop installation
#HADOOP VARIABLES START
export JAVA_HOME=/usr/local/java/jdk1.8.0_60
export HADOOP_INSTALL=/usr/local/hadoop2.6.1
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
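After saving ~/.bashrc, reload it in the current shell and confirm that the hadoop command resolves from the new PATH (assuming JAVA_HOME above points at a real JDK); the first line of the output should match the downloaded distribution:
zytham@ubuntu:/usr/local$ source ~/.bashrc
zytham@ubuntu:/usr/local$ hadoop version
Hadoop 2.6.1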
Now update hadoop configuration files.
1. Update hadoop-env.sh: Open hadoop-env.sh and update JAVA_HOME. This JAVA_HOME setting is referenced by Hadoop when it starts.
zytham@ubuntu:~$ cd /usr/local/hadoop2.6.1/etc/hadoop
zytham@ubuntu:/usr/local/hadoop2.6.1/etc/hadoop$ gedit hadoop-env.sh
# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/local/java/jdk1.8.0_60
Update core-site.xml:- Create a directory that Hadoop will use to store data blocks and assign hduser1 as the owner of this directory. In core-site.xml, the path of this newly created directory is configured. Execute the following commands and update core-site.xml.
zytham@ubuntu:~$ sudo mkdir -p /app/hadoop2.6.1/tmp
zytham@ubuntu:~$ sudo chown hduser1:hadoop /app/hadoop2.6.1/tmp
zytham@ubuntu:~$ sudo gedit /usr/local/hadoop2.6.1/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop2.6.1/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
Update mapred-site.xml:- We need to create mapred-site.xml from the mapred-site.xml.template present in the /usr/local/hadoop2.6.1/etc/hadoop/ directory. Execute the following commands to create and edit mapred-site.xml.
zytham@ubuntu:~$ cd /usr/local/hadoop2.6.1/etc/hadoop/
zytham@ubuntu:/usr/local/hadoop2.6.1/etc/hadoop$ sudo cp mapred-site.xml.template mapred-site.xml
zytham@ubuntu:/usr/local/hadoop2.6.1/etc/hadoop$ sudo gedit mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
Update hdfs-site.xml:- hdfs-site.xml stores information about the hosts in the cluster (namenodes and datanodes). In other words, it specifies the directories which will be used as the namenode and the datanode storage on that host. Since we are doing a single node setup, we need to create two directories, one for the namenode and one for the datanode, and transfer their ownership to hduser1 and the hadoop group.
zytham@ubuntu:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
zytham@ubuntu:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
zytham@ubuntu:~$ sudo chown -R hduser1:hadoop /usr/local/hadoop_store
zytham@ubuntu:~$ sudo gedit /usr/local/hadoop2.6.1/etc/hadoop/hdfs-site.xml
Use the following to update the <configuration> node:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
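Before formatting, it is worth confirming that the storage directories created above exist and are owned by hduser1 and the hadoop group:
zytham@ubuntu:~$ ls -ld /usr/local/hadoop_store/hdfs/namenode /usr/local/hadoop_store/hdfs/datanode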
Format the HDFS file system using the following commands.
Note:- Formatting is a one-time activity; if we format the file system again later, all data stored until then will be lost.
hduser1@ubuntu:~$ su hduser1
password:
hduser1@ubuntu:~$ cd /usr/local/hadoop2.6.1/bin
hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
15/10/04 03:33:34 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.6.1
.....
.......
15/10/04 03:33:34 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
1. Use ./ if you are running the command from inside the bin directory; otherwise it gives the error: hadoop: command not found
2. If any error occurs, we can go to the logs directory (/usr/local/hadoop2.6.1/logs) and inspect "hadoop-hduser1-namenode-ubuntu.log" as shown below, and the rest Google knows :)
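For example, the tail of the NameNode log usually shows the exception that prevented startup (file name as noted above; adjust if your hostname differs):
hduser1@ubuntu:~$ tail -n 50 /usr/local/hadoop2.6.1/logs/hadoop-hduser1-namenode-ubuntu.log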
Start services:- The setup is done; time to test our Hadoop installation.
Now start all the services of the pseudo-distributed Hadoop cluster (a single node, which is why it is called a pseudo cluster) using the following command (or start them individually with start-dfs.sh and start-yarn.sh). For the time being we will use one command to start all services. Go to the sbin directory inside the Hadoop installation and execute start-all.sh.
hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ cd /usr/local/hadoop2.6.1/sbin
hduser1@ubuntu:/usr/local/hadoop2.6.1/sbin$ ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
15/10/04 03:41:25 WARN util.NativeCodeLoader:
Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode,
logging to /usr/local/hadoop2.6.1/logs/hadoop-hduser1-namenode-ubuntu.out
localhost: starting datanode,
logging to /usr/local/hadoop2.6.1/logs/hadoop-hduser1-datanode-ubuntu.out
Starting secondary namenodes [0.0.0.0]
.....
starting yarn daemons
starting resourcemanager,
logging to /usr/local/hadoop2.6.1/logs/yarn-hduser1-resourcemanager-ubuntu.out
localhost: starting nodemanager,
logging to /usr/local/hadoop2.6.1/logs/yarn-hduser1-nodemanager-ubuntu.out
hduser1@ubuntu:/usr/local/hadoop2.6.1/sbin$ netstat -plten | grep java
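Besides netstat, the jps utility that ships with the JDK is a convenient way to confirm that all five daemons came up. On a healthy single node setup it lists something like the following (process ids will differ):
hduser1@ubuntu:/usr/local/hadoop2.6.1/sbin$ jps
12387 NameNode
12514 DataNode
12698 SecondaryNameNode
12851 ResourceManager
12983 NodeManager
13120 Jps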
Stop services:- If services can be started, there should be some way to stop them. Executing ./stop-all.sh stops all services in one shot (or use stop-dfs.sh and stop-yarn.sh to stop the daemons running on our machine individually).
Do not execute it now; believe me, it works. We will execute the stop-all command after viewing the Hadoop web user interface and verifying that the daemons are running. Use the following command to stop all services.
hduser1@ubuntu:/usr/local/hadoop2.6.1/sbin$ ./stop-all.sh
Hadoop Web interfaces:-
Open your browser and hit the following URLs in different tabs:
http://localhost:50070 (NameNode) http://localhost:50075 (DataNode) http://localhost:50090 (Secondary NameNode)
http://localhost:50070 will display the NameNode information.
Please note, localhost:54310 is not a randomly assigned port. While configuring core-site.xml, we assigned the value "hdfs://localhost:54310" to the property "fs.default.name", and that is the address reported on the NameNode overview page.
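The same check can be done from the command line; if the NameNode web UI is up, the following should print an HTTP 200 status (assuming curl is installed):
hduser1@ubuntu:~$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070
200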
Now, stop all the services by executing the following commands and refresh the tabs opened earlier; they should stop displaying the namenode, datanode and secondary namenode information.
hduser1@ubuntu:/usr/local/hadoop2.6.1/sbin$ ./stop-dfs.sh
15/10/04 05:02:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library
for your platform...
using builtin-java classes where applicable
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
15/10/04 05:03:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library
for your platform... using builtin-java classes where
applicable
hduser1@ubuntu:/usr/local/hadoop2.6.1/sbin$ ./stop-yarn.sh
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
After the daemons are stopped, refreshing http://localhost:50070 gives the browser error:
Unable to connect
Firefox can't establish a connection to the server at localhost:50070.
This is all about Apache Hadoop 2.6.1 installation in Ubuntu 13.04.
We are not done yet!! Only the installation is complete. In the next post we will run the world-famous "MapReduce word count" sample program on our single node pseudo cluster and view the processed output.
References:-
1. http://www.michael-noll.com/
2. Hadoop: The Definitive Guide - Tom White