Hadoop Installation/Configuration Guide for Ubuntu Linux With Wordcount Example

In this tutorial I am going to guide you through setting up a Hadoop environment on Ubuntu. You will need:

  • Ubuntu 19.04 (any stable release)
  • Java 8 (any stable release)
  • Hadoop 2.9.2 (any stable release)

Download Link

Note – You can also download and install Java and Hadoop through terminal commands.

Step 1 – Add Hadoop Group and User (Optional)

Create a dedicated (non-root) user account for Hadoop and add it to the sudo group:

Command: sudo adduser hduser

Command: sudo adduser hduser sudo

After the user is created, log out and log back into Ubuntu as hduser.

Step 2 – Installing Java 8 and Hadoop 2.9.2 on Linux machine

First, download Java 8 and Hadoop 2.9.2 from the download link given above and move the downloaded files to your Ubuntu HOME directory.

  1. hadoop-2.9.2.tar.gz
  2. jre-8u221-linux-x64.tar.gz

Once these two files are in your Ubuntu HOME directory, extract them into the current directory by following these steps.

Open a Terminal and type the following commands:

Command: tar -xvf jre-8u221-linux-x64.tar.gz

Command: tar -xvf hadoop-2.9.2.tar.gz

Step 3 – Setup Hadoop and JAVA Environment Variables

Open the .bashrc file.

Command: nano ~/.bashrc

or

Command: sudo gedit ~/.bashrc

Now, add Hadoop and Java Path as shown below.

#Hadoop variables

export HADOOP_HOME=/home/your-username/hadoop-2.9.2








#Java variables

export JAVA_HOME=/home/your-username/jre1.8.0_221

Replace your-username with your actual username, then save the file and close it.
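Only the two HOME exports survived formatting above. A fuller set commonly used with this layout is sketched below; the PATH and *_HOME additions are my assumption based on typical Hadoop 2.x setups, not necessarily what the original tutorial listed:

```shell
#Hadoop variables
export HADOOP_HOME=/home/your-username/hadoop-2.9.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME

#Java variables
export JAVA_HOME=/home/your-username/jre1.8.0_221
export PATH=$PATH:$JAVA_HOME/bin
```

Adding both bin and sbin to PATH is what lets you later run hadoop, hdfs, and the start-*.sh scripts from any directory.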

Note – Use the whoami command to display your username.

To apply these changes in the current Terminal session, execute the source command.

Command: source ~/.bashrc

Now edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set JAVA_HOME environment variable.

Change the Java path to match where Java is installed on your system.

Command: cd $HADOOP_HOME/etc/hadoop/

Command: nano hadoop-env.sh

or

Command: sudo gedit hadoop-env.sh

In the file, add the following line and save it:

export JAVA_HOME=/home/your-username/jre1.8.0_221

Step 4 – Verifying Java and Hadoop Installation

To make sure that Java and Hadoop have been properly installed on your system and can be accessed through the Terminal, execute the java -version and hadoop version commands.

Command: java -version

Command: hadoop version

Step 5 – Setup SSH Certificate In Hadoop

Command: sudo apt-get install ssh (Optional Step if not already installed)

Command: ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Command: cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Command: chmod 0600 ~/.ssh/authorized_keys

Verify key-based login:

Command: ssh localhost

Command: exit

Step 6 – Setup Hadoop Configuration Files

We need to configure a basic Hadoop single-node cluster as per the requirements of your Hadoop infrastructure.

Command: cd $HADOOP_HOME/etc/hadoop

Command: ls

All the Hadoop configuration files are located in hadoop-2.9.2/etc/hadoop directory.

Open core-site.xml and edit the property mentioned below inside the configuration tag:

Command: gedit core-site.xml
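The property itself did not survive formatting. A typical single-node entry is sketched below; hdfs://localhost:9000 is the conventional default filesystem address for a local single-node setup and is an assumption here:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```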







Edit hdfs-site.xml and add the properties mentioned below inside the configuration tag:

Command:  cd

Command:  mkdir -p /home/hadoop/hdfs/namenode

Command:  cd $HADOOP_HOME/etc/hadoop

Command: gedit hdfs-site.xml
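The XML did not survive formatting. A typical single-node hdfs-site.xml is sketched below; the namenode path matches the mkdir above, while the replication value of 1 and the datanode path are assumptions based on common single-node setups (the datanode directory would need its own mkdir -p as well):

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hadoop/hdfs/datanode</value>
  </property>
</configuration>
```

Replication is set to 1 because a single-node cluster has only one DataNode to hold each block.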















Edit the mapred-site.xml file and edit the property mentioned below inside the configuration tag:

In some cases, the mapred-site.xml file is not available. If so, create it from the mapred-site.xml.template file and save it as mapred-site.xml.

Command: cp mapred-site.xml.template mapred-site.xml

Command: gedit mapred-site.xml
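The property did not survive formatting. The standard entry that tells MapReduce to run on YARN (a sketch, assuming the usual single-node configuration) is:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```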







Edit yarn-site.xml and edit the property mentioned below inside the configuration tag:

Command: gedit yarn-site.xml
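The property did not survive formatting. The usual single-node entry, which enables the shuffle service NodeManagers need for MapReduce jobs (a sketch, assuming the common setup), is:

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```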







Step 7 – Format Namenode

Go to your home directory and format the NameNode.

Command: cd

Command: hdfs namenode -format

This formats HDFS via the NameNode. Run this command only once, before starting the cluster for the first time. Formatting the file system means initializing the directory specified by the dfs.namenode.name.dir property.

Never format a running Hadoop filesystem: you will lose all data stored in HDFS.

Step 8 – Start Hadoop Cluster

Either you can start all daemons with a single command or do it individually.

Command: start-all.sh

The above command is a combination of start-dfs.sh, start-yarn.sh & mr-jobhistory-daemon.sh

Or you can run all the services individually as below:

Command: start-dfs.sh

Command: start-yarn.sh

To check that all the Hadoop services are up and running, run the jps command. You should see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager listed (along with Jps itself).

Command: jps

Step 9 – Access Hadoop Services in Browser

The Hadoop NameNode web UI listens on port 50070 by default. Open http://localhost:50070 in your favorite web browser (the exact address depends on your system setup).


Now access port 8088 (the ResourceManager web UI) for information about the cluster and all applications.


Step 10 – Running A Map-Reduce Example Job on a single node Cluster

Command: cd $HADOOP_HOME

Command: touch input.txt

Command: gedit input.txt (add some sample text for MapReduce task)

Command: hdfs dfs -mkdir -p input

Command: hdfs dfs -put input.txt input

Command: hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount input output

Command: hdfs dfs -ls output

Command: hdfs dfs -cat output/part-r-00000
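Conceptually, the wordcount example job computes the same result as this small local Python sketch (a hypothetical stand-in for illustration only, not part of Hadoop):

```python
from collections import Counter

def wordcount(text: str) -> dict:
    """Count occurrences of each whitespace-separated word,
    mirroring the map (tokenize) and reduce (sum per word)
    phases of the Hadoop wordcount example."""
    return dict(Counter(text.split()))

# Each word/count pair corresponds to one line of output/part-r-00000.
print(wordcount("hello hadoop hello hdfs"))
```

On the cluster the same counting is distributed: mappers emit (word, 1) pairs and reducers sum them per word.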

That’s all! If you have any questions regarding the installation of Hadoop on Ubuntu Linux, feel free to ask in the comment box below, and don’t forget to like and share this article.
