Hadoop Installation

Hadoop can be installed on the GNU/Linux platform, where the installation process is straightforward. However, if you are installing Hadoop on an operating system other than Linux, you'll need to set it up on VirtualBox.

In this discussion, we will cover the installation of Hadoop in both Linux and VirtualBox environments.

Installing/verifying Java -

Java is the primary requirement for installing and running Hadoop. A supported Java JDK must be installed on the system (Hadoop 3.1.0, used below, runs on Java 8). You can check the installed version using the command below.

> java -version

If Java is installed on the machine, the command prints output similar to the following.

java version "1.8.0_161" 
Java(TM) SE Runtime Environment (build 1.8.0_161-b12) 
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
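
As a rough sketch, the check above can be scripted. The helper name java_major_from_line is hypothetical (it is not part of Java or Hadoop), and it assumes the first line of the output has the version "..." shape shown above. It handles both the legacy 1.8.0_161 numbering and the newer 11.0.2 style.

```shell
# Hypothetical helper: extract the major Java version from the first
# line printed by `java -version`.
java_major_from_line() {
    # Pull out the quoted version string, e.g. 1.8.0_161 or 11.0.2
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    [ -n "$ver" ] || return 1
    case "$ver" in
        1.*) printf '%s\n' "$ver" | cut -d. -f2 ;;  # legacy scheme: 1.8.0_161 -> 8
        *)   printf '%s\n' "$ver" | cut -d. -f1 ;;  # newer scheme: 11.0.2 -> 11
    esac
}

# `java -version` writes to stderr, so redirect it before reading.
if command -v java >/dev/null 2>&1; then
    line=$(java -version 2>&1 | head -n 1)
    echo "Detected Java major version: $(java_major_from_line "$line")"
else
    echo "java not found on PATH"
fi
```

If the script reports that java is not found, continue with the installation steps that follow.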

If Java is not installed, install it with the steps below.

Step1 - Download the Java JDK (JDK <latest-version>, x64 .tar.gz) from the following link: https://www.oracle.com/java/technologies/downloads/

Download jdk-8u<latest-version>-linux-x64.tar.gz from the list for a 64-bit system. For a different architecture, choose the matching file.

For example, if the latest-version is 161, the file to download would be jdk-8u161-linux-x64.tar.gz.

Step2 - Move the .tar.gz file to the target directory where Java is to be installed, and change to that directory.

Step3 - Unpack the tarball and install the JDK.

> tar zxvf jdk-8u<latest-version>-linux-x64.tar.gz

For example, if the latest-version is 161, the command would be

> tar zxvf jdk-8u161-linux-x64.tar.gz

The Java Development Kit files are unpacked into a directory named jdk1.8.0_<latest-version> within the current folder.

Step4 - Delete the .tar.gz file to save disk space.

> rm jdk-8u<latest-version>-linux-x64.tar.gz
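
Unpacking the tarball does not by itself put the java command on the PATH. The sketch below shows one way to do that for the current shell. The helper names, the UPDATE value, and the TARGET_DIR location are assumptions for illustration; substitute the directory you actually unpacked into.

```shell
# Hypothetical helpers: map an update number to the file and directory
# names used in the steps above.
jdk_tarball_name() { printf 'jdk-8u%s-linux-x64.tar.gz\n' "$1"; }
jdk_dir_name()     { printf 'jdk1.8.0_%s\n' "$1"; }

UPDATE=161                               # example value from the text
TARGET_DIR="${TARGET_DIR:-$HOME/java}"   # assumed install location

# Point JAVA_HOME at the unpacked JDK and put its bin/ first on PATH.
export JAVA_HOME="$TARGET_DIR/$(jdk_dir_name "$UPDATE")"
export PATH="$JAVA_HOME/bin:$PATH"
echo "JAVA_HOME=$JAVA_HOME"
```

These exports last only for the current terminal session; adding them to the shell startup file makes them permanent.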

Now verify the Java version with the -version option from the terminal. The bin directory of the unpacked JDK must be on the PATH for the java command to be found.

> java -version

This produces the output below -

java version "1.8.0_161" 
Java(TM) SE Runtime Environment (build 1.8.0_161-b12) 
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

Pre-installation Setup -

Before installing Hadoop, complete the Linux installation and make sure it is functioning properly. Some pre-installation setup is also required in the Linux environment. The steps below outline this setup -

a. Creating a User Group and User -

Creating a Hadoop user group and user is not mandatory, but it is recommended before installing Hadoop. Open the Linux terminal and type the command below to create the user group.

$ sudo addgroup hadoop

If the command is successful, you will see the messages below and the command prompt will be displayed again.

$ sudo addgroup hadoop
Adding group ‘hadoop’ (GID 1001) …
Done.
$

Type the command below to create the user.

$ sudo adduser --ingroup hadoop hdpuser

If the command is successful, you will be prompted for a password and the user details shown below.

$ sudo adduser --ingroup hadoop hdpuser
Adding user ‘hdpuser’ ...
Adding new user ‘hdpuser’ (1002) with the group ‘hadoop’ ...
Creating home directory ‘/home/hdpuser’ ...
Copying files from ‘/etc/skel’ ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hdpuser
Enter the new value or press enter for the default
	Full Name[]: 
	Room Number[]:
	Work Phone[]:
	Home Phone[]:
	Other[]:
Is the information correct? [Y/n]  Y
$

Once the command prompt appears again, the user has been created successfully.
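
To confirm the result without re-reading the prompts, a small check along these lines may help. The function name membership_status is hypothetical; it relies only on the standard id command.

```shell
# Hypothetical helper: report whether a user exists and belongs to a group.
membership_status() {
    user="$1"; group="$2"
    # `id -nG USER` lists the user's group names; it fails for unknown users.
    grp_list=$(id -nG "$user" 2>/dev/null) || { echo "no such user"; return 0; }
    for g in $grp_list; do
        [ "$g" = "$group" ] && { echo "member"; return 0; }
    done
    echo "not a member"
}

# User and group names as created in the steps above.
membership_status hdpuser hadoop
```

The call reports "member" only when both the group and the user were created as described.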

b. SSH Server installation -

SSH is used for remote login. Hadoop requires SSH to manage its nodes, i.e. the remote machines and the local machine, and its start and stop scripts log in to localhost over SSH. Installing the SSH server is therefore a mandatory step.

Step1 - Installation of SSH: Type the commands below to install the SSH server and pdsh.

$ sudo apt-get install ssh

$ sudo apt-get install pdsh

Step2 - After installing the SSH server, log in as the newly created user so that the keys generated in the next steps belong to the account that runs Hadoop. Skip this step if you did not create a separate user. Type the command below to log in as that user.

$ su - hdpuser

Step3 - Generate Key Pairs: As the next step, create the SSH key using the ssh-keygen command.

$ ssh-keygen -t rsa -P ""

Step4 - After successful key generation, the public key needs to be added to the authorized keys file to enable login without prompting for a password.

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Step5 - Restrict the permissions of the file that contains the keys.

$ chmod 0600 ~/.ssh/authorized_keys
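
The outcome of Step4 and Step5 can be checked with a short script such as the following sketch. The function names are hypothetical, and stat -c is the GNU form of the command as found on Linux.

```shell
# Print the octal permission bits of a file (GNU stat, as found on Linux).
file_mode() { stat -c '%a' "$1"; }

# Hypothetical check: the key file must exist and have mode 600.
check_ssh_setup() {
    keyfile="$1"
    [ -f "$keyfile" ] || { echo "missing: $keyfile"; return 1; }
    mode=$(file_mode "$keyfile")
    [ "$mode" = "600" ] || { echo "mode is $mode, expected 600"; return 1; }
    echo "ok"
}

# Report on the current user's key file without aborting on failure.
check_ssh_setup ~/.ssh/authorized_keys || true
```

A stricter test is to run ssh -o BatchMode=yes localhost true, which succeeds only if login works without a password prompt.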

c. Downloading Hadoop

Download and extract the Hadoop 3.1.0 binary release from the Apache Software Foundation using the following commands.

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.0/hadoop-3.1.0.tar.gz

Once the download completes, untar the archive.

$ tar -xzf hadoop-3.1.0.tar.gz

Rename the extracted folder to hadoop to avoid confusion.

$ mv hadoop-3.1.0 hadoop
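
Running the extract and rename commands a second time would fail or nest directories, because the hadoop folder already exists. A guarded variant might look like this sketch; the function name unpack_and_rename is hypothetical.

```shell
# Hypothetical helper: extract a tarball and rename the resulting
# directory, skipping the work if the target directory already exists.
unpack_and_rename() {
    tarball="$1"; version_dir="$2"; target="$3"
    if [ -d "$target" ]; then
        echo "skip: $target already exists"
        return 0
    fi
    tar -xzf "$tarball" && mv "$version_dir" "$target" && echo "unpacked to $target"
}

# Only act when the downloaded archive is present in the current directory.
if [ -f hadoop-3.1.0.tar.gz ]; then
    unpack_and_rename hadoop-3.1.0.tar.gz hadoop-3.1.0 hadoop
fi
```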

Installing Hadoop in Standalone Mode

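The steps below show the installation of Hadoop on a single machine. Note that setting fs.defaultFS and dfs.replication, as done below, runs the HDFS daemons on the local machine, which the Hadoop documentation calls pseudo-distributed operation.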

a. Setting Up Hadoop

Step1 - First, update the .bashrc file with a few environment variables.

$ vi ~/.bashrc

Copy below information to bashrc and save it

# Set Hadoop-related environment variables.
# HADOOP_HOME points to the Hadoop home directory
# (Hadoop 3 uses HADOOP_HOME in place of the deprecated HADOOP_PREFIX).
export HADOOP_HOME=/home/hdpuser/hadoop
# JAVA_HOME points to the Java home directory; adjust it to where the JDK
# was installed (we will also configure JAVA_HOME directly for Hadoop later).
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
# Add the Hadoop bin/ and sbin/ directories to PATH.
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export HADOOP_YARN_HOME=${HADOOP_HOME}
:wq
$ source ~/.bashrc
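
If these lines are added by a script rather than by hand, it helps to avoid writing the same export twice. The following is a minimal sketch of that idea; append_once and RC_FILE are hypothetical names, and the demonstration writes to a scratch file rather than the real .bashrc.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present.
append_once() {
    file="$1"; line="$2"
    # -x: match the whole line, -F: fixed string, -q: quiet
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstrated on a scratch file; point RC_FILE at ~/.bashrc once satisfied.
RC_FILE="${RC_FILE:-$(mktemp)}"
append_once "$RC_FILE" 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle'
append_once "$RC_FILE" 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle'   # second call is a no-op
cat "$RC_FILE"
```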

Step2 - Update the hadoop-env.sh file with the Java home directory. Only JAVA_HOME needs to be set in this file.

$ vi /home/hdpuser/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
:wq
$

Step3 - Update the core-site.xml file to set the temp directory and the default file system name.

$ mkdir /home/hdpuser/tmp

$ vi /home/hdpuser/hadoop/etc/hadoop/core-site.xml

Copy the below information in between the configuration tags (<configuration></configuration>).

<configuration>
<property> 
<name>hadoop.tmp.dir</name> 
<value>/home/hdpuser/tmp</value>
</property>
<property> 
<name>fs.defaultFS</name> 
<value>hdfs://localhost:9000</value> 
</property>
</configuration>
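
After editing, it can be useful to read a property back to confirm that it was saved as intended. The sketch below does this with awk. The function name get_property is hypothetical, and it assumes one tag per line as in the listing above; it is not a general XML parser.

```shell
# Hypothetical helper: print the <value> that follows a given <name>.
# Assumes each <name> and <value> tag sits on its own line.
get_property() {
    awk -v want="$2" '
        /<name>/  { gsub(/.*<name>|<\/name>.*/, ""); name = $0 }
        /<value>/ { gsub(/.*<value>|<\/value>.*/, ""); if (name == want) print $0 }
    ' "$1"
}

# Pass the path to core-site.xml as the first argument; it defaults to
# a file of that name in the current directory.
CONF_FILE="${1:-core-site.xml}"
if [ -f "$CONF_FILE" ]; then
    echo "fs.defaultFS = $(get_property "$CONF_FILE" fs.defaultFS)"
fi
```

An empty result usually means a typo in the property name, including stray spaces inside the name tag.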

Step4 - Update mapred-site.xml to set the framework used to run MapReduce jobs. Hadoop 3 has no job tracker; jobs run on YARN.

$ vi /home/hdpuser/hadoop/etc/hadoop/mapred-site.xml

Copy the below information in between the configuration tags (<configuration></configuration>)

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Step5 - Update hdfs-site.xml to set the replication factor.

$ vi /home/hdpuser/hadoop/etc/hadoop/hdfs-site.xml

Copy the below information in between the configuration tags (<configuration></configuration>)

<configuration>
<property> 
<name>dfs.replication</name> 
<value>1</value> 
</property>
</configuration>

The configuration is now complete.

2. Verify the Hadoop Installation

Step1 - Format the name node before Hadoop is run for the first time.

$ hdfs namenode -format

Remember, this is a one-time activity: do it only before the first start of Hadoop, not every time. Formatting again erases the existing HDFS metadata.
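
Because formatting is destructive, a guard like the following sketch can help avoid doing it twice. It assumes that the NameNode keeps its metadata under dfs/name inside the hadoop.tmp.dir directory (the default when dfs.namenode.name.dir is not set) and that a formatted directory contains current/VERSION. The names needs_format and TMP_BASE are hypothetical.

```shell
# Hypothetical guard: a formatted NameNode directory contains current/VERSION.
needs_format() {
    if [ -f "$1/current/VERSION" ]; then echo "no"; else echo "yes"; fi
}

# Hypothetical variable; set it to your hadoop.tmp.dir value.
TMP_BASE="${TMP_BASE:-$HOME/tmp}"
NAME_DIR="$TMP_BASE/dfs/name"

# This sketch only reports; it never runs the format command itself.
if [ "$(needs_format "$NAME_DIR")" = "yes" ]; then
    echo "NameNode directory is not formatted; format it once before the first start"
else
    echo "Already formatted; skipping to protect existing HDFS metadata"
fi
```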

Step2 - Start Hadoop: There are two scripts to start Hadoop, one for HDFS and one for YARN.

$ start-dfs.sh

$ start-yarn.sh

Step3 - To check the services running on Hadoop, use the command below.

$ jps
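
The jps output lists one process ID and class name per line. The sketch below compares that output with a list of expected daemon names and prints any that are missing. The function name missing_daemons is hypothetical, and the expected list (NameNode, DataNode, SecondaryNameNode) is an assumption for a single-node HDFS setup.

```shell
# Hypothetical helper: read jps output on stdin and print each expected
# daemon name (given as arguments) that does not appear in it.
missing_daemons() {
    jps_output=$(cat)
    for d in "$@"; do
        # Compare the second column exactly, so NameNode does not
        # accidentally match SecondaryNameNode.
        printf '%s\n' "$jps_output" \
            | awk -v d="$d" '$2 == d { found = 1 } END { exit !found }' \
            || printf '%s\n' "$d"
    done
}

if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons NameNode DataNode SecondaryNameNode
else
    echo "jps not found on PATH"
fi
```

No output from the helper means every expected daemon is running.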

2.1. Accessing Hadoop on Browser

The default port of the NameNode web interface is 9870 in Hadoop 3 (it was 50070 in Hadoop 2). Use the following URL to view Hadoop services in the browser.

http://localhost:9870/

2.2. Verify All Applications for Cluster

The default port to access all applications of the cluster is 8088. Use the following URL to visit this service.

http://localhost:8088/
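
On a machine without a browser, both web interfaces can be probed from the terminal. This is a minimal sketch that assumes curl is installed. The function name probe and the NAMENODE_UI_PORT variable are hypothetical.

```shell
# Hypothetical helper: report whether a URL answers without an HTTP error.
probe() {
    # -s silent, -f treat HTTP errors as failure, --max-time bounds the wait
    if curl -sf --max-time 5 -o /dev/null "$1" 2>/dev/null; then
        echo "up"
    else
        echo "down"
    fi
}

NAMENODE_UI_PORT="${NAMENODE_UI_PORT:-9870}"   # Hadoop 3.x default; Hadoop 2.x used 50070
echo "NameNode UI:        $(probe "http://localhost:$NAMENODE_UI_PORT/")"
echo "ResourceManager UI: $(probe "http://localhost:8088/")"
```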

2.3. To stop Hadoop, use the commands below

$ stop-dfs.sh

$ stop-yarn.sh