Hadoop Tutorial: A Beginner’s Guide to Learning Hadoop Online
Hadoop is an open-source software framework used for distributed storage and processing of big data using the MapReduce programming model. Written in Java, the framework is developed by the Apache Software Foundation; it originated in 2006 and reached its 1.0 release in 2011.
This Hadoop tutorial covers the characteristics of Hadoop, the prerequisites for learning it, how to install it on Linux and Windows, the composition of the Hadoop framework, how Hadoop works, and the popular tools in the Hadoop ecosystem.
Let’s begin with the Hadoop tutorial.
What are the Characteristics of Hadoop?
Hadoop is a commonly used tool in the big data industry. It has several important features and characteristics:
- Open-Source – Hadoop is an open-source project and hence one can modify it according to the requirements.
- Reliability – Due to the replication of data in the cluster, data is stored reliably even when a machine goes down.
- High Availability – Data remains available and accessible even when a machine or some hardware components fail, as there are multiple copies of the data.
- Scalability – New hardware can be easily added to the nodes without downtime as Hadoop is highly scalable.
- Economic – As Hadoop does not require a specialized machine and can work with a cluster of commodity machines, it provides a huge cost reduction.
- Easy-To-Use – The framework takes care of distributed computing, so clients do not need to worry about it.
- Fault Tolerance – By default, three replicas of each block are stored across the cluster, so data can be recovered from another replica when a node goes down (see the sketch after this list).
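As a quick way to see replication and cluster health in practice, the following commands are a minimal sketch (they assume a running cluster and a hypothetical file at /user/hadoop/sample.txt): hdfs dfsadmin -report summarizes live DataNodes and capacity, and hdfs dfs -setrep changes a file’s replication factor.
$ hdfs dfsadmin -report
$ hdfs dfs -setrep -w 3 /user/hadoop/sample.txt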
What are the Prerequisites for Learning Hadoop?
Hadoop is written in Java, so a good knowledge of the language is beneficial. Apart from that, you should have prior exposure to database concepts and be comfortable with the Linux or Windows operating system.
How to Install Hadoop?
For Linux:
Linux is the best platform to run Hadoop. On Linux, you can build and configure everything from scratch, which gives you a better grasp of the concepts.
Pre-requisites
Step 1: Install Java first. You can download it from the Oracle site.
Step 2: Add a dedicated Hadoop system user. Open the Linux terminal and type the following commands:
$ su
password:
# useradd hadoop
# passwd hadoop
New passwd:
Retype new passwd:
Step 3: SSH setup and key generation. The following commands generate an SSH key pair and enable passwordless login to localhost:
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
Step 4: Disable IPv6. Hadoop does not support IPv6, and on some Linux releases IPv6 is enabled by default, which can cause the Hadoop daemons to bind to the wrong addresses.
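One common way to handle this (a sketch; the sysctl settings below are typical for Debian/Ubuntu-style systems and may differ on your distribution) is to disable IPv6 system-wide, or simply tell the Hadoop JVMs to prefer IPv4:
$ sudo sh -c 'echo "net.ipv6.conf.all.disable_ipv6 = 1" >> /etc/sysctl.conf'
$ sudo sh -c 'echo "net.ipv6.conf.default.disable_ipv6 = 1" >> /etc/sysctl.conf'
$ sudo sh -c 'echo "net.ipv6.conf.lo.disable_ipv6 = 1" >> /etc/sysctl.conf'
$ sudo sysctl -p
Alternatively, add the following line to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"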
Installing
Step 1: Download Hadoop from the Apache website.
Step 2: Format the HDFS filesystem. Formatting creates an empty filesystem by creating the storage directories and the initial versions of the namenode’s persistent data structures.
The steps are:
Format the filesystem – % hadoop namenode -format
Starting and stopping the daemons (MapReduce 1) –
% start-dfs.sh
% start-mapred.sh
% stop-mapred.sh
% stop-dfs.sh
Starting and stopping the daemons (MapReduce 2 / YARN) –
% start-dfs.sh
% start-yarn.sh
% stop-yarn.sh
% stop-dfs.sh
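Once the daemons are up, a quick sanity check (a sketch; it assumes the JDK’s jps tool and the Hadoop scripts are on your PATH) is to list the running Java processes and the HDFS root directory. jps should show daemons such as NameNode, DataNode, ResourceManager and NodeManager.
% jps
% hadoop fs -ls /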
Step 3: Hadoop configuration – Make changes to the configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml) according to your Hadoop infrastructure. They live in the following directory:
$ cd $HADOOP_HOME/etc/hadoop
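For example, a minimal pseudo-distributed setup (a sketch, not a complete configuration; the localhost address and port 9000 are assumptions) sets the default filesystem in core-site.xml and lowers the replication factor to 1 in hdfs-site.xml, since there is only one DataNode:
$ cat > core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
$ cat > hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
Note that this overwrites the existing (mostly empty) template files, so back them up first if you have already edited them.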
Also Read: Top Hadoop Interview Questions & Answers
Windows
This tutorial is for running Hadoop on Windows using Hortonworks Data Platform (HDP).
Prerequisites
Step 1: Install Java as mentioned in the Linux steps.
Step 2: Install Python. You can download it from the Python site.
Step 3: Ensure HDP finds Python by updating the PATH System Environment variable.
Go to Computer > Properties > Advanced System Settings > Environment variables.
Step 4: Install HDP. Download from the HDP site.
Step 5: Validate the install by running smoke tests. In the command prompt, switch to the ‘hadoop’ user:
> cd C:\hdp
> Run-SmokeTests Hadoop
This will fire up a MapReduce job.
Now Hadoop is ready to use on your Windows machine.
Also Read: Career Advantages of Hadoop Certification
What is the Composition of the Hadoop Framework?
In this section of the Hadoop tutorial, we will discuss the composition of the Hadoop framework.
The base architecture of an Apache Hadoop framework consists of the following modules:
- Hadoop Common – A collection of common Java libraries and utilities that support the other Hadoop modules. Also known as Hadoop Core, it contains the Java Archive (JAR) files and scripts required to start Hadoop, along with the source code and documentation.
- MapReduce – A framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable manner.
It refers to the two tasks that Hadoop programs perform:
- The Map Task: In this task, input data is converted into a set of data where individual elements are broken down into tuples (key/value pairs).
- The Reduce Task: It takes input from a Map Task and combines those data tuples into a smaller set of tuples. This always takes place after the Map Task.
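The classic word-count job illustrates both tasks: the map phase emits (word, 1) pairs and the reduce phase sums them for each word. A minimal way to try it (a sketch; the examples JAR path, the local file name and the input/output directories are assumptions) uses the examples JAR shipped with Hadoop:
$ hadoop fs -mkdir -p /user/hadoop/wc-in
$ hadoop fs -put localfile.txt /user/hadoop/wc-in
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/hadoop/wc-in /user/hadoop/wc-out
$ hadoop fs -cat /user/hadoop/wc-out/part-r-00000
The output directory must not exist before the job runs; the job creates it and writes one part file per reducer.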
Hadoop YARN – A cluster management technology, this platform is responsible for managing computing resources in clusters and using them for scheduling users’ applications. Its main components are:
- Resource Manager – Manages and allocates cluster resources.
- Node Manager – Manages and enforces node resources allocations.
- Application Master – Manages application lifecycle and task scheduling.
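To see these components on a running cluster, the yarn command line (a sketch; it assumes the YARN daemons are already started) can list the Node Managers registered with the Resource Manager and the applications currently being tracked:
$ yarn node -list
$ yarn application -list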
Hadoop Distributed File System (HDFS) – HDFS stores data on commodity machines and is designed to run on large clusters of small computer machines.
HDFS uses a master/slave architecture where the master consists of a single NameNode that manages the file system metadata and one or more slave DataNodes that store the actual data.
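In practice, clients interact with this architecture through hdfs dfs commands: the NameNode resolves paths and block locations, while the DataNodes stream the actual bytes. A small sketch (the directory and file names are hypothetical):
$ hdfs dfs -mkdir -p /user/hadoop/demo
$ hdfs dfs -put report.csv /user/hadoop/demo
$ hdfs dfs -ls /user/hadoop/demo
$ hdfs dfs -cat /user/hadoop/demo/report.csv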
How Does Hadoop Work?
We have already seen the architecture of Hadoop and have a basic understanding of the framework. Let us now take a look at how it works internally:
In Hadoop, the data submitted by clients is stored in HDFS and processed in MapReduce.
How is Data Stored in HDFS?
HDFS is the storage layer of Hadoop. There are 2 daemons that run as part of HDFS:
- Namenode – It runs on the master node and stores the metadata (file names, block locations, permissions).
- Datanode – It runs on the slave nodes and stores the actual data.
The data is broken into chunks called blocks and each block is replicated, then stored in different nodes in the cluster as per the replication factor (the default is 3).
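You can inspect this block layout and replication for any stored file with the HDFS filesystem checker (a sketch; the path is hypothetical and the command is normally run as the HDFS superuser):
$ hdfs fsck /user/hadoop/demo/report.csv -files -blocks -locations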
Also Read: Top Reasons to Learn Python and Hadoop
How is Data Processed?
MapReduce is the processing layer of Hadoop and, together with YARN, relies on 2 daemons:
- Resource Manager – It accepts the job submitted by the client and schedules it as a set of smaller tasks across the cluster.
- Node Manager – It runs those tasks in parallel, in a distributed manner, on the data stored in the DataNodes.
Here are the steps in summary:
Step 1: The user submits a job to Hadoop, and the input data is broken into blocks of 64 MB or 128 MB (the default block size depends on the Hadoop version).
Step 2: These blocks are distributed to different nodes. Processing begins once all the blocks of the data are stored on the DataNodes.
Step 3: The Resource Manager schedules the program on individual nodes.
Step 4: After the data is processed, the output is written back to HDFS.
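While a job from step 3 is running, the Resource Manager’s view of it can be followed from the command line (a sketch; replace <application-id> with the ID shown by yarn application -list, and note that yarn logs requires log aggregation to be enabled):
$ yarn application -status <application-id>
$ yarn logs -applicationId <application-id>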
Also Read: Spark Vs Hadoop
What are the Popular Tools in Hadoop?
This section of the Hadoop tutorial focuses on different popular tools in Hadoop.
Apache Pig – It is a high-level platform for analyzing large data sets and provides a high-level language called Pig Latin. It is used for the following purposes:
- ETL data pipeline
- Research on raw data
- Iterative processing
It is extremely popular and has a wide range of user groups; for example, it has been reported that around 90% of Yahoo’s MapReduce and 80% of Twitter’s MapReduce workloads are written in Pig.
Apache Hive – It is a data warehouse layer that uses MapReduce to analyze data stored on HDFS. It provides a query language called HiveQL that is similar to the Structured Query Language (SQL) standard. It is about as popular as Pig and is used by companies such as CNET and Facebook.
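As a flavour of how close HiveQL is to SQL, a query can be run from the shell with the beeline client (a sketch; the connection URL, the default database and the page_views table are assumptions):
$ beeline -u jdbc:hive2://localhost:10000/default -e "SELECT country, COUNT(*) AS visits FROM page_views GROUP BY country ORDER BY visits DESC LIMIT 10;"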
HBase – A NoSQL, open-source database that is built on top of the Hadoop file system (HDFS). There are two main components of HBase – 1) HBase Master and 2) RegionServer.
Apache Zookeeper – Zookeeper is a coordination service that allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers. It lets developers focus on core application logic without worrying about the distributed nature of the application.
Also Read: Mastering Hadoop: Pros and Cons of Using Hadoop technologies
Learn In-Depth
To become an expert in Hadoop, you need hands-on experience in many areas, along with proficiency in the various tools and techniques associated with it.
We hope this Hadoop Tutorial helped you understand Hadoop in-depth. Explore Hadoop Courses and Big Data courses on Shiksha Online to get the required practical skills to start your career in the field of Hadoop and Big Data.