HDFS Configuration Using Ansible

Pankhuri Sharma
7 min read · May 29, 2021


To configure the HDFS cluster, I have used Hadoop version 1.2.1 and JDK 8, and all the systems run RHEL 8.

Let’s first understand the basics: what is Big Data, and what is Hadoop?

🔷Big data means a huge amount of raw data, generally collected on a daily basis. It is so huge that we can’t store it on a single hard disk, and even if we could, it would take a huge amount of time to load the data from the hard disk into the RAM.

🔷Hadoop is open-source software for storing and analyzing big data on a cluster (distributed storage and computing). It follows a master-slave architecture.

Hadoop consists of:

a) Hadoop Distributed File System (HDFS): It is a distributed data storage system. When a file is placed in HDFS, it is divided into blocks, with a default block size of 64 MB in Hadoop 1.x. For example, a 200 MB file would be split into four blocks of 64, 64, 64, and 8 MB. These blocks are then replicated across the cluster’s various nodes (called DataNodes), and all the metadata is stored on the NameNode.

Here the NameNode is the master & the DataNodes are the slave nodes.

b) MapReduce: It is used for processing huge amounts of data in parallel across the cluster.

Here JobTracker is the master and the TaskTrackers are the slaves.

In this task, we will create a multi-node HDFS cluster in fully distributed mode with an Ansible playbook. Before we go on, let’s understand some terms useful for understanding our task.

🔻 Terms and files related to Hadoop:

  • NameNode: The NameNode is the master node that stores the location information of all the files in the HDFS, i.e. it stores all the metadata of the HDFS.

    A client comes to the NameNode to get the details of where to access a file, and the NameNode replies with the locations of the DataNodes storing that particular file’s blocks.

  • DataNode: The DataNodes are commodity machines that provide their storage and hold the actual files. They constantly send keep-alive (heartbeat) messages to the master node to signal that they are working fine and that files can be stored on them.
  • hdfs-site.xml file: The hdfs-site.xml file contains information such as the replication factor and the NameNode and DataNode directory paths on our local file systems.
  • core-site.xml file: The core-site.xml file contains information such as the port number used by the Hadoop instance, the memory allocated for the file system, etc.

🔻 Terms related to Ansible:

  • Control node: The node where Ansible is installed and from which ad-hoc commands and playbooks are executed.
  • Managed nodes: These are the nodes that Ansible configures.
  • Inventory: A file with information about the managed hosts that is needed to connect to them. It is also useful for grouping managed nodes for easier management.
  • Task: A block of code that runs a module that does something on the managed node, e.g. starting a service.
  • Play: To perform a series of tasks on the managed nodes, we specify them in a play. The play specifies the order in which the tasks must be completed.
  • Playbook: It is a file that contains one or more plays.

✅ Now, let’s start with the task. Our target is to create an ansible playbook that configures a master node and a slave node.

📌 First, let’s create the inventory according to our requirements. In the inventory file, I have described the master and slave nodes’ IP addresses, the user, the password, and the protocol to connect.

Inventory File
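The original screenshot isn’t reproduced here, but a minimal inventory along these lines would do the job (the IPs, user, and password below are placeholders, not the values from the repo):

```ini
# inventory (placeholder values)
[namenode]
192.168.1.10

[datanode]
192.168.1.11

[all:vars]
ansible_user=root
ansible_ssh_pass=redhat
ansible_connection=ssh
```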

📌 I have created a playbook called “hadoop.yml” that takes its variables from “hadoop_var.yml” and uses the hdfs-site.xml & core-site.xml files of both the master and the slave for configuration.

Files in the workspace
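Roughly, the workspace looks like this (the exact names of the master and slave template files may differ in the repo):

```
workspace/
├── hadoop.yml        # the playbook
├── hadoop_var.yml    # variables file
├── hdfs-site.xml     # template for the master
├── core-site.xml     # template for the master
├── hdfs-site1.xml    # template for the slave (name assumed)
└── core-site1.xml    # template for the slave (name assumed)
```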

📌 The variables file includes the name of the directory used to mount the DVD, the directories for the namenode and the datanode, and the port number to work on.

hadoop_var.yml, the file that contains all the variables with their values
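As a sketch, such a variables file could look like this (the variable names and values are illustrative, not copied from the repo):

```yaml
# hadoop_var.yml (illustrative names and values)
dvd_dir: /dvd1          # mount point for the RHEL 8 installation DVD
namenode_dir: /nn       # directory where the NameNode keeps its metadata
datanode_dir: /dn       # directory where the DataNode stores blocks
port_no: 9001           # port on which the NameNode listens
```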

To configure the NameNode or a DataNode, we need to edit the hdfs-site.xml and core-site.xml files, where, inside the <configuration> tag, we create <property> tags, each of which has a <name> tag and a <value> tag.

📌 For configuring the NameNode, we need to declare that the system is configured as the namenode of the distributed file system and also give the name of the directory that will be used for storing metadata.

hdfs-site.xml for configuring master node
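In place of the screenshot, a minimal template along these lines would work (in Hadoop 1.x, dfs.name.dir sets the metadata directory; the Jinja2 variable comes from hadoop_var.yml):

```xml
<!-- hdfs-site.xml template for the master -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>{{ namenode_dir }}</value>
  </property>
</configuration>
```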

We also have to configure the core-site.xml file, where we tell which system is the namenode, what its IP address is, and on which port number we want it to work.

In this, I have used Ansible Facts to gather the IP address of the NameNode.

core-site.xml for configuring master node
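Again as a minimal sketch (fs.default.name is the Hadoop 1.x property; which Ansible fact supplies the IP is an assumption here, ansible_default_ipv4 being a common choice):

```xml
<!-- core-site.xml template for the master; the IP comes from Ansible Facts -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ ansible_default_ipv4.address }}:{{ port_no }}</value>
  </property>
</configuration>
```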

📌 The configuration files for the slave node include hdfs-site.xml, where we tell the system to be configured as a datanode and specify the directory where all the data will be kept.

hdfs-site.xml for configuring data node
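A sketch of the slave template (dfs.data.dir is the Hadoop 1.x property for the DataNode storage directory):

```xml
<!-- hdfs-site.xml template for the slave -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>{{ datanode_dir }}</value>
  </property>
</configuration>
```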

The core-site.xml file tells which system is the NameNode by its IP address; it is otherwise the same as the core-site.xml file of the master. But since the play that configures the datanode runs on the slave, the output of Ansible Facts would be the DataNode’s IP instead of the NameNode’s. So I have directly put in the IP of the NameNode.

core-site.xml for slave node
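A sketch with a placeholder for the hard-coded NameNode IP:

```xml
<!-- core-site.xml template for the slave;
     192.168.1.10 stands in for the NameNode's actual IP -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.10:{{ port_no }}</value>
  </property>
</configuration>
```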

I have used the following modules in the playbook for configuring the HDFS cluster:

🔸 file module: It is used to create, delete or change attributes of files, directories, and symlinks.

🔸 mount module: It allows us to mount a volume, as it controls active and configured mount points in “/etc/fstab”.

🔸 yum_repository module: It is used to add or remove yum repositories in an RPM-based Linux.

🔸 copy module: It is used to copy a file to a remote machine.

🔸 yum module: We can install, uninstall, upgrade, or downgrade packages using this module.

🔸 command module: It is used to execute commands but is not processed through the shell. It is not an idempotent module.

🔸 template module: It is similar to the copy module, but we can do variable interpolation with it, as the file is processed by the Jinja2 templating engine.

🔸 firewalld module: We can use it to add or delete ports and services to the firewall rules.

🔸 pause module: We can use this module to pause the execution of the playbook for a specific amount of time or until the prompt has been acknowledged.

🔸 shell module: It is similar to the command module but we can use shell operations (“<”, “>”, etc.) using it.

🔰 To configure the HDFS cluster, we have to do these things:

  1. Install JDK and Hadoop on every node.
  2. Configure hdfs-site.xml and core-site.xml on both the master and the slave. On the master node, we also format the directory for storing metadata.
  3. Start the namenode and datanode services.

Ansible playbook to configure the HDFS cluster: the hadoop.yml file
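The playbook itself was shown as a screenshot; below is a condensed sketch of the master’s play that follows the same steps using the modules listed above. The package file names, repository paths, and task details are illustrative and may differ from the actual repo, and the slave’s play is analogous (it deploys the slave templates, skips the format step, and starts the datanode instead):

```yaml
# hadoop.yml (condensed sketch of the master's play; details are assumed)
- name: Configure the NameNode
  hosts: namenode
  vars_files:
    - hadoop_var.yml
  tasks:
    - name: Create the mount directory for the DVD
      file:
        path: "{{ dvd_dir }}"
        state: directory

    - name: Mount the RHEL 8 DVD
      mount:
        src: /dev/cdrom
        path: "{{ dvd_dir }}"
        fstype: iso9660
        state: mounted

    - name: Configure a yum repository from the DVD
      yum_repository:
        name: dvd-appstream
        description: RHEL 8 AppStream
        baseurl: "file://{{ dvd_dir }}/AppStream"
        gpgcheck: no

    - name: Copy the JDK and Hadoop installers      # file names are placeholders
      copy:
        src: "{{ item }}"
        dest: /root/
      loop:
        - jdk-8u171-linux-x64.rpm
        - hadoop-1.2.1-1.x86_64.rpm

    - name: Install JDK                              # rpm is not idempotent, hence command
      command: rpm -ivh /root/jdk-8u171-linux-x64.rpm

    - name: Install Hadoop                           # --force skips the JDK dependency check
      command: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force

    - name: Create the NameNode metadata directory
      file:
        path: "{{ namenode_dir }}"
        state: directory

    - name: Deploy the configuration templates
      template:
        src: "{{ item }}"
        dest: "/etc/hadoop/{{ item }}"
      loop:
        - hdfs-site.xml
        - core-site.xml

    - name: Allow the NameNode port through the firewall
      firewalld:
        port: "{{ port_no }}/tcp"
        permanent: yes
        immediate: yes
        state: enabled

    - name: Format the NameNode directory
      shell: echo Y | hadoop namenode -format

    - name: Start the NameNode daemon
      command: hadoop-daemon.sh start namenode
```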

To run an Ansible playbook we use “ansible-playbook hadoop.yml”.

After successful execution of the playbook, we can run the “jps” command in the nodes to see if they are successfully configured as NameNode or DataNode.

jps output on the master (NameNode) and the slave (DataNode)
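The screenshots would show something like the following (the PIDs are illustrative):

```
# on the master
$ jps
3482 NameNode
3629 Jps

# on the slave
$ jps
2941 DataNode
3056 Jps
```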

We can also run “hadoop dfsadmin -report” to see the DataNodes available along with the storage capacity provided by them.

Checking the total storage capacity provided by the DataNodes.

🔰 To see the full code, use the link given below: https://github.com/pankhurisharma390/hdfs-setup-ansible

Thus, we have successfully configured the HDFS cluster with an Ansible playbook!

Thank you for taking the time to read this article; I hope it was beneficial.
