Integration of LVM with Hadoop and providing Elasticity to DataNode Storage

Pankhuri Sharma
Mar 16, 2021

Firstly, let's understand what LVM is.

Logical Volume Management (LVM) is a tool that helps in managing disk storage flexibly. With LVM we can customize our storage in many ways; for example, we can combine several hard disks so that they work as a single hard disk.

Some terminologies used are:

  1. Partition: A Hard disk can be divided into parts which are known as partitions.
  2. Physical Volume(PV): It can be defined as any physical storage device such as a Hard disk or even a partition, that has been initialized as a Physical Volume.
  3. Volume Group(VG): PVs give their complete storage to a box (a storage pool), and the OS considers this box to be a new hard disk. This box is known as the Volume Group (a group of volumes).
  4. Logical Volume(LV): A volume carved out of a VG is not a separate physical device, which is why it is known as a Logical Volume. We can create multiple Logical Volumes from a single VG. (A quick way to inspect these layers on a live system is sketched below.)
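
Before creating anything, it helps to see how these layers appear on a live Linux system. The commands below are a read-only sketch; they assume the LVM tools (the lvm2 package) are installed.

# Inspect the storage layers from the bottom up
lsblk    # disks, partitions and logical volumes as a tree
pvs      # physical volumes and the VG each one belongs to
vgs      # volume groups with their total and free size
lvs      # logical volumes carved out of each VG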

What is Hadoop?

To solve the problem of big data, we use distributed storage and one such software that helps in implementing distributed storage is Apache Hadoop.

There are various components in Hadoop, one of them is the Hadoop Distributed File System (HDFS) which is the storage unit of Hadoop.

Hadoop HDFS works on the master-slave model. In this model, several systems (slaves/data nodes) contribute their storage to a central system (master/name node), which maintains the metadata and manages the data nodes.

Note: In the commands, whatever is written in "<>" is not an actual part of the command; I have used it to indicate what would come at that place.

Now let’s move to the integration of LVM with Hadoop HDFS.

To do this, we are going to follow these steps:

  • Installation of Hadoop

I have installed Hadoop on a RHEL 8 system running in Oracle VirtualBox, using Java version 1.8 and Hadoop version 1.2.

  • Configuring Name Node

To configure the Name Node, we first find its IP using the 'ifconfig' command. In my case, the Name Node's IP is '192.168.43.99'.

We create a folder in ‘/’. I have created a directory with the name ‘/nn’ using the ‘mkdir /nn’ command.

Then we go into the '/etc/hadoop' directory, and in the 'hdfs-site.xml' file we give the name of the directory, as shown below.

‘hdfs-site.xml’ file
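
Since the picture is not reproduced here, the following is a minimal sketch of what that file contains for Hadoop 1.x. The directory '/nn' comes from the step above; 'dfs.name.dir' is the Hadoop 1.x property for the Name Node's metadata directory. This sketch overwrites the whole file, so adapt it if you keep other properties.

# Sketch: write the Name Node storage directory into hdfs-site.xml
cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
EOF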

In the same directory, we go into 'core-site.xml' and tell the system that it is a Name Node, along with the IP of the Name Node and the port number and protocol to be used, as in the sketch below.
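
Here is a matching sketch of the core-site.xml entry for Hadoop 1.x; 'fs.default.name' is the Hadoop 1.x property, the IP is the Name Node's IP found above, and the port 9001 is only an example value, so use whatever port you configured.

# Sketch: point core-site.xml at this Name Node (port 9001 is an example)
cat > /etc/hadoop/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.43.99:9001</value>
  </property>
</configuration>
EOF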

Note: If we are configuring for the first time, then we also need to format the Name Node using the "hadoop namenode -format" command. This command needs to be run only once.

We can start the Name Node using the “hadoop-daemon.sh start namenode” command.
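
Putting the Name Node commands together with a quick check (the 'jps' tool ships with the JDK and simply lists running Java processes):

# One-time format of the Name Node metadata directory
hadoop namenode -format

# Start the Name Node daemon and confirm it is running
hadoop-daemon.sh start namenode
jps    # should list a NameNode process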

  • Creating Logical Volume for the Data Node

Firstly, I attached two hard disks of size 10 GiB and 20 GiB to my VM by following these steps:

Go to Settings > click on Storage > click on the Controller > choose the 'Add hard disk' icon > click Create > choose VDI > next, choose 'Dynamically allocated' > next, give the name and size along with the location of the new hard disk > click Create.
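
If you prefer the command line over the VirtualBox GUI, the same disks can be created and attached from the host with VBoxManage. This is only an alternative sketch: the VM name 'rhel8-datanode' and the controller name 'SATA' are placeholders that must match your VM's actual settings, and the sizes are given in MB.

# Create two dynamically allocated VDI disks (10 GiB and 20 GiB)
VBoxManage createmedium disk --filename datanode-disk1.vdi --size 10240 --variant Standard
VBoxManage createmedium disk --filename datanode-disk2.vdi --size 20480 --variant Standard

# Attach them to the (powered-off) VM; names are placeholders
VBoxManage storageattach "rhel8-datanode" --storagectl "SATA" --port 1 --device 0 --type hdd --medium datanode-disk1.vdi
VBoxManage storageattach "rhel8-datanode" --storagectl "SATA" --port 2 --device 0 --type hdd --medium datanode-disk2.vdi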

Creating PV

Then I created PV using the commands:

“pvcreate /dev/sdb” and “pvcreate /dev/sdc”.

To check the details of the respective PVs, we can use: “pvdisplay /dev/sdb” and “pvdisplay /dev/sdc”.

Then we create the VG using “vgcreate <VG_name> <PV_name> <PV_name>” and then we can see the details using “vgdisplay <VG_name>”.

Creating VG with both the hard disks.
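
Putting those commands together, a minimal run looks like this; the volume group name 'hadoop_vg' is only a placeholder I am using for illustration.

# Initialize both new disks as physical volumes
pvcreate /dev/sdb
pvcreate /dev/sdc

# Combine them into one volume group (the name is a placeholder)
vgcreate hadoop_vg /dev/sdb /dev/sdc
vgdisplay hadoop_vg    # total size should be roughly 10 GiB + 20 GiB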
Creating LV of size 12G

Next, we move on to creating the LV, which we can do using the "lvcreate" command, whose syntax is shown below.
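
A minimal sketch of that command, using the 12 GiB size mentioned above and a placeholder LV name 'hadoop_lv':

# Carve a 12 GiB logical volume out of the volume group
lvcreate --size 12G --name hadoop_lv hadoop_vg
lvdisplay /dev/hadoop_vg/hadoop_lv    # verify the new LV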

Formatting the LV and mounting it

To format the LV, we can choose from various filesystem types; I have chosen 'ext4', so I have used the "mkfs.ext4 <Path_of_LV>" command.

After we format the LV, we can mount it on the desired directory using “mount <Path_of_LV> <directory>”.

Note: A mount is temporary in nature; when we restart the system, we have to mount it again. So, we can add an entry in the '/etc/fstab' file to make the mount permanent.
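
For example, assuming the placeholder names used above and a data node directory '/dn' (also a placeholder), the /etc/fstab entry would look roughly like this; verify it with 'mount -a' before rebooting.

# /etc/fstab entry: device  mount-point  fstype  options  dump  pass
/dev/mapper/hadoop_vg-hadoop_lv  /dn  ext4  defaults  0 0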

To check if the LV is mounted, we can run the “df -h” command.

  • Configuring Data Node

We put the name of the directory on which the LV is mounted into the /etc/hadoop/hdfs-site.xml file of the data node, as shown below.

The hdfs-site.xml file of the data node.
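
As a sketch for Hadoop 1.x (the mount point '/dn' is the placeholder directory from earlier; 'dfs.data.dir' is the Hadoop 1.x property for the Data Node's storage directory):

# Sketch: point dfs.data.dir at the directory where the LV is mounted
cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
EOF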

In the core-site.xml file, we have to tell the system which node is the master by giving the master's IP along with the protocol and port number to use. The syntax for this is shown below.

Core-site.xml file of the data node.
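
This is the same entry we used on the Name Node, sketched here again with the example port 9001:

# Sketch: tell the Data Node where the master (Name Node) is
cat > /etc/hadoop/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.43.99:9001</value>
  </property>
</configuration>
EOF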

We then start the Data Node using the "hadoop-daemon.sh start datanode" command.

We can then run the "hadoop dfsadmin -report" command to check whether the data node is contributing its storage, along with its details.

Now, let’s move on to the Extension and Reduction in the size of the LV (which is present in the Data node)

Extending LV Size

If there is free space available in the VG from which the LV is created, then we can increase the size of the LV, by using two commands which are:

  • "lvextend --size +<size>G /dev/mapper/<VG_name>-<LV_name>".

After we run the above command, only the original part of the LV is formatted; the newly added part is not. So, to extend the filesystem over the remaining (newly added) part, I have used the command below:

  • resize2fs /dev/mapper/<VG_name>-<LV_name>
Commands to extend the size of LV
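
With the placeholder names used earlier, extending the LV by, say, 4 GiB looks roughly like this (ext4 can be grown online, so the LV can stay mounted):

# Grow the LV by 4 GiB using free space left in the volume group
lvextend --size +4G /dev/mapper/hadoop_vg-hadoop_lv

# Grow the ext4 filesystem over the newly added space
resize2fs /dev/mapper/hadoop_vg-hadoop_lv

# The data node's contributed storage should now be larger
hadoop dfsadmin -report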

After running this, when we check the contributed storage of the slaves (using 'hadoop dfsadmin -report'), we will see that the size has increased.

Contributed storage after the size of LV increased.

Reducing LV Size

Firstly, we need to stop contributing the storage to the Name Node, which we can do by stopping the data node using “hadoop-daemon.sh stop datanode”.

Stopping datanode.

To reduce the size of LV, we have to follow these steps:

  • First, we need to unmount the LV so that nothing new can be written to it at the time of reduction; we can do this using the command: "umount <Directory_on_which_LV_is_mounted>".
Unmounting and checking if it is unmounted
  • Then we have to scan and clean the inode table (the table maintaining the information about the files) to remove garbage or unwanted mappings. To do this we use: "e2fsck -f /dev/<VG_name>/<LV_name>".
  • Next, we shrink the filesystem to the new size using the command: "resize2fs /dev/mapper/<VG_name>-<LV_name> <size_after_reduction>G".
  • Now, we reduce the size of the LV itself, using "lvreduce --size <size_after_reduction>G /dev/mapper/<VG_name>-<LV_name>".
  • Then, we mount the LV on the directory again. The whole sequence is sketched below.
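
Putting the reduction steps together with the placeholder names, the directory '/dn' and a target size of 8 GiB (both only illustrative):

# 1. Unmount so nothing is written during the reduction
umount /dn

# 2. Check the filesystem and its inode table for errors
e2fsck -f /dev/hadoop_vg/hadoop_lv

# 3. Shrink the filesystem to the target size first
resize2fs /dev/mapper/hadoop_vg-hadoop_lv 8G

# 4. Then shrink the logical volume to the same size
lvreduce --size 8G /dev/mapper/hadoop_vg-hadoop_lv

# 5. Mount the LV back on the data node directory
mount /dev/mapper/hadoop_vg-hadoop_lv /dn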
Checking if the mount is done and starting the data node.

Lastly, we need to start the data node service again, and as we do, we will see that the reduced storage size is now contributed to the name node.

Checking the reduction in the size of contributed storage.

By extending and reducing the size of the LV (which backs the data node's storage directory), we are providing elasticity to the data node storage.

Thus we have successfully integrated LVM with Hadoop and provided elasticity to the data node storage!

Thank you for reading my blog🤗
