Creating elastic storage for Hadoop data node using LVM

Rishi Agrawal
5 min read · Mar 16, 2021

This blog is about connecting LVM to Hadoop DataNode storage to provide elasticity, i.e. we can change the size of the storage on the fly and see the effect reflected in the DataNode's reported capacity.

This blog is part of Task 7.1(a) given by Vimal Daga sir; the problem statement is as follows:

ARTH — Task 7.1(A) 👨🏻‍💻

Task Description 📄

Integrating LVM with Hadoop and providing Elasticity to DataNode Storage

Let's start by understanding what LVM is.

LVM stands for Logical Volume Management. It is a tool used to create logical storage from a volume group, and it allows the size of a logical volume to be changed online, without stopping the DataNode.

With LVM, a hard drive or set of hard drives is allocated to one or more physical volumes. These physical volumes are pooled into a volume group, which may span two or more disks.

We will now see how to set up LVM in detail and integrate it with the Hadoop DataNode.

What are Hadoop and a Hadoop DataNode?

Hadoop is a framework used to solve the problem of big data through distributed computing. It is built on Java, and in this blog I am using Apache Hadoop v1. Here, we are going to integrate its distributed storage with LVM.

Hadoop works as a cluster in a master-slave configuration, and the slave node in Hadoop is called the DataNode. The DataNodes provide all the storage to the master node, while the master node only keeps the index of where the data is stored.

I have already launched my single-DataNode cluster, and I am not covering that setup here, as it is not the focus of this blog.

System Setup

I am using a Hadoop cluster with one DataNode, installed on VMware. I have attached two extra 20 GB virtual hard disks to my DataNode VM, and the network adapter is set to bridged mode.

Setting up LVM!

Setting up LVM is easy; just follow these steps.

Step 1: Make sure the hard disks are connected and run this command:

[root@localhost ~]# fdisk -l

The newly attached hard disks appear in the output; note their device paths. For me they are /dev/nvme0n2 and /dev/nvme0n3.
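As an optional cross-check, lsblk also lists the block devices in a more compact view, along with their sizes and mount points:

[root@localhost ~]# lsblk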

Step 2: To create a logical volume we first need a volume group, and for that we first need to create the physical volumes (PV). The command used to create a PV is:

[root@localhost ~]# pvcreate /dev/nvme0n2

The details of the created PV can be checked with this command:

[root@localhost ~]# pvdisplay /dev/nvme0n2

I have run these commands for the first disk, but do the same for both disks, as shown below.
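For the second disk the commands are the same, just with its device path:

[root@localhost ~]# pvcreate /dev/nvme0n3
[root@localhost ~]# pvdisplay /dev/nvme0n3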

Step 3: Now that the PVs are created, we can create the volume group (VG) and add the PVs to it with this command:

[root@localhost ~]# vgcreate myvg1 /dev/nvme0n2 /dev/nvme0n3

This command creates a volume group named myvg1 spanning both physical volumes.

To display the details of the volume group we can use this command:

[root@localhost ~]# vgdisplay myvg1

Step 4: From the output above we can see that we have created a volume group of 40 GiB, and now we are going to create a 10 GiB logical volume from it. The commands are as follows:

[root@localhost ~]# lvcreate --size 10G --name mylv1 myvg1
[root@localhost ~]# lvdisplay myvg1/mylv1

The first command creates the logical volume and the second displays its details.

Step 5: The logical volume is created, so it's time to format it. To format it with an ext4 filesystem, just run this command:

[root@localhost ~]# mkfs.ext4 /dev/myvg1/mylv1

Step 6: Everything is done! Now we just need to mount it. I am mounting it on the /datanode directory:

[root@localhost ~]# mkdir /datanode
[root@localhost ~]# mount /dev/myvg1/mylv1 /datanode
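To confirm the mount, df -h should now show the logical volume mounted on /datanode with roughly 10 GiB of capacity:

[root@localhost ~]# df -h /datanode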

All set! Let's integrate it with the DataNode storage.

To integrate it with the DataNode storage, I am using the /datanode directory as the DataNode's data directory.
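In Hadoop v1 the DataNode's storage directory is set through the dfs.data.dir property in hdfs-site.xml. A minimal sketch of that entry is shown below; the exact location of the config file depends on your installation (commonly $HADOOP_HOME/conf/hdfs-site.xml):

<property>
    <name>dfs.data.dir</name>
    <value>/datanode</value>
</property>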

Now let's start the DataNode service and head over to the main practical.
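On a Hadoop v1 node this is typically done with the hadoop-daemon.sh script, and jps can confirm that the DataNode process is running (a sketch, assuming the Hadoop binaries are on the PATH):

[root@localhost ~]# hadoop-daemon.sh start datanode
[root@localhost ~]# jps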

And it's connected! Let's also open the web console; in Hadoop v1 the NameNode web UI is served on port 50070 by default.

Now that the Hadoop cluster is running fine, let's change the size of mylv1 from 10 GiB to 30 GiB and see whether the change is reflected in the dfsadmin report and the web UI.

To extend the volume, use the following commands. (Note: extending the logical volume alone is not enough; the ext4 filesystem also has to be grown over the newly added space with the resize2fs tool.)

[root@localhost ~]# lvextend --size +20G /dev/myvg1/mylv1
[root@localhost ~]# resize2fs /dev/myvg1/mylv1

The size of mylv1 has increased from 10 GiB to 30 GiB, and df -h shows the new size too. Now for the exciting part: running hadoop dfsadmin -report to see whether the change shows up on the fly. And the results:
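For reference, these are the two commands used for that check:

[root@localhost ~]# df -h /datanode
[root@localhost ~]# hadoop dfsadmin -report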

It changed on the fly, just as we wanted. Yay!

Use Case

Okay, I have successfully demonstrated how to do it. But what is the real use case?

The major use case of this concept is in servers such as the data centers of enterprises like Facebook. They have to increase storage on the fly to accommodate new data, and they cannot shut down the server, as that would affect the uptime of the application.

Hope you have enjoyed it! Thanks a lot for reading.

If you have any issues or suggestions, please let me know in the comments!
