Providing Data Node Elasticity in Hadoop using LVM

Gaius Reji
Published in Analytics Vidhya · Dec 6, 2020 · 6 min read
When setting up a Hadoop cluster on AWS, launching more instances as data nodes increases the total storage available. But what if we need to increase the space provided by an individual data node? This is where the Logical Volume Manager (LVM) comes in.

LVM is used for the following purposes:

  • Creating single logical volumes of multiple physical volumes or entire hard disks (somewhat similar to RAID 0, but more similar to JBOD), allowing for dynamic volume resizing.
  • Managing large hard disk farms by allowing disks to be added and replaced without downtime or service disruption, in combination with hot swapping.
  • On small systems (like a desktop), instead of having to estimate at installation time how big a partition might need to be, LVM allows filesystems to be easily resized as needed.
  • Performing consistent backups by taking snapshots of the logical volumes.
  • Encrypting multiple physical partitions with one password.

LVM can be considered as a thin software layer on top of the hard disks and partitions, which creates an abstraction of continuity and ease-of-use for managing hard drive replacement, repartitioning and backup.
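Before diving in, here is the LVM workflow this article follows, condensed into one place (a rough sketch; /dev/xvdf, the volume group and logical volume names, the 5G size and the /dn mount point are simply the values used later in this walkthrough):

pvcreate /dev/xvdf                         # register the raw disk as a physical volume
vgcreate vgdata /dev/xvdf                  # pool one or more physical volumes into a volume group
lvcreate --size 5G --name lvdata1 vgdata   # carve a logical volume out of the group
mkfs.ext4 /dev/vgdata/lvdata1              # put a filesystem on the logical volume
mount /dev/vgdata/lvdata1 /dn              # mount it wherever the extra space is needed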

Integrating LVM and Hadoop

Let’s cut to the chase: I’ll be using a simple two-node setup (one name node and one data node) for my Hadoop cluster. They’re both AWS EC2 instances that I’ll be working on over an SSH client (PuTTY). Though I’ve personally used AWS instances, this concept can be implemented on local VMs or any other cloud-based compute service as well. Run the hadoop dfsadmin -report command (on either the name node or the data node) to check whether your data node is connected.

[root@ip-172-31-38-8 ~]# hadoop dfsadmin -report
.
.
.
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)
Name: 65.0.76.100:50010
Decommission Status : Normal
Configured Capacity: 10724814848 (9.99 GB)
DFS Used: 8192 (8 KB)
Non DFS Used: 1739907072 (1.62 GB)
DFS Remaining: 8984899584(8.37 GB)
DFS Used%: 0%
DFS Remaining%: 83.78%
Last contact: Sat Dec 05 21:34:15 UTC 2020

Adding Physical Volumes

Now I’ll create a 10 GiB volume in AWS and attach it to my data node. If you’re doing this on a local VM, just add an additional virtual disk of the required size to the VM that’s running your data node.

Attaching Volume to Data Node
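If you prefer the AWS CLI over the web console, creating and attaching the volume looks roughly like this (a sketch; the availability zone, volume ID and instance ID are placeholders for your own values):

aws ec2 create-volume --size 10 --volume-type gp2 --availability-zone ap-south-1a
aws ec2 attach-volume --volume-id vol-xxxxxxxxxxxxxxxxx --instance-id i-xxxxxxxxxxxxxxxxx --device /dev/xvdf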

Once the device is attached, we can open the terminal of our data node and proceed to create LVM volumes on the added disk, format the resulting logical volume with the right filesystem, and finally mount it over a directory.

For the above-mentioned steps, make sure LVM is installed on your system. If not, install it using yum.

[root@ip-172-31-38-85 ~]# yum install lvm2

Creating a physical volume

List your attached disk devices using fdisk -l

[root@ip-172-31-38-85 ~]# fdisk -l
Disk /dev/xvda: 10 GiB, 10737418240 bytes, 20971520 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 246B752E-8CB4-41E7-B9B1-365A93ACF890
Device Start End Sectors Size Type
/dev/xvda1 2048 4095 2048 1M BIOS boot
/dev/xvda2 4096 20971486 20967391 10G Linux filesystem
Disk /dev/xvdf: 10 GiB, 10737418240 bytes, 20971520 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

The /dev/xvdf disk is our newly attached disk. We’ll now create a physical volume on that disk using the pvcreate command. You can display the created physical volume using the pvdisplay command. You can repeat this on additional disks to create multiple physical volumes.

[root@ip-172-31-38-85 ~]# pvcreate /dev/xvdf
Physical volume "/dev/xvdf" successfully created.
[root@ip-172-31-38-85 ~]# pvdisplay /dev/xvdf
"/dev/xvdf" is a new physical volume of "10.00 GiB"
--- NEW Physical volume ---
PV Name /dev/xvdf
VG Name
PV Size 10.00 GiB
Allocatable NO
PE Size 0
Total PE 0
Free PE 0
Allocated PE 0
PV UUID jVlAEE-ATVj-b0At-8IHp-zoyy-IXtv-Mvhcnu
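As a side note, the pvs, vgs and lvs commands print compact one-line summaries of physical volumes, volume groups and logical volumes respectively, which is handy for quick checks alongside the more verbose pvdisplay/vgdisplay/lvdisplay output:

pvs   # one line per physical volume
vgs   # one line per volume group
lvs   # one line per logical volume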

Creating a Volume Group

We can create multiple physical volumes, and then combine them to create a volume group. To create a volume group, use vgcreate <vgname> <pv1> <pv2> … and vgdisplay <vgname> to display the volume group.

[root@ip-172-31-38-85 ~]# vgcreate vgdata /dev/xvdf
Volume group "vgdata" successfully created
[root@ip-172-31-38-85 ~]# vgdisplay vgdata
--- Volume group ---
VG Name vgdata
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 1
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 0
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size <10.00 GiB
PE Size 4.00 MiB
Total PE 2559
Alloc PE / Size 0 / 0
Free PE / Size 2559 / <10.00 GiB
VG UUID FLipJ0-kzBs-lW2v-IvjG-2HxN-1U1F-HSJwmj

Creating and Mounting Logical Volumes

We can now create logical volumes from our volume group using the command lvcreate --size <size> --name <lvname> <vgname>. Here, I’m creating a logical volume of size 5 GiB and displaying it using lvdisplay <vgname>/<lvname>.

[root@ip-172-31-38-85 ~]# lvcreate --size 5G --name lvdata1 vgdata
Logical volume "lvdata1" created.
[root@ip-172-31-38-85 ~]# lvdisplay vgdata/lvdata1
--- Logical volume ---
LV Path /dev/vgdata/lvdata1
LV Name lvdata1
VG Name vgdata
LV UUID 1C08aW-rTGn-CTjz-UiGg-ksRM-79FK-7WWVNS
LV Write Access read/write
LV Creation host, time ip-172-31-38-85.ap-south-1.compute.internal, 2020-12-05 23:38:10 +0000
LV Status available
# open 0
LV Size 5.00 GiB
Current LE 1280
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 8192
Block device 253:0

Next, let’s format the logical volume with the ext4 filesystem. For this we use the command mkfs.ext4 <LV Path>.

[root@ip-172-31-38-85 ~]# mkfs.ext4 /dev/vgdata/lvdata1
mke2fs 1.45.6 (20-Mar-2020)
Creating filesystem with 1310720 4k blocks and 327680 inodes
Filesystem UUID: cac48375-3f2f-4c81-9b53-4336043bf423
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736
Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

We generally create a new directory to mount a formatted volume on, but in this case, since we want to make our data node’s storage elastic, we will mount the volume over the data node directory itself. You can verify the location of your data node directory in the hdfs-site.xml file in your Hadoop configuration directory.

[root@ip-172-31-38-85 ~]# cat /etc/hadoop/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/dn</value>
</property>
</configuration>

In my case, the data node directory is /dn. So I’ll mount the volume to this directory using the mount command mount /dev/mapper/<vgname>-<lvname> <mountpoint>.
Confirm the mount by using the command df -h.

[root@ip-172-31-38-85 ~]# mount /dev/mapper/vgdata-lvdata1 /dn
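A mount done this way does not survive a reboot. If you want the data node directory to stay mounted across restarts, one option is to add an entry like the following to /etc/fstab (a sketch; adjust the device path and mount point to your setup):

/dev/mapper/vgdata-lvdata1  /dn  ext4  defaults  0  0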

Extending the Logical Volume

We can further add storage capacity to our logical volume from the volume group without having to format the already mounted logical volume. This is done by using two commands: lvextend and resize2fs.

We use lvextend --size +<size> <devicepath> to extend the logical volume.

[root@ip-172-31-38-85 ~]# lvextend --size +2G /dev/mapper/vgdata-lvdata1
Size of logical volume vgdata/lvdata1 changed from 5.00 GiB (1280 extents) to 7.00 GiB (1792 extents).
Logical volume vgdata/lvdata1 successfully resized.

To make the extended space usable, we resize the filesystem online using resize2fs <devicepath>.

[root@ip-172-31-38-85 hadoop]# resize2fs /dev/mapper/vgdata-lvdata1
resize2fs 1.45.6 (20-Mar-2020)
Filesystem at /dev/mapper/vgdata-lvdata1 is mounted on /dn; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 1
The filesystem on /dev/mapper/vgdata-lvdata1 is now 1835008 (4k) blocks long.

Finally run the df -h command to verify the extended volume.

[root@ip-172-31-38-85 hadoop]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 378M 0 378M 0% /dev
tmpfs 403M 0 403M 0% /dev/shm
tmpfs 403M 11M 393M 3% /run
tmpfs 403M 0 403M 0% /sys/fs/cgroup
/dev/xvda2 10G 1.7G 8.4G 17% /
tmpfs 81M 0 81M 0% /run/user/1000
/dev/mapper/vgdata-lvdata1 6.9G 23M 6.5G 1% /dn

Now on running the hadoop dfsadmin -report command, we see that the storage contributed by our data node is now around 6.8 GB.

[root@ip-172-31-38-8 ~]# hadoop dfsadmin -report
.
.
.
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)
Name: 13.232.125.253:50010
Decommission Status : Normal
Configured Capacity: 7331110912 (6.83 GB)
DFS Used: 45056 (44 KB)
Non DFS Used: 405405696 (386.62 MB)
DFS Remaining: 6925660160(6.45 GB)
DFS Used%: 0%
DFS Remaining%: 94.47%
Last contact: Sun Dec 06 21:08:07 UTC 2020

Conclusion

We can attach more physical volumes, add them to the volume group, and further extend the logical volume mounted on our data node directory, making the data node storage resizable on the go. This is how we can provide data node elasticity in Hadoop using LVM.
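For example, growing the data node further with a second attached disk would look roughly like this (a sketch assuming the new disk appears as /dev/xvdg; the +5G increment is arbitrary):

pvcreate /dev/xvdg                               # new physical volume from the added disk
vgextend vgdata /dev/xvdg                        # grow the volume group with it
lvextend --size +5G /dev/mapper/vgdata-lvdata1   # grow the logical volume mounted on /dn
resize2fs /dev/mapper/vgdata-lvdata1             # grow the filesystem online, no unmount needed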

Hope this helped :)
