Hadoop Real-World Solutions Cookbook(Second Edition)
上QQ阅读APP看书,第一时间看更新

Adding new nodes to existing Hadoop clusters

Sometimes, it may happen that an existing Hadoop cluster's capacity is not adequate enough to handle all the data you may want to process. In this case, you can add new nodes to the existing Hadoop cluster without any downtime for the existing cluster. Hadoop supports horizontal scalability.

Getting ready

To perform this recipe, you should have a Hadoop cluster running. Also, you will need one more machine. If you are using AWS EC2, then you can launch an EC2 instance that's similar to what we did in the previous recipes. You will also need the same security group configurations in order to make the installation process smooth.

How to do it...

To add a new instance to an existing cluster, simply install and configure Hadoop the way we did for the previous recipe. Make sure that you put the same configurations in core-site.xml and yarn-site.xml, which will point to the correct master node.

Once all the configurations are done, simply execute commands to start the newly added datanode and nodemanager:

/usr/local/hadoop/sbin/hadoop-daemon.sh start datanode
/usr/local/hadoop/sbin/yarn-daemon.sh start nodemanager

If you take a look at the cluster again, you will find that the new node is registered. You can use the dfsadmin command to take a look at the number of nodes and amount of capacity that's been used:

hdfs dfsadmin -report

Here is a sample output for the preceding command:

How to do it...

How it works...

Hadoop supports horizontal scalability. If the resources that are being used are not enough, we can always go ahead and add new nodes to the existing cluster without hiccups. In Hadoop, it's always the slave that reports to the master. So, while making configurations, we always configure the details of the master and do nothing about the slaves. This architecture helps achieve horizontal scalability as at any point of time, we can add new nodes by only providing the configurations of the master, and everything else is taken care of by the Hadoop cluster. As soon as the daemons start, the master node realizes that a new node has been added and it becomes part of the cluster.