
Loading data from a local machine to HDFS
In this recipe, we are going to load data from a local machine's disk to HDFS.
Getting ready
To perform this recipe, you should have a Hadoop cluster already up and running.
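If you want to confirm that the cluster is up before copying any data, you can (assuming the Hadoop binaries are on your PATH) list the running daemons and print a cluster report:
jps
hdfs dfsadmin -report
jps should show the NameNode and DataNode processes, and dfsadmin -report prints the total capacity and the number of live DataNodes.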
How to do it...
Performing this recipe is as simple as copying data from one folder to another. There are a couple of ways to copy data from the local machine to HDFS.
- Using the copyFromLocal command: to copy the file to HDFS, let's first create a directory on HDFS and then copy the file into it. Here are the commands to do this:
hadoop fs -mkdir /mydir1
hadoop fs -copyFromLocal /usr/local/hadoop/LICENSE.txt /mydir1
- Using the put command: again, we will first create the directory, and then put the local file into HDFS:
hadoop fs -mkdir /mydir2
hadoop fs -put /usr/local/hadoop/LICENSE.txt /mydir2
You can validate that the files have been copied to the correct folders by listing the files:
hadoop fs -ls /mydir1
hadoop fs -ls /mydir2
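For local files the two commands behave almost identically; put is slightly more general in that it accepts multiple source files and can also read from standard input. As an optional variation (the exact option support depends on your Hadoop version), you can overwrite an existing destination file with -f, or request a different replication factor for the copy by passing a -D property before the command:
hadoop fs -put -f /usr/local/hadoop/LICENSE.txt /mydir2
hadoop fs -D dfs.replication=2 -put /usr/local/hadoop/LICENSE.txt /mydir2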
How it works...
When you use the HDFS copyFromLocal or put command, the following things occur:
- First of all, the HDFS client (the command prompt, in this case) contacts the NameNode because it needs to copy the file to HDFS.
- The NameNode asks the client to break the file into chunks of the configured cluster block size. In Hadoop 2.X, the default block size is 128 MB.
- Based on the capacity and availability of space on the DataNodes, the NameNode decides where these blocks should be copied.
- The client then starts copying data to the specified DataNodes, one block at a time; the blocks are copied sequentially, one after another.
- Each block is sent to the DataNode in small packets (64 KB by default). A checksum is sent with every packet; once a packet has been copied, the checksum is verified to confirm that the data matches. The packets are then forwarded to the next DataNode, where the block is replicated.
- The HDFS client is responsible for copying the data only to the first DataNode; replication is taken care of by the respective DataNodes, so the data block is pipelined from one DataNode to the next.
- While the block copying and replication are taking place, the DataNodes report back and the metadata of the file is updated on the NameNode.
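If you want to observe this behaviour on your own cluster, you can (assuming you are permitted to run fsck) inspect how the copied file was split into blocks and where the replicas were placed, and check the configured default block size:
hdfs fsck /mydir1/LICENSE.txt -files -blocks -locations
hdfs getconf -confKey dfs.blocksize
The fsck output lists each block of the file along with the DataNodes holding its replicas, as well as the replication factor and block size used for the file; getconf prints the default block size in bytes (134217728 for 128 MB).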