Apache Sqoop – Import data to HDFS

Hello everyone. In this article, we will explore the Sqoop import command, which is used to transfer data from an RDBMS to a Hadoop cluster. The Sqoop import tool transfers one table at a time, and we can choose the file format in which the data is stored in HDFS. There is also an import-all-tables variant of this tool (sqoop import-all-tables) which imports all the tables of a database to HDFS.

Sqoop Import

The import command is used to import a single table from a database into HDFS. Let's look at it with an example.
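For illustration, here is a minimal sketch of such a command; the JDBC connection string, database name ("company"), and credentials are assumptions and will differ in your environment.

sqoop import \
  --connect jdbc:mysql://localhost:3306/company \
  --username root \
  --password mypassword \
  --table employees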

This Sqoop command transfers the "employees" table from MySQL to HDFS. In the user's home directory on HDFS, a directory named "employees" is created which holds all of the transferred data.

Importing table to HDFS using Sqoop Import

Sqoop Import Arguments

We can change the name of the directory into which Sqoop imports data in HDFS by specifying the --target-dir argument. Alternatively, we can keep the output directories of all our import jobs under a single location by specifying a warehouse directory path with --warehouse-dir.

Remember that you cannot specify the target directory and the warehouse directory in the same import; Sqoop treats --target-dir and --warehouse-dir as mutually exclusive options.
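As a sketch (the connection details and HDFS paths are assumptions), the two options look like this, used in separate imports:

sqoop import \
  --connect jdbc:mysql://localhost:3306/company \
  --username root --password mypassword \
  --table employees \
  --target-dir /data/employees_import

sqoop import \
  --connect jdbc:mysql://localhost:3306/company \
  --username root --password mypassword \
  --table employees \
  --warehouse-dir /data/sqoop_imports

With --target-dir, the data lands exactly at /data/employees_import. With --warehouse-dir, Sqoop creates a subdirectory per table, so this job writes to /data/sqoop_imports/employees, and any other import pointed at the same warehouse directory gets its own subdirectory next to it.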

We can see that the output has been split into 4 parts. This is because, by default, the Sqoop MapReduce job launches 4 mappers. Sqoop divides the data evenly across those 4 mappers so that each mapper carries roughly the same load. To decide how to split the data, Sqoop uses the primary key of the table. Suppose emp_id is the primary key of this employees table and it has 1000 values; Sqoop will then split the table among the 4 mappers so that each mapper transfers about 250 records.

We can change the number of mappers in the Sqoop command by passing the argument -m (or --num-mappers) followed by the number of mappers we want. It is tempting to specify a large number of mappers, but this may consume all the resources on the RDBMS and the Hadoop cluster, which in turn slows down every other operation.
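Under the hood, Sqoop runs a boundary query along the lines of SELECT MIN(emp_id), MAX(emp_id) FROM employees and divides that range into one slice per mapper; the split column can also be chosen explicitly with --split-by. A sketch using 2 mappers (the connection details and the emp_id column name are assumptions):

sqoop import \
  --connect jdbc:mysql://localhost:3306/company \
  --username root --password mypassword \
  --table employees \
  --split-by emp_id \
  -m 2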

We can also specify the file format in which we want to store the data in HDFS. The import tool supports four file formats: plain text (the default), SequenceFile, Avro data file, and Parquet.

We can also specify the compression setting while importing data to HDFS.
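The format is selected with --as-textfile (the default), --as-sequencefile, --as-avrodatafile, or --as-parquetfile, and compression is enabled with --compress (or -z), optionally together with --compression-codec. As a sketch combining a non-default format with an explicit codec (the connection details, target path, and codec choice are assumptions):

sqoop import \
  --connect jdbc:mysql://localhost:3306/company \
  --username root --password mypassword \
  --table employees \
  --target-dir /data/employees_seq \
  --as-sequencefile \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.GzipCodec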

Let us import the "employees" table again, this time into another directory in Parquet format, using some of the parameters mentioned above.
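A sketch of such a command (the connection details and target path are assumptions):

sqoop import \
  --connect jdbc:mysql://localhost:3306/company \
  --username root --password mypassword \
  --table employees \
  --target-dir /data/employees_parquet \
  --as-parquetfile \
  --compress \
  -m 6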

We have used the Parquet file format with compression, and we are running 6 mappers for this job. We can see that in the following output.

Importing table to HDFS with compression using Sqoop Import

This is how we can use the Sqoop import tool to transfer a single table to HDFS. There are still some important parameters of the Sqoop import command left to cover; we will look at those in the next article.
