Apache Sqoop – Import data to HDFS
Hello everyone. In this article, we will explore the Sqoop import command, which is used to transfer data from an RDBMS to a Hadoop cluster. The Sqoop import tool transfers one table at a time, and we can choose the file format in which the data will be stored in HDFS. There is also an import-all-tables variant of this tool, which imports all of the tables in a database to HDFS.
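As a quick illustration of that variant, here is a minimal sketch of an import-all-tables run; the connection string, credentials, and the all_tables warehouse directory are assumptions for this example, not values from the article.
sqoop import-all-tables \
--connect jdbc:mysql://localhost/employees \
--username root -P \
--warehouse-dir all_tables
Each table in the database would land in its own subdirectory under all_tables in HDFS.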
Sqoop Import
The import command imports a single table from the database into HDFS. Let's look at an example.
sqoop import \
--connect jdbc:mysql://localhost/employees \
--username root -P \
--table employees
This Sqoop command transfers the “employees” table from MySQL to HDFS. In the user's home path on HDFS, a directory named “employees” is created, which holds all of the transferred data.
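To verify the import, we can list that directory with the HDFS shell. This is just a sanity check, assuming the default home path of /user/<username> and the default text output files:
hdfs dfs -ls employees
hdfs dfs -cat employees/part-m-00000 | head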
Sqoop Import Arguments
We can change the name of the directory where Sqoop imports data in HDFS. We can also keep the directories of all import jobs in a single location by specifying a warehouse directory path.
--target-dir      # destination directory in HDFS
--warehouse-dir   # HDFS parent directory for the imported table's directory
Remember, you cannot specify the target directory and the warehouse directory in the same import; the two arguments are incompatible.
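For example, a run like the following (a sketch; the sqoop_imports path is an assumption) would place the data under sqoop_imports/employees instead of directly under the home directory:
sqoop import \
--connect jdbc:mysql://localhost/employees \
--username root -P \
--table employees \
--warehouse-dir sqoop_imports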
If we list the output directory, we can see that the output has been split into 4 parts. This is because, by default, the Sqoop MapReduce job launches 4 mappers. Sqoop divides the data equally among those 4 mappers so that every mapper carries the same load. Sqoop uses the primary key of the table to decide how to split the data across mappers. Suppose emp_id is the primary key of this employees table and it has 1000 values; Sqoop will then split the table among the 4 mappers so that each mapper transfers 250 records.
We can change the number of mappers in the Sqoop command by passing the argument '-m' followed by the number of mappers we want. It is tempting to specify a large number of mappers, but this may consume all the resources on the RDBMS and the Hadoop cluster, which in turn slows down all other operations.
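As a sketch, the following run uses 2 mappers and names the split column explicitly with --split-by, which is also how we would handle a table that has no primary key. The emp_id column and the employees_2m target directory are assumptions for this example:
sqoop import \
--connect jdbc:mysql://localhost/employees \
--username root -P \
--table employees \
--split-by emp_id \
-m 2 \
--target-dir employees_2m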
We can also specify the file format in which we want to store the data in HDFS. The import tool supports 4 file formats.
--as-avrodatafile   # store in Avro data file format
--as-sequencefile   # store in SequenceFile format
--as-textfile       # store as plain text (default format)
--as-parquetfile    # store in Parquet format
We can also specify the compression setting while importing data to HDFS.
-z, --compress        # enable compression
--compression-codec   # use a valid Hadoop codec (default: gzip)
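For instance, to use Snappy instead of the default gzip, we can pass the codec's fully qualified class name. This is only a sketch and assumes the Snappy libraries are available on the cluster; the employees_snappy directory is made up for the example:
sqoop import \
--connect jdbc:mysql://localhost/employees \
--username root -P \
--table employees \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--target-dir employees_snappy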
Let us import the “employees” table again, this time into another directory and in Parquet format, using some of the parameters mentioned above.
sqoop import \
--connect jdbc:mysql://localhost/employees \
--username root -P \
--table employees \
--as-parquetfile \
-m 6 \
--compress --compression-codec 'gzip' \
--target-dir 'emp_new'
We have used the Parquet file format with compression, and we are using 6 mappers for this job. Listing the target directory shows the Parquet output files.
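To confirm, we can list the target directory; the exact file names will vary with the Sqoop and Parquet versions in use:
hdfs dfs -ls emp_new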
This is how we can use the Sqoop import tool to transfer a single table to HDFS. There are still some important parameters of the Sqoop import command left to cover; we will look at those in the next article.