Apache Sqoop - Import data to HDFS

Updated On February 12, 2021 | By Mahesh Mogal

Hello everyone. In this article, we will explore the Sqoop import command, which is used to transfer data from an RDBMS to a Hadoop cluster. The Sqoop import tool transfers one table at a time, and we can choose the file format in which the data is stored in HDFS. There is also an import-all-tables variant of this tool, which imports every table in a database to HDFS.

Sqoop Import

The import command is used to import a single table from a database into HDFS. Let's look at an example.
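A minimal form of the command might look like this; the JDBC URL, database name, and credentials below are placeholders, so adjust them for your environment:

```shell
# Import the "employees" table from MySQL into HDFS.
# The connection details here are illustrative placeholders.
sqoop import \
  --connect jdbc:mysql://localhost:3306/employees_db \
  --username sqoop_user \
  --password-file /user/sqoop_user/mysql.password \
  --table employees
```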

This Sqoop command will transfer the "employees" table from MySQL to HDFS. In the user's home directory on HDFS, a directory named "employees" will be created containing all the transferred data.

Importing table to HDFS using Sqoop Import

Sqoop Import Arguments

We can change the name of the directory where Sqoop imports data in HDFS. Also, we can keep the folders of all import jobs in a single location by specifying a warehouse directory path.
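For example, using the same placeholder connection details as before (the HDFS paths here are also illustrative):

```shell
# Import into an explicit target directory on HDFS...
sqoop import \
  --connect jdbc:mysql://localhost:3306/employees_db \
  --username sqoop_user \
  --table employees \
  --target-dir /data/employees_import

# ...or group several import jobs under one warehouse directory;
# Sqoop creates a subdirectory per table, e.g. /data/warehouse/employees.
sqoop import \
  --connect jdbc:mysql://localhost:3306/employees_db \
  --username sqoop_user \
  --table employees \
  --warehouse-dir /data/warehouse
```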

Remember that you cannot specify both the target directory and the warehouse directory in the same import; the two arguments are incompatible, so use one or the other.

We can see the output has been split into 4 parts. This is because, by default, the Sqoop MapReduce job launches 4 mappers. Sqoop divides the data among those 4 mappers so that each mapper carries roughly the same load. Sqoop uses the primary key of the table to decide how to split the data across mappers. Suppose emp_id is the primary key of this employees table and it has 1000 values; Sqoop will then split the table among 4 mappers such that each mapper transfers roughly 250 records.

We can change the number of mappers in the Sqoop command by passing the argument '-m' (or '--num-mappers') followed by the number of mappers we want. It is tempting to specify a large number of mappers, but this may consume all the resources on the RDBMS and HDFS, which in turn slows down all other operations.
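For instance, to run the same import with 2 mappers instead of the default 4 (connection details are placeholders as before):

```shell
# Run the import with 2 parallel mappers.
# If the table had no primary key, we would also need --split-by <column>
# to tell Sqoop which column to partition on.
sqoop import \
  --connect jdbc:mysql://localhost:3306/employees_db \
  --username sqoop_user \
  --table employees \
  -m 2
```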

We can also specify the file format in which we want to store data in HDFS. The import tool supports 4 file formats: plain text (the default), SequenceFile, Avro, and Parquet.
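The format is chosen by adding one of the following mutually exclusive flags to the import command:

```shell
--as-textfile        # plain text, the default
--as-sequencefile    # Hadoop SequenceFiles
--as-avrodatafile    # Avro data files
--as-parquetfile     # Parquet files
```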

We can also specify the compression setting while importing data to HDFS.
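Compression is enabled with a flag on the same command; for example (placeholder connection details again):

```shell
# Enable compression with the default codec (gzip)...
sqoop import \
  --connect jdbc:mysql://localhost:3306/employees_db \
  --username sqoop_user \
  --table employees \
  --compress

# ...or name a codec explicitly, e.g. Snappy.
sqoop import \
  --connect jdbc:mysql://localhost:3306/employees_db \
  --username sqoop_user \
  --table employees \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec
```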

Let us import the "employees" table again, this time into another directory in Parquet format, using some of the parameters mentioned above.
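Assuming the same placeholder connection details and an illustrative target path, such a command could look like this (how the codec setting interacts with Parquet can vary between Sqoop versions):

```shell
# Import "employees" as compressed Parquet files using 6 mappers.
sqoop import \
  --connect jdbc:mysql://localhost:3306/employees_db \
  --username sqoop_user \
  --table employees \
  --target-dir /data/employees_parquet \
  --as-parquetfile \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  -m 6
```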

We have used the Parquet file format with compression, and we are using 6 mappers for this job, as we can see in the following output.

Importing table to HDFS with compression using Sqoop Import

This is how we can use the Sqoop import tool to transfer a single table to HDFS. There are still some important parameters remaining in the Sqoop Import command. We will see those in the next article.

