Apache Sqoop: Import Data to HDFS – Part 2

Hello everyone, in the last article on Sqoop Import we saw how to use this command to transfer a single table to HDFS. There are cases where we do not want all the rows from a table, or not all of its columns. In such cases, we can use a filter and/or a free-form query. It is also possible that a table has no primary key; in that case, we need to tell Sqoop how to split the data between mappers. Let us see how to use these options in practice.

Sqoop import with filter

Consider a case where we want to filter rows from a table while importing it to HDFS, or we do not want to bring all of its columns. We can use Sqoop arguments to get this job done.

For this demonstration, I am using 1 mapper for simplicity. The following are the contents of the “departments” table in MySQL.

Checking all tables in MySQL databases
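If you want to run this check from the command line, Sqoop’s list-tables and eval tools can do it without opening a MySQL client. Here is a minimal sketch, assuming a local MySQL instance, a database named employees, and placeholder credentials (all hypothetical; adjust for your environment):

# List the tables in the database
sqoop list-tables \
  --connect jdbc:mysql://localhost/employees \
  --username <user> --password <password>

# Preview the contents of the departments table
sqoop eval \
  --connect jdbc:mysql://localhost/employees \
  --username <user> --password <password> \
  --query "SELECT * FROM departments"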

Now we want to bring only the first 5 departments. We can specify this in a where clause in Sqoop Import.
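Here is a minimal sketch of such an import, reusing the hypothetical connection details above and assuming the first five departments have numbers d001 through d005 (an assumption about the sample data):

# Import only the rows matching the where clause
sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username <user> --password <password> \
  --table departments \
  --where "dept_no <= 'd005'" \
  -m 1

Unless a target directory is given, Sqoop writes the output files to a directory named after the table under your HDFS home directory.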

In the output, we can see that only the first 5 departments have been transferred to HDFS. This is because of the where clause we specified in the import command.

Checking HDFS data after running Sqoop import command

Sqoop import also has a columns argument which we can use to select the columns that we need to transfer to HDFS. This is useful when we do not want all of a table’s columns to be transferred.
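A minimal sketch, again with the hypothetical connection details used above:

# Import only the dept_no column of the departments table
sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username <user> --password <password> \
  --table departments \
  --columns "dept_no" \
  -m 1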

The above command will bring only the “dept_no” column from the “departments” table.

Sqoop import with free-form queries

Sqoop Import also lets us give a query instead of a table name. This is very useful when we want to perform some aggregation or join before transferring data to HDFS. Consider the case where we want to bring employees along with their respective departments. We need to join the employees, departments, and dept_emp tables.

When giving a free-form query we need to specify a target directory so that Sqoop knows where to keep the data in HDFS. We also need to include a ‘WHERE $CONDITIONS’ clause in the query. This is Sqoop syntax, and it helps Sqoop split our data across different mappers. Interestingly, we are using only one mapper in this case. If we try to use more than one, the Sqoop job will fail. Let us see why that is and how to get around it.
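Here is a minimal sketch of such a free-form import. The join assumes the standard layout of the MySQL sample employees database (emp_no, first_name, last_name, dept_name), and the connection details and target directory are placeholders:

# Free-form query import; the $CONDITIONS token is mandatory and is filled in by Sqoop
sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username <user> --password <password> \
  --query 'SELECT e.emp_no, e.first_name, e.last_name, d.dept_name
           FROM employees e
           JOIN dept_emp de ON e.emp_no = de.emp_no
           JOIN departments d ON de.dept_no = d.dept_no
           WHERE $CONDITIONS' \
  --target-dir /user/<your_hdfs_user>/employee_departments \
  -m 1

Note that the query is wrapped in single quotes so the shell does not try to expand $CONDITIONS itself.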

Sqoop Import with split by

As we mentioned in the first article, Sqoop uses the table’s primary key to split data across mappers. But what if the table does not have a primary key, or it has a composite primary key, or, as in a free-form query, there is no concept of a primary key at all? In all these cases, Sqoop cannot split data across mappers, and if we use more than 1 mapper we will get an error.

But using 1 mapper is not efficient. So in such cases, we have to tell Sqoop which column it should use to split the data. This is done with the split by argument. We should choose a column with a fair distribution of values so that Sqoop can split the data evenly. Let us take the above example and run it with the split by argument.
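A minimal sketch of the same free-form import with split by and four mappers, reusing the placeholder connection details and target directory from above:

# Split the query results across 4 mappers using the emp_no column
sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username <user> --password <password> \
  --query 'SELECT e.emp_no, e.first_name, e.last_name, d.dept_name
           FROM employees e
           JOIN dept_emp de ON e.emp_no = de.emp_no
           JOIN departments d ON de.dept_no = d.dept_no
           WHERE $CONDITIONS' \
  --split-by e.emp_no \
  --target-dir /user/<your_hdfs_user>/employee_departments_split \
  -m 4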

Here is the output of this query. We can see that our data is now split into 4 files. This is because we told Sqoop to split the data using the emp_no column of the employees table.

Running Sqoop import with split by field

This is how we can use the Sqoop Import command to transfer data to HDFS, using different arguments as per our needs. In the next article, we will see how we can use Sqoop to import tables directly into Hive.
