Apache Sqoop: Import Data to HDFS – Part 2
Hello everyone, in the last article on Sqoop import we saw how to use this command to transfer a single table to HDFS. There are cases where we do not want all the rows from a table, or not all of its columns. In such cases, we can use a filter and/or a free-form query. There is also the possibility that a table has no primary key; in that case, we need to tell Sqoop how to split the data between mappers. Let us see how to use these options in practice.
Sqoop import with filter
Consider a case where we want to filter rows from a table while importing it to HDFS, or where we do not want to bring over all of its columns. We can use Sqoop arguments to get this job done.
For this demonstration, I am using 1 mapper for simplicity. The following is the content of the “departments” table in MySQL.
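If you are following along with the standard MySQL “employees” sample database, you can inspect the table contents yourself before importing; in that stock sample data the table holds nine rows, d001 through d009 (this assumes you loaded the unmodified sample database):

mysql -u root -p employees -e "SELECT * FROM departments ORDER BY dept_no;"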
Now we want to bring only the first 5 departments. We can specify this with the --where argument in Sqoop import.
sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username root -P \
  --table departments \
  --where 'dept_no <= "d005"' \
  -m 1
In the output, we can see that only the first 5 departments have been transferred to HDFS, because of the condition we specified in the --where clause of the import command.
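To verify, we can look at the imported file in HDFS. Since no --target-dir was given, Sqoop writes under the current user's HDFS home directory in a directory named after the table; the path below assumes that default layout:

hdfs dfs -cat departments/part-m-00000
# Expect only the rows with dept_no d001 through d005,
# one comma-separated record per line.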
Sqoop import also has a --columns argument, which we can use to select only the columns we need. This is useful when we do not want all of the table's columns to be transferred to HDFS.
sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username root -P \
  --table departments \
  --columns 'dept_no' \
  --target-dir 'dept_no' \
  -m 1
The above command will bring only the “dept_no” column from the “departments” table.
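The --columns argument takes a comma-separated list of column names, and it can be combined with --where. As a quick sketch of a variant of the earlier command (the target directory name here is just for illustration):

sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username root -P \
  --table departments \
  --columns 'dept_no,dept_name' \
  --where 'dept_no <= "d005"' \
  --target-dir dept_filtered \
  -m 1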
Sqoop import with free-form queries
We can also give Sqoop import a query instead of a table name. This is very useful when we want to perform some aggregation or join before transferring data to HDFS. Consider the case where we want to bring over employees together with their respective departments: we need to join the employees, departments and dept_emp tables.
sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username root -P \
  --query 'select e.emp_no,e.first_name,e.last_name,d.dept_name from employees e join dept_emp de on de.emp_no = e.emp_no join departments d on de.dept_no = d.dept_no where $CONDITIONS' \
  -m 1 \
  --target-dir empjoined
When giving a free-form query, we need to give a target directory so that Sqoop knows where to keep the data in HDFS. We also need to include a 'WHERE $CONDITIONS' clause in the query. This is Sqoop syntax, and it lets Sqoop inject the conditions it uses to split our data across different mappers. Notice that we are using only one mapper in this case; if we try to use more than one, the Sqoop job will fail. Let us see why that is and how to get around it.
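One caveat worth noting from Sqoop's documentation: the examples here wrap the query in single quotes. If you wrap it in double quotes instead, the shell will try to expand $CONDITIONS as a variable, so it must be escaped as \$CONDITIONS, for example (the target directory name is just illustrative):

sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username root -P \
  --query "select * from departments where \$CONDITIONS" \
  -m 1 \
  --target-dir departments_quoted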
Sqoop import with split-by
As we mentioned in the first article, Sqoop uses the table's primary key to split data across mappers. But what if the table does not have a primary key, or has a composite primary key? And in a free-form query there is no notion of a primary key at all. In all these cases, Sqoop cannot split the data across mappers, and if we use more than 1 mapper we will get an error.
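For reference, when Sqoop cannot determine a split column for a table import, the failure looks roughly like the following (quoted from memory, so the exact wording may differ between Sqoop versions):

ERROR tool.ImportTool: Error during import: No primary key could be found for table <table>.
Please specify one with --split-by or perform a sequential import with '-m 1'.

For a free-form query run with more than one mapper, the message instead tells you that you must specify --split-by when importing query results in parallel.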
But using 1 mapper is not efficient. In such cases, we have to tell Sqoop which column it should use to split the data. This is done with the --split-by argument. We should choose a column with a fairly even distribution of values so that Sqoop can split the data into roughly equal parts. Let us take the example above and run it with the --split-by argument.
sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username root -P \
  --query 'select e.emp_no,e.first_name,e.last_name,d.dept_name from employees e join dept_emp de on de.emp_no = e.emp_no join departments d on de.dept_no = d.dept_no where $CONDITIONS' \
  --target-dir empjoined_new \
  --split-by 'e.emp_no'
Here is the output of this query. We can see that our data is now split into 4 files, one per mapper (4 is Sqoop's default number of mappers). This is because we told Sqoop to split the data using the emp_no column of the employees table.
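To confirm, we can list the target directory in HDFS; the part-file names follow the standard MapReduce naming convention, so the listing should look roughly like this:

hdfs dfs -ls empjoined_new
# Expect one output file per mapper, for example:
#   empjoined_new/_SUCCESS
#   empjoined_new/part-m-00000
#   empjoined_new/part-m-00001
#   empjoined_new/part-m-00002
#   empjoined_new/part-m-00003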
This is how we can use the Sqoop import command to transfer data to HDFS, with different arguments as our needs require. In the next article, we will see how we can use Sqoop to import tables directly into Hive.