Apache Sqoop - Import Relational Data to HDFS

Hello everyone, in the last article of Sqoop Import we have seen how can we use this command to transfer a single table to HDFS. There are cases when we do not want all the data from the table or not all the columns. In such cases, we can use a filter and/or free-form of queries. There is also a possibility when a table has no primary key. In that case, we need to tell Sqoop how it can split data between mappers. Let us see how can we use that in practice.

Sqoop import with filter

Consider a case where we want to filter rows from the table while importing it to HDFS. Or we do not want to bring all the columns from the table. We can use Sqoop arguments to get this job done.

For this demonstration, I am using 1 mapper for simplicity. The following are the content of “departments” table in MySQL.

Now we want to bring the first 5 departments only. We can specify this is where clause in Sqoop Import.

sqoop import --connect jdbc:mysql://localhost/employees
--username root -P
--table departments
--where 'dept_no &lt;= "d005"'
-m 1

sqoop import --connect jdbc:mysql://localhost/employees

--username root -P

--table departments

--where 'dept_no <= "d005"'

-m 1

In the output, we can see that only the first 5 departments have been transferred to HDFS. This is because we have specified that in where clause in the import command.

Checking HDFS data after running Sqoop import command

Sqoop import has columns argument which we can use to select the columns that we need to transfer to HDFS. This is useful when we don’t want entire table columns to be transferred to HDFS.

sqoop import --connect jdbc:mysql://localhost/employees
--username root -P
--table departments
--columns 'dept_no'
--target-dir 'dept_no'
-m 1

sqoop import --connect jdbc:mysql://localhost/employees

--username root -P

--table departments

--columns 'dept_no'

--target-dir 'dept_no'

-m 1

The above command will bring only “dept_no” column from “departments” table.

Sqoop import with free-form queries

We have a way to give queries instead of table names to Sqoop Imports. This is very useful when we want to perform some aggregation or join before transferring data to HDFS. Consider the case, we want to bring employees and their respective departments as well. We need to perform join employees, departments and dept_emp table.

sqoop import --connect jdbc:mysql://localhost/employees --username root -P
--query 'select e.emp_no,e.first_name,e.last_name,d.dept_name
from employees e join dept_emp de on de.emp_no = e.emp_no
join departments d on de.dept_no = d.dept_no
where $CONDITIONS'
-m 1
--target-dir empjoined

sqoop import --connect jdbc:mysql://localhost/employees --username root -P

--query 'select e.emp_no,e.first_name,e.last_name,d.dept_name

from employees e join dept_emp de on de.emp_no = e.emp_no

join departments d on de.dept_no = d.dept_no

where $CONDITIONS'

-m 1

--target-dir empjoined

When giving free-form of query we need to give target directory so that Sqoop knows where to keep data in HDFS. We also need to give ‘WHERE $CONDITIONS’ clause in a query. This is Sqoop syntax and it helps Sqoop split our data to different mappers. Interestingly we are using only one mapper in this case. If we try to use more than one Sqoop job will fail. Let us see why is that and how to get around it.

Sqoop Import with split by

As we have mentioned in the first article, Sqoop uses the primary key from the table to split data in different mappers. But what if the table does not have a primary key or it has a composite primary key or even in free-form of query there is no concept of the primary key. In all these cases, Sqoop cannot split data to mappers and if we use more than 1 mapper we will get an error.

But using 1 mapper is not efficient. So in such cases, we have to tell Sqoop which column it should use to split data. This is done by split by argument. We should choose column such that it has a fair distribution of values so Sqoop can split data equally. Let us take the above example and run it with split by argument.

sqoop import --connect jdbc:mysql://localhost/employees --username root -P
--query 'select e.emp_no,e.first_name,e.last_name,d.dept_name from employees e
join dept_emp de on de.emp_no = e.emp_no
join departments d on de.dept_no = d.dept_no
where $CONDITIONS' --target-dir empjoined_new
--split-by 'e.emp_no'

sqoop import --connect jdbc:mysql://localhost/employees --username root -P

--query 'select e.emp_no,e.first_name,e.last_name,d.dept_name from employees e

join dept_emp de on de.emp_no = e.emp_no

join departments d on de.dept_no = d.dept_no

where $CONDITIONS' --target-dir empjoined_new

--split-by 'e.emp_no'

Here is the output of this query. We can see that now our data is split into 4 files. This is due to we have suggested Sqoop split our data using the emp_no column in the employee’s table.

Running Sqoop import with split by filed

This is how we can use the Sqoop Import command to transfer data to HDFS. We can use different arguments as per our needs. In the next article, we will see how we can use Sqoop to imports the table directly to Hive.

Apache Sqoop: Import Data to HDFS – Part 2

Sqoop import with filter

Sqoop import with free-form queries

Sqoop Import with split by

How to set up Cloudera Quickstart VM on windows

Apache Sqoop Introduction

What is HDFS – Overview of Hadoop’s distributed file system

Reading data from a file in Spark

Hive – Advanced Aggregations with Grouping sets, Rollup and cube

Pivot rows to columns in Hive

Sqoop import with filter

Sqoop import with free-form queries

Sqoop Import with split by

Similar Posts