Apache Sqoop Introduction
One reason the Hadoop ecosystem became popular is its ability to process many different forms of data. But not all data lives in HDFS, i.e., the Hadoop Distributed File System. We have used relational databases to store and process structured data for a long time, so a great deal of data still resides in RDBMSs, and we need a tool to bring that data into HDFS. Sqoop solves this problem.
What is Sqoop?
Apache Sqoop is an open-source tool that helps users transfer data between structured data stores and Hadoop. Using it, we can extract data from an RDBMS, load it into Hadoop, process it there, and store the results back in the RDBMS.
Sqoop uses MapReduce to import and export data, which provides parallel operation. When we submit a Sqoop command, Sqoop translates it into a MapReduce job and submits it to the Hadoop cluster, which then executes the job.
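As a sketch of what such a command looks like, here is a typical import that splits a table across several parallel map tasks. The database name, table name, and target directory below are illustrative, not from any particular cluster.

```shell
# Import the 'employees' table from MySQL into HDFS using 4 parallel map tasks.
# Database, table, and path names here are illustrative.
sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username root -P \
  --table employees \
  --target-dir /user/hadoop/employees \
  --num-mappers 4
```

Each map task imports a slice of the table, which is how Sqoop achieves parallelism.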
Sqoop commands also generate Java classes that encapsulate the structure of the data being transferred. The generated Java source files and compiled classes are placed in the directory from which we invoke the Sqoop command, and we can reuse them in our own applications.
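If we only want that generated class, without actually moving any data, Sqoop provides a dedicated 'codegen' tool. Again, the connection details and table name below are illustrative.

```shell
# Generate only the Java class for a table, without importing any data.
# Connection details and table name are illustrative.
sqoop codegen \
  --connect jdbc:mysql://localhost/employees \
  --username root -P \
  --table employees
# This writes a Java source file (e.g. employees.java) modeling the
# table's columns into the current working directory.
```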
Sqoop Tools
We can use the command '$ sqoop help' to display the list of available Sqoop tools.
We can see all the options and configuration parameters of a particular tool by running '$ sqoop help <tool-name>'.
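The exact output depends on the Sqoop version, but for Sqoop 1.4.x the tool listing looks roughly like this:

```shell
$ sqoop help
# Typical Sqoop 1.4.x output lists the available tools, for example:
#   codegen            Generate code to interact with database records
#   eval               Evaluate a SQL statement and display the results
#   export             Export an HDFS directory to a database table
#   import             Import a table from a database to HDFS
#   import-all-tables  Import tables from a database to HDFS
#   list-databases     List available databases on a server
#   list-tables        List available tables in a database
#   version            Display version information
```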
We can check connectivity with a relational database or run an ad-hoc query using the 'eval' tool.
For this, we need to give the JDBC connection URL of the database, along with the username and password used to connect. For example, we can use 'eval' to list all tables in the "employees" database that already exists in MySQL.
sqoop eval --connect jdbc:mysql://localhost/employees --username root -P -e 'show tables;'
Here we used localhost in the connection string because MySQL runs on the same machine as Sqoop; if the MySQL database resides on another machine, we need to give its address instead. The -P flag makes Sqoop prompt for the password at runtime. You can give the password directly on the command line using the --password argument, but this is risky in a production environment because the password becomes visible in the shell history and process list.
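Newer Sqoop versions (1.4.4 and later) also support reading the password from a protected file with the --password-file argument, which keeps it out of the command line entirely. The file paths below are illustrative.

```shell
# Store the password in a file readable only by the current user.
# Paths are illustrative.
echo -n "cloudera" > /home/user/.sqoop.password
chmod 400 /home/user/.sqoop.password

# The file:// prefix tells Sqoop to read the file from the local
# filesystem rather than HDFS.
sqoop eval \
  --connect jdbc:mysql://localhost/employees \
  --username root \
  --password-file file:///home/user/.sqoop.password \
  -e 'show tables;'
```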
There is a better way to pass the connection string, username, and password: we can store all this information in a file and pass that file as an argument to the Sqoop command. Suppose we have the following options file, named conf.
#connection string
--connect
jdbc:mysql://localhost/employees
#username of database
--username
root
#password for database
--password
cloudera
Using the options file, we can rewrite the eval command like this.
sqoop eval --options-file conf -e 'show databases;'
Simple, isn't it?
We have seen what Sqoop is and its basic commands. We will explore it in more detail in the next few articles in this section.