Hive Data Manipulation - Loading Data to Hive Tables

Updated On August 11, 2020 | By Mahesh Mogal

By now, we have created different types of tables with a variety of data types in the Hive. But we have not seen how we can populate these tables with data either from files in our local system or some already existing data in HDFS. In today's tutorial, we will learn about different ways in which we can fill data into Hive tables.

Loading Local System Data to HDFS Table

For this tutorial, consider simple table below which does not have any data in it now.

hive (maheshmogal)> CREATE TABLE employee(id INT, name STRING, state STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
OK
Time taken: 0.382 seconds
hive (maheshmogal)>

Using HDFS command below, we can copy data from our local system into a Hive table. We can either specify OVERWRITE which clears existing data in table and loads new data or we can omit OVERWRITE which appends new data to existing table data. Let us see it in action below.

hive (maheshmogal)> LOAD DATA LOCAL INPATH 'emp.txt' OVERWRITE INTO TABLE employee;
Loading data to table maheshmogal.employee
Table maheshmogal.employee stats: [numFiles=1, numRows=0, totalSize=27, rawDataSize=0]
OK
Time taken: 0.959 seconds
hive (maheshmogal)> select * from employee;
OK
1       abc     CA
2       xyz     NY
3       pqr     CA
Time taken: 0.228 seconds, Fetched: 3 row(s)
hive (maheshmogal)>

Here we can see that the employee table has been loaded with data from our local file system. We can also try the same same query without the OVERWRITE clause.

-- command without overwrite will append data to existing table. That is why we have duplicates in table
hive (maheshmogal)> LOAD DATA LOCAL INPATH 'emp.txt' INTO TABLE employee;
Loading data to table maheshmogal.employee
Table maheshmogal.employee stats: [numFiles=2, numRows=0, totalSize=54, rawDataSize=0]
OK
Time taken: 1.203 seconds
hive (maheshmogal)> select * from employee;
OK
1       abc     CA
2       xyz     NY
3       pqr     CA
1       abc     CA
2       xyz     NY
3       pqr     CA
Time taken: 0.677 seconds, Fetched: 6 row(s)
-- here we can see that when we use OVERWRITE old data is delayed and only new data is in table.
hive (maheshmogal)> LOAD DATA LOCAL INPATH 'emp.txt' OVERWRITE INTO TABLE employee;
Loading data to table maheshmogal.employee
Table maheshmogal.employee stats: [numFiles=1, numRows=0, totalSize=27, rawDataSize=0]
OK
Time taken: 1.144 seconds
hive (maheshmogal)> select * from employee;
OK
1       abc     CA
2       xyz     NY
3       pqr     CA
Time taken: 0.218 seconds, Fetched: 3 row(s)
hive (maheshmogal)>

Loading Data From HDFS Location into Hive Table

We can use the same command as above to load data from HDFS location to Hive table. We only have to remove the "LOCAL" keyword from command. There is another thing to note while loading data from HDFS location to hive table.

When we load data from Local system to Hive table LOAD DATA command copies(copy+paste) file from system to HDFS. But when we use LOAD DATA with HDFS file location then it moves (cut + paste)data from source location to target location.

Hive Facts
hive (maheshmogal)> dfs -ls ;
Found 10 items
drwx------   - maheshmogal hdfs          0 2020-05-26 08:04 .Trash
drwx------   - maheshmogal hdfs          0 2020-05-26 02:24 .staging
-rw-r--r--   2 maheshmogal hdfs         27 2020-05-26 13:20 emp.txt
hive (maheshmogal)> LOAD DATA INPATH '/user/maheshmogal/emp.txt' OVERWRITE INTO TABLE employee;
Loading data to table maheshmogal.employee
Table maheshmogal.employee stats: [numFiles=1, numRows=0, totalSize=27, rawDataSize=0]
OK
Time taken: 1.348 seconds
hive (maheshmogal)> dfs -ls ;
Found 9 items
drwx------   - maheshmogal hdfs          0 2020-05-26 08:04 .Trash
drwx------   - maheshmogal hdfs          0 2020-05-26 02:24 .staging
hive (maheshmogal)>

In the above example, we can see that the LOAD data command has copied the file from our HDFS location into the hive table and also deleted that file from the original location.

Loading Data From One Table to Another Table in Hive

We can also load data from one hive table to another. you may have one large table with all kinds of information and you need very specific data from it. In that case you can use the INSERT query like below to add data to your table.

hive (maheshmogal)> create table sample_emp (id int, name string);
OK
Time taken: 0.537 seconds
hive (maheshmogal)> INSERT OVERWRITE TABLE sample_emp
                  > SELECT id, name FROM employee;
Query ID = maheshmogal_20200526132900_e2a5c20f-bcce-4dc6-9247-fe7cfacb189c
Total jobs = 3
Launching Job 1 out of 3
-- map reduce output 
Time taken: 19.256 seconds
hive (maheshmogal)> select * from sample_emp;
OK
1       abc
2       xyz
3       pqr
Time taken: 0.248 seconds, Fetched: 3 row(s)

Insert command will trigger a map-reduce job and insert data from a source table to a destination table. Just like the LOAD DATA command we can use the INTO clause instead of OVERWRITE, which will append data to an existing table.

We do not need to create a table beforehand to insert data into it. We can use the hive's Create Table as a syntax to create table on the fly and insert data into it.

hive (maheshmogal)> drop table sample_emp;
OK
Time taken: 0.711 seconds
hive (maheshmogal)> CREATE TABLE sample_emp AS
                  > SELECT id, name FROM employee;
Query ID = maheshmogal_20200526133635_621e51d9-b86f-4938-856d-13fc13a5bb22
Total jobs = 3
Launching Job 1 out of 3
-- MR logs
hive (maheshmogal)> select * from sample_emp;
OK
1       abc
2       xyz
3       pqr
Time taken: 0.233 seconds, Fetched: 3 row(s)
hive (maheshmogal)>

Multi Insert Command

In hive we can use multiple insert commands in a single query. This is useful when we want to scan the entire table once and divide it into smaller set of tables in one single query.

Consider the example, you have large tables with employees and you want to create a separate table for each state which has all employees of that state. We can do it with where clause and adding employees to the respective table. However this will cause our large table to be scanned multiple times. Instead of that, we can use the query below to process this requirement efficiently.

hive (maheshmogal)> FROM employee e
                  > INSERT OVERWRIE TABLE emp_ca
                  > SELECT * WHERE e.state = 'CA'
                  > INSERT OVERWRITE TABLE emp_ny
                  > SELECT * WHERE e.state = 'NY';

This feature is quite powerful as we can populate multiple tables in a single query. When we will learn about Partitions in the Hive, this feature will be even more useful to dynamically create partitions and fill them with proper data.

Conclusion

In this article, we have learned how to insert data into hive tables using different ways. I hope you have found this useful. See you in the next blog.

Loading Data to Hive Tables
.

Mahesh Mogal

I am passionate about Cloud, Data Analytics, Machine Learning, and Artificial Intelligence. I like to learn and try out new things. I have started blogging about my experience while learning these exciting technologies.

Stay Updated with Latest Blogs

Get latest blogs delivered to your mail directly.

Recent Posts

Select Expr in Spark Dataframe

In this blog, we will learn how to use select and expr in the Spark data frame. We will learn multiple use cases along with selectExpr.

Read More
Add, Rename, Drop Columns in Spark Dataframe

We will go through common column operations like add, rename, list, select, and dropping a column from spark dataframe.

Read More
MSCK Repair - Fixing Partitions in Hive Table

We will learn how to add multiple partitions to hive table using msck repair table command in hive.

msck repair hive
Read More

Leave a Reply

Your email address will not be published. Required fields are marked *

linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram
Share via
Copy link
Powered by Social Snap