Hive Data Manipulation – Loading Data to Hive Tables
By now, we have created different types of tables with a variety of data types in the Hive. But we have not seen how we can populate these tables with data either from files in our local system or some already existing data in HDFS. In today’s tutorial, we will learn about different ways in which we can fill data into Hive tables.
Loading Local System Data to HDFS Table
For this tutorial, consider simple table below which does not have any data in it now.
1 2 3 |
CREATE TABLE employee(id INT, name STRING, state STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; OK Time taken: 0.382 seconds |
Using HDFS command below, we can copy data from our local system into a Hive table. We can either specify OVERWRITE which clears existing data in table and loads new data or we can omit OVERWRITE which appends new data to existing table data. Let us see it in action below.
1 2 3 4 5 6 7 8 9 10 11 |
hive (maheshmogal)> LOAD DATA LOCAL INPATH 'emp.txt' OVERWRITE INTO TABLE employee; Loading data to table maheshmogal.employee Table maheshmogal.employee stats: [numFiles=1, numRows=0, totalSize=27, rawDataSize=0] OK Time taken: 0.959 seconds hive (maheshmogal)> select * from employee; OK 1 abc CA 2 xyz NY 3 pqr CA Time taken: 0.228 seconds, Fetched: 3 row(s) |
Here we can see that the employee table has been loaded with data from our local file system. We can also try the same same query without the OVERWRITE clause.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
-- command without overwrite will append data to existing table. That is why we have duplicates in table hive (maheshmogal)> LOAD DATA LOCAL INPATH 'emp.txt' INTO TABLE employee; Loading data to table maheshmogal.employee Table maheshmogal.employee stats: [numFiles=2, numRows=0, totalSize=54, rawDataSize=0] OK Time taken: 1.203 seconds hive (maheshmogal)> select * from employee; OK 1 abc CA 2 xyz NY 3 pqr CA 1 abc CA 2 xyz NY 3 pqr CA Time taken: 0.677 seconds, Fetched: 6 row(s) -- here we can see that when we use OVERWRITE old data is delayed and only new data is in table. hive (maheshmogal)> LOAD DATA LOCAL INPATH 'emp.txt' OVERWRITE INTO TABLE employee; Loading data to table maheshmogal.employee Table maheshmogal.employee stats: [numFiles=1, numRows=0, totalSize=27, rawDataSize=0] OK Time taken: 1.144 seconds hive (maheshmogal)> select * from employee; OK 1 abc CA 2 xyz NY 3 pqr CA Time taken: 0.218 seconds, Fetched: 3 row(s) hive (maheshmogal)> |
Loading Data From HDFS Location into Hive Table
We can use the same command as above to load data from HDFS location to Hive table. We only have to remove the “LOCAL” keyword from command. There is another thing to note while loading data from HDFS location to hive table.
When we load data from Local system to Hive table LOAD DATA command copies(copy+paste) file from system to HDFS. But when we use LOAD DATA with HDFS file location then it moves (cut + paste)data from source location to target location.
Hive Facts
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
hive (maheshmogal)> dfs -ls ; Found 10 items drwx------ - maheshmogal hdfs 0 2020-05-26 08:04 .Trash drwx------ - maheshmogal hdfs 0 2020-05-26 02:24 .staging -rw-r--r-- 2 maheshmogal hdfs 27 2020-05-26 13:20 emp.txt hive (maheshmogal)> LOAD DATA INPATH '/user/maheshmogal/emp.txt' OVERWRITE INTO TABLE employee; Loading data to table maheshmogal.employee Table maheshmogal.employee stats: [numFiles=1, numRows=0, totalSize=27, rawDataSize=0] OK Time taken: 1.348 seconds hive (maheshmogal)> dfs -ls ; Found 9 items drwx------ - maheshmogal hdfs 0 2020-05-26 08:04 .Trash drwx------ - maheshmogal hdfs 0 2020-05-26 02:24 .staging hive (maheshmogal)> |
In the above example, we can see that the LOAD data command has copied the file from our HDFS location into the hive table and also deleted that file from the original location.
Loading Data From One Table to Another Table in Hive
We can also load data from one hive table to another. you may have one large table with all kinds of information and you need very specific data from it. In that case you can use the INSERT query like below to add data to your table.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
hive (maheshmogal)> create table sample_emp (id int, name string); OK Time taken: 0.537 seconds hive (maheshmogal)> INSERT OVERWRITE TABLE sample_emp > SELECT id, name FROM employee; Query ID = maheshmogal_20200526132900_e2a5c20f-bcce-4dc6-9247-fe7cfacb189c Total jobs = 3 Launching Job 1 out of 3 -- map reduce output Time taken: 19.256 seconds hive (maheshmogal)> select * from sample_emp; OK 1 abc 2 xyz 3 pqr Time taken: 0.248 seconds, Fetched: 3 row(s) |
Insert command will trigger a map-reduce job and insert data from a source table to a destination table. Just like the LOAD DATA command we can use the INTO clause instead of OVERWRITE, which will append data to an existing table.
We do not need to create a table beforehand to insert data into it. We can use the hive’s Create Table as a syntax to create table on the fly and insert data into it.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
hive (maheshmogal)> drop table sample_emp; OK Time taken: 0.711 seconds hive (maheshmogal)> CREATE TABLE sample_emp AS > SELECT id, name FROM employee; Query ID = maheshmogal_20200526133635_621e51d9-b86f-4938-856d-13fc13a5bb22 Total jobs = 3 Launching Job 1 out of 3 -- MR logs hive (maheshmogal)> select * from sample_emp; OK 1 abc 2 xyz 3 pqr Time taken: 0.233 seconds, Fetched: 3 row(s) hive (maheshmogal)> |
Multi Insert Command
In hive we can use multiple insert commands in a single query. This is useful when we want to scan the entire table once and divide it into smaller set of tables in one single query.
Consider the example, you have large tables with employees and you want to create a separate table for each state which has all employees of that state. We can do it with where clause and adding employees to the respective table. However this will cause our large table to be scanned multiple times. Instead of that, we can use the query below to process this requirement efficiently.
1 2 3 4 5 |
hive (maheshmogal)> FROM employee e > INSERT OVERWRIE TABLE emp_ca > SELECT * WHERE e.state = 'CA' > INSERT OVERWRITE TABLE emp_ny > SELECT * WHERE e.state = 'NY'; |
This feature is quite powerful as we can populate multiple tables in a single query. When we will learn about Partitions in the Hive, this feature will be even more useful to dynamically create partitions and fill them with proper data.
Conclusion
In this article, we have learned how to insert data into hive tables using different ways. I hope you have found this useful. See you in the next blog.