Partitioning in Hive

IN Big Data Systems, we deal with GBs, TBs, or even Petabytes of data. When querying such huge datasets, we need to organize data in such ways that we can query and analyze data efficiently. This is where Data Partitions come into the picture. The concept of partition is not new but Hive uses them to keep data in different folders and use them intelligently while querying data. Let us understand more about Partitioning.

What is Partitioning and why to use it?

Before we learn how to create partitioned tables in Hive, let us see why should we use partitioning in Hive and what benefits it brings us.

Consider that we have data for the E-Commerce website with a huge number of orders (Amazon, eBay, etc). Suppose, we have all orders placed without partitioning in this table. But when we query such a table, say for getting all orders for month 11 and year 2019, Hive has to go and read all of the data, then filter data for required condition and then form aggregation.

You can see how it go out of hand when we have huge dateset and all we want is data for few months from one of year.

In comes partitioning. Let us keep data in orders tables such that we have following folder strecture.

  • orders/year=2019/month=01
  • orders/year=2019/month=02
  • orders/year=2019/month=11
  • orders/year=2019/month=12
  • ….

In this case, when Hive has to query for data in the year 2019, and month 11, it can go to directory orders/year=2019/month=11 and read that data. No matter how much other data or partitions we have, Hive has to read-only that amount of data to get our results back.

This improves performance drastically. We can run complex analytics queries with good performance as filter operation(Hive only has to read data from a few partitions specified in where clause) is performed even before query execution is started.

Now, that we have seen benefits of partitions, let us see how we can implement them in Hive.

Partition Managed Tables In Hive

We will discuss managed partition tables first. Consider our table orders as above.

While creating a table we can easily specify partitioning columns. Note that we do not have mention partition columns in the column list of tables.

So when we insert data into this table, each partition will have its separate folder. and when we run a query like “SELECT COUNT(1) FROM order_partition WHERE year=2019 and month=11”, Hive directly goes to that directory in HDFS and read all data instated of scanning whole table and then filtering data for given condition.

I have inserted data into this table. (We will learn how to write data to a partitioned table in next paragraphs). When we check the HDFS directory we can see the number of partitions created for our orders.

Let us select few rows from this table and see what kind of output we see.

Hive Partition Table Output
Hive Partition Table Output

Our table result shows our partitioned columns year and month. If you remember we did not include them in our table definition. Then how does Hive knows what value to pick for them and why did we not keep them in the table?

Even if we did not store values for partition columns in the table, Hive picks these values from directory names. That is why a query result can show us these values. We do not need to store these values in the table as a Hive can figure out those values and storing them again in the table will only introduce redundant storage.

We do not need to include partition columns in hive table definition.

Hive Facts

Partitioned External Tables in Hive

You can also use partitioning with external tables (You can read more about external vs managed tables in hive here). Rather you will find using partitioning more with external tables. External tables provide us with flexibility in selecting the HDFS path for our table and this is quite useful along with partitions. Let us see this in action

We are pointing this external table to the location of our previously created managed table. This will not add partitions to our table though. Hive does not know anything about what data or folders we have at that location ( more on this in Inserting data into partition tables).

To load data as partitions in this table we can run following command.

We will revisit the inserting data in partitioned tables in the next tutorial. For now, you can check that this query will add a partition to our table and also let Hive know where that partition data is located in HDFS.

Listing Partitions in Table

Now that we have created partition tables, let us learn how we can check table is partitioned or not and if it is partitioned, list down all partitions of table.

We can use either show create a table or describe formatted table command to get metadata for that table. If the table is partitioned it will show that in detail.

Partition Table Check

For listing partitions of table we can use below query.

Listing Down Subset of Partitions

You can have 100s of partitions in table. There is way you can only list partitions from one parent directory.

Conclusion

We have learned what is partitioning and why we use it in Hive. We have also created partitioned tables and learned about its basics.

In the next two blogs, we will learn more about partitioning in Hive. First, we will see what is static and dynamic partitioning, and finally, we will go through how we can load data to partitioned tables. See you there :).

Similar Posts

Leave a Reply

Your email address will not be published.