Hive – Advanced Aggregations with Grouping sets, Rollup and cube
Hello everyone. We have used GROUP BY operation to perform aggregations in our queries. Consider the case where we have data with retail store inventory. Every month, we have shipped products to different stores with different product types like clothing, home appliances, etc. Now we want to calculate how many products have we shipped to each store according to product types as well as total products shipped irrespective of its type. This will not be possible in a single group by statement. This is where GROUPING SETS, ROLLUP, and CUBE come in to picture.
GROUPING SETS
Consider we have data in the following format.
Now we need to calculate total products shipped to store in each category as well as total products across all product types. We can use GROUPING SETS for this.
1 2 3 4 |
SELECT store_id,product_type,SUM(shipped_units) as total_count FROM shipped_products GROUP BY store_id,product_type GROUPING SETS ((store_id,product_type), store_id); |
This query will give us output in the following format.
We have calculated the number of products shipped to each store for each product type as well as the total number of products shipped to that store. Where the product_type column is null we have the total sum of products sent to that store across all product types. For achieving the same result with GROUP BY, we would need to write the following query. We can see that this query is clearly inefficient as it reads an entire table twice to get us the result which GROUPING SET can do it one scan.
1 2 3 4 5 6 7 |
SELECT store_id,product_type,SUM(shipped_units) as total_count FROM shipped_products GROUP BY store_id,product_type UNION ALL SELECT store_id,null as product_type,SUM(shipped_units) as total_count FROM shipped_products GROUP BY store_id; |
Consider another example,
1 2 3 |
SELECT a,b,c SUM(d) from table GROUP BY a,b,c GROUPING SETS ((a,b),(a,c),(a,b,c)); |
To get the same result with GROUP BY we will need the following query.
1 2 3 4 5 6 7 8 |
SELECT a,b,c SUM(d) FROM table GROUP BY a,b,c UNION ALL SELECT a, null as b,c FROM table GROUP BY a,c UNIon ALL SELECT a, b, null as c FROM table GROUP BY a,b; |
We can use GROUPING_SET to get all products shipped from inventory to all stores and of all product types. If we provide blank set in GROUPING SET clause we will get total products shipped from this inventory.
1 2 3 4 |
SELECT store_id,product_type,SUM(shipped_units) as total_count FROM shipped_products GROUP BY store_id,product_type GROUPING SETS ((store_id,product_type), store_id,()); |
Here, the first row shows us the total number of products shipped to all stores and of all product types. We can pivot required values to another column instead of a row in the output. To know how we can pivot rows to columns you can check Pivot rows to columns in Hive
GROUPING__ID
As we can see, while using GROUPING SETS we have columns with null values. It may be confusing sometimes to know which group is used to perform aggregation. We can use GROUPING__ID function to know which group is used for that aggregation results.
1 2 3 4 5 |
SELECT store_id,product_type,SUM(shipped_units) as total_count, GROUPING__ID FROM shipped_products GROUP BY store_id,product_type GROUPING SETS ((store_id,product_type), store_id,()); |
In this case,
0 = no group
1 = aggregation performed over one column i.e stored_id
3= aggregation performed over both column i.e store_id, product_type
ROLLUP and CUBE
ROLLUP AND CUBE are functions built on GROUPING SETS.
In the case of ROLLUP,
1 2 |
SELECT a,b,c,sum(d) FROM table GROUP BY a,b,c ROLLUP; |
This is equivalent to the following GROUPING SET query.
1 2 3 |
SELECT a,b,c SUM(d) from table GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(a),()) |
CUBE considers all possible combinations of columns present in GROUP BY clause.
1 2 |
SELECT a,b,c,sum(d) FROM table GROUP BY a,b,c CUBE; |
This is equivalent to the following GROUPING SET query.
1 2 3 |
SELECT a,b,c SUM(d) from table GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(b,c),(a,c),(a),(b),(c),()); |
This is how we can perform enhanced aggregations in Hive which are optimized than normal GROUP BY clause.
That’s an amazingly simplified explanation if the advanced aggregate functions.
Thanks a lot.
Thanks Amijeet