Hive Advanced Aggregations with Grouping Sets, Rollup & Cube

Hello everyone. We have used GROUP BY operation to perform aggregations in our queries. Consider the case where we have data with retail store inventory. Every month, we have shipped products to different stores with different product types like clothing, home appliances, etc. Now we want to calculate how many products have we shipped to each store according to product types as well as total products shipped irrespective of its type. This will not be possible in a single group by statement. This is where GROUPING SETS, ROLLUP, and CUBE come in to picture.

GROUPING SETS

Consider we have data in the following format.

Now we need to calculate total products shipped to store in each category as well as total products across all product types. We can use GROUPING SETS for this.

SELECT store_id,product_type,SUM(shipped_units) as total_count
FROM shipped_products
GROUP BY store_id,product_type
GROUPING SETS ((store_id,product_type), store_id);

SELECT store_id,product_type,SUM(shipped_units) as total_count

FROM shipped_products

GROUP BY store_id,product_type

GROUPING SETS ((store_id,product_type), store_id);

This query will give us output in the following format.

Query output after running Groping set query in hive

We have calculated the number of products shipped to each store for each product type as well as the total number of products shipped to that store. Where the product_type column is null we have the total sum of products sent to that store across all product types. For achieving the same result with GROUP BY, we would need to write the following query. We can see that this query is clearly inefficient as it reads an entire table twice to get us the result which GROUPING SET can do it one scan.

SELECT store_id,product_type,SUM(shipped_units) as total_count
FROM shipped_products
GROUP BY store_id,product_type
UNION ALL
SELECT store_id,null as product_type,SUM(shipped_units) as total_count
FROM shipped_products
GROUP BY store_id;

SELECT store_id,product_type,SUM(shipped_units) as total_count

FROM shipped_products

GROUP BY store_id,product_type

UNION ALL

SELECT store_id,null as product_type,SUM(shipped_units) as total_count

FROM shipped_products

GROUP BY store_id;

Consider another example,

SELECT a,b,c SUM(d) from table
GROUP BY a,b,c
GROUPING SETS ((a,b),(a,c),(a,b,c));

SELECT a,b,c SUM(d) from table

GROUP BY a,b,c

GROUPING SETS ((a,b),(a,c),(a,b,c));

To get the same result with GROUP BY we will need the following query.

SELECT a,b,c SUM(d) FROM table
GROUP BY a,b,c
UNION ALL
SELECT a, null as b,c FROM table
GROUP BY a,c
UNIon ALL
SELECT a, b, null as c FROM table
GROUP BY a,b;

SELECT a,b,c SUM(d) FROM table

GROUP BY a,b,c

UNION ALL

SELECT a, null as b,c FROM table

GROUP BY a,c

UNIon ALL

SELECT a, b, null as c FROM table

GROUP BY a,b;

We can use GROUPING_SET to get all products shipped from inventory to all stores and of all product types. If we provide blank set in GROUPING SET clause we will get total products shipped from this inventory.

SELECT store_id,product_type,SUM(shipped_units) as total_count
FROM shipped_products
GROUP BY store_id,product_type
GROUPING SETS ((store_id,product_type), store_id,());

SELECT store_id,product_type,SUM(shipped_units) as total_count

FROM shipped_products

GROUP BY store_id,product_type

GROUPING SETS ((store_id,product_type), store_id,());

Here, the first row shows us the total number of products shipped to all stores and of all product types. We can pivot required values to another column instead of a row in the output. To know how we can pivot rows to columns you can check Pivot rows to columns in Hive

GROUPING__ID

As we can see, while using GROUPING SETS we have columns with null values. It may be confusing sometimes to know which group is used to perform aggregation. We can use GROUPING__ID function to know which group is used for that aggregation results.

SELECT store_id,product_type,SUM(shipped_units) as total_count,
GROUPING__ID
FROM shipped_products
GROUP BY store_id,product_type
GROUPING SETS ((store_id,product_type), store_id,());

SELECT store_id,product_type,SUM(shipped_units) as total_count,

GROUPING__ID

FROM shipped_products

GROUP BY store_id,product_type

GROUPING SETS ((store_id,product_type), store_id,());

Output for query in hive with grouping set and grouping ID

In this case,

0 = no group

1 = aggregation performed over one column i.e stored_id

3= aggregation performed over both column i.e store_id, product_type

ROLLUP and CUBE

ROLLUP AND CUBE are functions built on GROUPING SETS.

In the case of ROLLUP,

SELECT a,b,c,sum(d) FROM table
GROUP BY a,b,c ROLLUP;

1 2	SELECT a,b,c,sum(d) FROM table GROUP BY a,b,c ROLLUP;

This is equivalent to the following GROUPING SET query.

SELECT a,b,c SUM(d) from table
GROUP BY a,b,c
GROUPING SETS ((a,b,c),(a,b),(a),())

SELECT a,b,c SUM(d) from table

GROUP BY a,b,c

GROUPING SETS ((a,b,c),(a,b),(a),())

CUBE considers all possible combinations of columns present in GROUP BY clause.

SELECT a,b,c,sum(d) FROM table
GROUP BY a,b,c CUBE;

1 2	SELECT a,b,c,sum(d) FROM table GROUP BY a,b,c CUBE;

This is equivalent to the following GROUPING SET query.

SELECT a,b,c SUM(d) from table
GROUP BY a,b,c
GROUPING SETS ((a,b,c),(a,b),(b,c),(a,c),(a),(b),(c),());

SELECT a,b,c SUM(d) from table

GROUP BY a,b,c

GROUPING SETS ((a,b,c),(a,b),(b,c),(a,c),(a),(b),(c),());

This is how we can perform enhanced aggregations in Hive which are optimized than normal GROUP BY clause.

Hive – Advanced Aggregations with Grouping sets, Rollup and cube

GROUPING SETS

GROUPING__ID

ROLLUP and CUBE

Hive Data Manipulation – Loading Data to Hive Tables

Creating Database in Hive

External Vs Internal(Managed) Tables in Hive

Alter Table Partitions in Hive

Apache Sqoop – Import data to HDFS

MSCK Repair – Fixing Partitions in Hive Table

2 Comments

GROUPING SETS

GROUPING__ID

ROLLUP and CUBE

Similar Posts

2 Comments