Hive – What is the difference between Collect Set and Collect List
Hello all, welcome to another article on Apache Hive. In this article, we will see how can we use COLLECT_SET and COLLECT_LIST to get a list of comma-separated values for a particular column while doing grouping operation. While doing hive queries we have used group by operation very often to perform all kinds of aggregation operations like sum, count, max, etc. Consider there is a table with a schema like the following.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
hive- show create table employee; OK CREATE TABLE employee( emp_no int, birth_date bigint, first_name string, last_name string, gender string, hire_date bigint, salary int, salary_from_date bigint, salary_to_date bigint, dept_no string, dept_from_date bigint, dept_to_date bigint, title string, titile_from_date bigint, title_to_date bigint) |
We want to calculate the average salary for each employee and also get a list of departments employee has worked. If we try to run the query without dept_id in GROUP BY, Hive will complain.
1 2 3 4 |
select emp_no,dept_no,avg(salary) from employee group by emp_no Error while compiling statement: FAILED: SemanticException [Error 10025]: line 1:14 Expression not in GROUP BY key 'dept_no' |
 If we include dept_id in GROUP BY clause we will get a result like this.
1 2 3 |
select emp_no,dept_no,avg(salary) from employee where emp_no in (14979,51582,10001,10002) group by emp_no,dept_no; |
This is because each employee may have worked in more than one department. that is why we cannot include dept_id in GROUP BY.
Use case I
This is one of a use case where we can use COLLECT_SET and COLLECT_LIST. If we want to list all the departments for an employee we can just use COLLECT_SET which will return an array of DISTINCT dept_id for that employee.
1 2 3 |
select emp_no,COLLECT_SET(dept_no) as dept_no_list,avg(salary) from employee where emp_no in (14979,51582,10001,10002) group by emp_no; |
We can also use COLLECT_LIST if we do not want to remove duplicate values from a particular column.
1 2 3 |
select emp_no,COLLECT_LIST(dept_no) as dept_no_list,avg(salary) from employee where emp_no in (14979,51582,10001,10002) group by emp_no; |
COLLECT_SET and COLLECT_LIST return array for column type used. We can perform array operations like the size and contains.
1 2 3 4 5 6 |
select emp_no,collect_set(dept_no) as dept_no_list ,avg(salary) ,size(collect_set(dept_no)) as total_dept_count ,array_contains(collect_set(dept_no),'d007') as have_worked_in_d007 from employee where emp_no in (14979,51582,10001,10002) group by emp_no; |
Use case II
This is one simple use case. But we can use COLLECT_SET and COLLECT_LIST according to needs to simplify the query. Suppose we want to know an employee has been promoted or not while getting his average salary. We can know about his promotion status from the title column. If an employee has more than two titles we can say that he has been promoted. We can write that query in the following way.
1 2 3 4 5 6 |
select emp_no,collect_set(dept_no) as dept_no_list ,avg(salary) ,case when size(collect_set(title)) > 1 then 'Promoted' else 'Not Promoted' end as promotion_status from employee where emp_no in (14979,51582,10001,10002) group by emp_no; |
In this way, we can use COLLECT_SET and COLLECT_LIST to simplify queries in certain conditions.
This is clear and nice explanation.
Can you please provide data to insert into table