Hive - What is the difference between Collect Set and Collect List

Updated On February 12, 2021 | By Mahesh Mogal

Hello all, welcome to another article on Apache Hive. In this article, we will see how can we use COLLECT_SET and COLLECT_LIST to get a list of comma-separated values for a particular column while doing grouping operation. While doing hive queries we have used group by operation very often to perform all kinds of aggregation operations like sum, count, max, etc. Consider there is a table with a schema like the following.

We want to calculate the average salary for each employee and also get a list of departments employee has worked.  If we try to run the query without dept_id in GROUP BY, Hive will complain.

 If we include dept_id in GROUP BY clause we will get a result like this.

Output for average salary query in hive

This is because each employee may have worked in more than one department. that is why we cannot include dept_id in GROUP BY.

Use case I

This is one of a use case where we can use COLLECT_SET and COLLECT_LIST. If we want to list all the departments for an employee we can just use COLLECT_SET which will return an array of DISTINCT dept_id for that employee.

collect_set query out put in hive

We can also use COLLECT_LIST if we do not want to remove duplicate values from a particular column.

COLLECT_SET and COLLECT_LIST return array for column type used. We can perform array operations like the size and contains.

collect_Set output query in hive

Use case II

This is one simple use case. But we can use COLLECT_SET and COLLECT_LIST according to needs to simplify the query. Suppose we want to know an employee has been promoted or not while getting his average salary. We can know about his promotion status from the title column. If an employee has more than two titles we can say that he has been promoted. We can write that query in the following way.

collect_Set output with case when clause in hive

In this way, we can use COLLECT_SET and COLLECT_LIST to simplify queries in certain conditions.

Mahesh Mogal

I am passionate about Cloud, Data Analytics, Machine Learning, and Artificial Intelligence. I like to learn and try out new things. I have started blogging about my experience while learning these exciting technologies.

Stay Updated with Latest Blogs

Get latest blogs delivered to your mail directly.

Recent Posts

Spark Join Types With Examples

In this blog, we are going to learn different spark join types. We will also write code and validate data output for each join type to better understand them.

Read More
Integrate Spark with Jupyter Notebook and Visual Studio Code

In this blog, we are going to integrate spark with jupyter notebook and visual studio code to create easy-to-use development environment.

Read More
Reading Data From SQL Tables in Spark

In this blog, we are going to learn about reading data from SQL tables in Spark. We will create Spark data frames from tables and query results as well.

Read More

Leave a Reply

Your email address will not be published. Required fields are marked *

linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram