Aggregation Functions in Spark

Updated On March 28, 2021 | By Mahesh Mogal

Aggregation functions are an important part of big data analytics. When processing data, we need many different functions, so it is a good thing Spark provides so many built-in ones. In this blog, we are going to learn aggregation functions in Spark.


Count

Count is one of the most basic functions. We can count the total number of records, or specify a column to count only its non-null values. Let us see an example.

We can also use count inside a select expression on the data frame.

If you want to know more about how to run SQL queries on spark data frames, you can read Running SQL queries on Spark DataFrames.

Count Distinct

We can also count the number of distinct values in a column. For example, when we count the number of countries, surely we should not get 1506 (the total record count). Here, we need to use the count distinct function.

Approximate Count Distinct

When we are dealing with huge data sets, we often do not need an exact distinct count; an approximate value is good enough. This runs much faster than the count distinct function.

We can see that the approximate count distinct output of 164 is close to the actual value of 170. The approx count distinct function also accepts a second parameter that sets the maximum acceptable estimation error while calculating the distinct count.

We can also run SQL query to get approximate count.

First and Last

With the first and last functions, we can get the first and last values of a column from the data frame.

Min and Max

With these aptly named functions, we can find the minimum and maximum values of a column in the data frame.


Sum

Another available function is sum, which we can use to add up all values in a column.

Sum Distinct

Like count distinct, we can also sum only the distinct values of a column. The example below may not make a lot of sense on its own, but it should give you an idea of how to use the sum distinct function.


Average

Though we could calculate an average ourselves as the sum of a column's values divided by their count, there is a built-in average function available as well.

Collect Set and Collect List

We can also aggregate values from a column using the collect set and collect list functions. Both functions create an array from all values of that column. The only difference is that collect set contains no duplicates, whereas collect list keeps duplicate values as well.

Grouping Data

So far, we have done aggregations at the whole data frame level. We can also split the data into groups based on some column and compute aggregate values per group. Below are some examples of this.


In this blog, we have gone through the basic aggregation functions in Spark. There are many more functions available, and we will go over them in the next few blogs. You can find the code from this blog on GitHub. See you in the next article.
