Aggregation Functions in Spark

Aggregation functions are an important part of big data analytics. When processing data, we need many different functions, so it helps that Spark provides a large set of built-in ones. In this blog, we are going to learn about aggregation functions in Spark.

Count

This is one of the most basic functions. We can use it to count the number of records, or to count the non-null values in a specified column. Let us see an example.

We can also use count in a select expression on a DataFrame.

If you want to know more about how to run SQL queries on Spark DataFrames, you can read Running SQL queries on Spark DataFrames.

Count Distinct

We can also count the number of distinct values in a column. For example, when we count the number of countries, we surely should not get 1506. Here, we need to use the count distinct function.

Approximate Count Distinct

When we are dealing with huge data sets, we often do not need an exact distinct count, and an approximate value is good enough. Computing the approximation runs much faster than the count distinct function.

We can see that the approximate result of 164 is close to the actual value of 170. The approx count distinct function also accepts a second parameter that sets the maximum acceptable estimation error, expressed as a relative standard deviation (0.05 by default).

We can also run a SQL query to get the approximate count.

First and Last

With the first and last functions, we can get the first and last values of a column in the DataFrame.

Min and Max

With these aptly named functions, we can find the minimum and maximum values of a column in the DataFrame.

Sum

Another available function is sum, which we can use to add up all the values in a column.

Sum Distinct

Like count distinct, we can also sum only the distinct values in a column. The example below does not make a lot of practical sense, but it should give you an idea of how to use the sum distinct function.

Average

Though we can calculate the average of a column by dividing the sum of its values by their count, there is a built-in average function available as well.

Collect Set and Collect List

We can also aggregate the values of a column using the collect set and collect list functions. Both functions create an array from all the values of that column. The only difference is that collect set removes duplicates, whereas collect list keeps them.

Grouping Data

Until now, we have computed aggregations at the DataFrame level. We can also split the data into groups based on some value and compute aggregates for each group. Below are some examples of this.

Conclusion

In this blog, we have gone through the basic aggregation functions in Spark. There are many more functions available, and we will go over them in the next few blogs. You can find the code from this blog on GitHub. See you in the next article.
