Sorting in Spark Dataframe

Updated On September 13, 2020 | By Mahesh Mogal

In this blog, we are going to write code to sort a Spark dataframe. We can sort our data by one or more columns, just as we do in SQL. Spark provides two functions to sort data: "sort" and "orderBy".

Both of these functions work in the same way. We will mostly be using "orderBy" because it is closer to SQL syntax.

Sorting Dataframe based on Column Value

Consider our flight data. We want to sort the dataframe by the number of flights, which is stored in the "count" column.

df_csv = spark.read.format("csv") \
        .option("inferSchema", "true") \
        .option("header","true") \
        .load("data/flights.csv")

# sort rows by the count column (ascending by default)
df_csv.sort("count").show(2)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|          Moldova|      United States|    1|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+

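As the output above shows, data is sorted in ascending order by default. We can sort the same way using "orderBy". Here we also pass "DEST_COUNTRY_NAME" as a second sort column, which orders the rows that have the same count.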

df_csv.orderBy("count", "DEST_COUNTRY_NAME").show(2)

We can also pass column expressions, such as those built with the "col" function, to our sorting functions.

from pyspark.sql.functions import col
df_csv.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(2)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|     Burkina Faso|      United States|    1|
|    Cote d'Ivoire|      United States|    1|
+-----------------+-------------------+-----+

Sorting Data in Descending Order

If we want to change the default sorting order of a Spark dataframe, we have to use "desc". It is available both as a method on a column and as a standalone function.

from pyspark.sql.functions import col, desc
# col("count").desc() and desc("count") are equivalent
df_csv.sort(col("count").desc()).show(2)

This way we can sort data in descending order using Spark's built-in desc function, so the rows with the highest count come first.

I hope you found this useful. See you in the next blog.


