Sorting in Spark Dataframe
In this blog, we are going to write code to sort a Spark dataframe. We can sort our data based on one or more columns, just like we do in SQL. Spark provides two functions to sort data: “sort” & “orderBy”.
Both of these functions work in the same way. We will mostly be using “orderBy”, as its syntax is closer to SQL.
Sorting Dataframe based on Column Value
Consider our flight data; we want to sort our dataframe by the number of flights.
df_csv = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("data/flights.csv")

# sorting data on count
df_csv.sort("count").show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|          Moldova|      United States|    1|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
We can produce the same kind of output using “orderBy”, here sorting on two columns. As you can see, data is sorted in ascending order by default.
df_csv.orderBy("count", "DEST_COUNTRY_NAME").show(2)
We can also pass column expressions, built with column functions such as col, to our sorting functions.
from pyspark.sql.functions import col

df_csv.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|     Burkina Faso|      United States|    1|
|    Cote d'Ivoire|      United States|    1|
+-----------------+-------------------+-----+
Sorting data in Descending order
If we want to change the default sort order of a Spark dataframe, we have to use the desc function.
from pyspark.sql.functions import desc

df_csv.sort(desc("count")).show(2)
As seen in the output, we can sort data in descending order using Spark's inbuilt desc function.
I hope you found this useful. See you in the next blog.