Sorting in Spark Dataframe
In this blog, we are going to write code to sort a Spark dataframe. We can sort our data based on one or more columns, just like we do in SQL. Spark provides two functions to sort data: “sort” & “orderBy”.
Both of these functions work in the same way. We will mostly be using “orderBy”, as its syntax is closer to SQL.
Sorting Dataframe based on Column Value
Consider our flight data; we want to sort our dataframe by the number of flights.
df_csv = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("data/flights.csv")

# sorting data on count
df_csv.sort("count").show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|          Moldova|      United States|    1|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
We can produce the same kind of output using “orderBy”, here sorting on two columns. As you can see, data is sorted in ascending order by default.
df_csv.orderBy("count", "DEST_COUNTRY_NAME").show(2)
We can also pass column expressions, built with column functions such as col, to our sorting functions.
from pyspark.sql.functions import col

df_csv.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|     Burkina Faso|      United States|    1|
|    Cote d'Ivoire|      United States|    1|
+-----------------+-------------------+-----+
Sorting data in Descending order
If we want to change the default sort order of a Spark dataframe, we have to use the desc function.
from pyspark.sql.functions import desc

df_csv.sort(desc("count")).show(2)
As seen in the output, we can sort data in descending order using Spark's inbuilt desc function.
I hope you found this useful. See you in the next blog.