In this blog, we are going to write code to sort a Spark dataframe. We can sort our data on one or more columns, just as we do in SQL. Spark provides two functions for sorting data: "sort" and "orderBy".
Both functions work the same way; in PySpark, "orderBy" is an alias for "sort". We will mostly use "orderBy", as it is closer to SQL syntax.
Consider our flight data. We want to sort our dataframe by the number of flights, which is stored in the "count" column.
```python
df_csv = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("data/flights.csv")

# sorting data on count
df_csv.sort("count").show(2)
```

```
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|          Moldova|      United States|    1|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
```
We can get similar output using "orderBy", which also accepts more than one column. As you can see in the output above, data is sorted in ascending order by default.
```python
df_csv.orderBy("count", "DEST_COUNTRY_NAME").show(2)
```
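When two columns are passed, the second column only matters for rows that tie on the first. The snippet below is a minimal plain-Python sketch of that ordering rule, so it can be run without a Spark session. The rows are made-up sample values, not the contents of flights.csv, and `sorted` stands in for Spark's sort only to illustrate the tie-breaking.

```python
# Illustrative rows only (not read from flights.csv):
# (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count)
rows = [
    ("United States", "Croatia", 1),
    ("Moldova", "United States", 1),
    ("Egypt", "United States", 15),
    ("Burkina Faso", "United States", 1),
]

# Model of df_csv.orderBy("count", "DEST_COUNTRY_NAME"):
# compare on count first, then use the destination name to break ties.
by_count_then_dest = sorted(rows, key=lambda r: (r[2], r[0]))

for row in by_count_then_dest:
    print(row)
```

The three rows with a count of 1 come out in alphabetical order of destination, and the row with the larger count comes last.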
We can also pass column expressions or column functions to our sorting functions.
```python
from pyspark.sql.functions import col

df_csv.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(2)
```

```
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|     Burkina Faso|      United States|    1|
|    Cote d'Ivoire|      United States|    1|
+-----------------+-------------------+-----+
```
If we want to change the default sort order for a Spark dataframe, we have to use the desc function.
```python
from pyspark.sql.functions import col, desc

# col("count").desc() and desc("count") both request descending order
df_csv.sort(col("count").desc()).show(2)
```
This sorts the data in descending order using Spark's built-in desc function.
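Sort directions can also be mixed, for example largest counts first with ties broken alphabetically. In PySpark this would typically be written as `df_csv.orderBy(col("count").desc(), col("DEST_COUNTRY_NAME").asc())`. The sketch below models the same rule in plain Python so it runs without a Spark session. The rows are made-up sample values, and negating the count is just one simple way to express "descending" for a numeric key.

```python
# Illustrative rows only (not read from flights.csv):
# (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count)
rows = [
    ("United States", "Croatia", 1),
    ("Ireland", "United States", 15),
    ("Moldova", "United States", 1),
    ("Egypt", "United States", 15),
]

# Model of orderBy(col("count").desc(), col("DEST_COUNTRY_NAME").asc()):
# negate the numeric count so larger values sort first,
# then fall back to the destination name in ascending order.
count_desc_dest_asc = sorted(rows, key=lambda r: (-r[2], r[0]))

for row in count_desc_dest_asc:
    print(row)
```

Here the two rows with a count of 15 come first, ordered by destination name, followed by the rows with a count of 1.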
I hope you found this useful. See you in the next blog.