In this blog, we will learn how to get distinct values from columns or rows in a Spark dataframe. We will also learn how to count distinct values. We will use the same flight data as in our earlier examples.
Consider that we want to get all combinations of source and destination countries from our data. We can easily do this using the following code.
```python
df_csv = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("data/flights.csv")

# getting distinct rows from a data frame
df_csv.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME") \
    .distinct() \
    .show(5)
```

```
+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|          Croatia|      United States|
|           Kosovo|      United States|
|          Romania|      United States|
|          Ireland|      United States|
|    United States|              Egypt|
+-----------------+-------------------+
only showing top 5 rows
```
In Spark, we can chain multiple operations one after another. Here we add a where clause before distinct, so that only pairs whose destination and origin countries differ are kept.
```python
df_csv.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME") \
    .where("DEST_COUNTRY_NAME != ORIGIN_COUNTRY_NAME") \
    .distinct() \
    .show(5)
```
We can also count distinct values by chaining the count function after the distinct function.
```python
df_csv.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME") \
    .distinct() \
    .count()
```

```
256
```

```python
df_csv.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME") \
    .where("DEST_COUNTRY_NAME != ORIGIN_COUNTRY_NAME") \
    .distinct() \
    .count()
```

```
255
```
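The two counts above differ by one (256 versus 255), which suggests that a single distinct pair has the same destination and origin country. To make the logic of distinct, where, and count concrete, here is a small plain-Python sketch that mirrors the same steps with sets. It does not use Spark at all, and the rows are made-up values for illustration, not taken from the flight data.

```python
# Hypothetical (destination, origin) pairs; the values are made up.
rows = [
    ("United States", "Romania"),
    ("United States", "Romania"),        # duplicate pair
    ("United States", "United States"),  # destination equals origin
    ("Egypt", "United States"),
    ("United States", "Egypt"),
]

# Like select(...).distinct(): keep each pair only once.
distinct_pairs = set(rows)

# Like where("DEST != ORIGIN").distinct(): filter first, then deduplicate.
distinct_cross_border = {
    (dest, origin) for dest, origin in rows if dest != origin
}

# Like count(): the number of rows that remain.
print(len(distinct_pairs))
print(len(distinct_cross_border))
```

In this toy data the filtered count is one less than the unfiltered count, because exactly one distinct pair has matching countries.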
If we want to drop all duplicate rows from the dataframe, we can also use the "dropDuplicates" function.
```python
df_csv.dropDuplicates().show(2)
```
I hope this helps. See you soon 🙂