In this blog, we will learn how to filter rows from a Spark DataFrame using the where and filter functions.
Filtering rows from a DataFrame is one of the most basic tasks performed when analyzing data with Spark. Spark provides two functions for this: where and filter. Both work exactly the same way (where is simply an alias for filter), but we will mostly use where because it is familiar from SQL.
Using the where function, we can easily filter rows with conditions just as we would in SQL. Say we need to find all rows where the number of flights between two countries is more than 50.
df_csv = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("data/flights.csv")

df_csv.where("count > 50").show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Ireland|  344|
|    United States|              India|   62|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|    United States|       Sint Maarten|  325|
+-----------------+-------------------+-----+
only showing top 5 rows
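Because where is just an alias for filter in the DataFrame API, the same SQL-style expression can be passed to either one. As a quick check (reusing the df_csv loaded above), the following returns exactly the same rows:

# where is an alias for filter, so this gives the same result as the where example above
df_csv.filter("count > 50").show(5)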
We can also use column expressions. This time we will use the filter function to get the desired rows from the DataFrame.
from pyspark.sql.functions import col

df_csv.filter(col("DEST_COUNTRY_NAME") == "United States").show(5)
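Column expressions can also be combined into a single condition. As a small sketch (again using the same df_csv and columns from the flights data above), the two conditions below are joined with &, and each one is wrapped in parentheses as PySpark requires:

from pyspark.sql.functions import col

# combine two column-expression conditions with & (note the parentheses around each)
df_csv.filter(
    (col("DEST_COUNTRY_NAME") == "United States") & (col("count") > 50)
).show(5)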
Though it is possible to combine multiple where conditions into one statement, it is not necessary. Even when we chain multiple conditions one after another, Spark optimizes them into a single step while creating the physical plan for execution.
That is why it is usually better to write multiple where conditions separately: the code is easier to understand when reading it, and there is no performance penalty.
df_csv.where("DEST_COUNTRY_NAME == 'United States'") \
    .where("count > 50") \
    .show(5)
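If you want to see this optimization for yourself, you can print the physical plan with explain(). The exact output depends on your Spark version, but the two chained conditions typically appear as a single combined Filter operator:

# inspect the physical plan; both conditions are usually collapsed into one Filter
df_csv.where("DEST_COUNTRY_NAME == 'United States'") \
    .where("count > 50") \
    .explain()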
I hope you found this useful :). See you in the next blog.