Distinct Rows and Distinct Count from Spark Dataframe
In this blog, we will learn how to get distinct values from columns or rows in a Spark dataframe, and how to count those distinct values. We will use the same flight data as in earlier examples.
Distinct Values from a Dataframe
Suppose we want all combinations of source and destination countries in our data. We can get them with the following code.
df_csv = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("data/flights.csv")

# getting distinct rows from a dataframe
df_csv.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME") \
    .distinct() \
    .show(5)

+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|          Croatia|      United States|
|           Kosovo|      United States|
|          Romania|      United States|
|          Ireland|      United States|
|    United States|              Egypt|
+-----------------+-------------------+
only showing top 5 rows
In Spark, we can chain multiple operations one after another. Here we combine a where clause with distinct.
df_csv.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME") \
    .where("DEST_COUNTRY_NAME != ORIGIN_COUNTRY_NAME") \
    .distinct() \
    .show(5)
Counting Distinct Values
We can also easily count distinct values by chaining the count function after the distinct function.
df_csv.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME") \
    .distinct() \
    .count()

256
df_csv.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME") \
    .where("DEST_COUNTRY_NAME != ORIGIN_COUNTRY_NAME") \
    .distinct() \
    .count()

255
Using the dropDuplicates Function
If we want to drop all duplicate rows from the dataframe, we can also use the “dropDuplicates” function.
df_csv.dropDuplicates().show(2) |
I hope this helps. See you soon 🙂