Select and expr are among the most used functions when working with Spark data frames. In this blog, we will learn different things we can do with the select and expr functions.
Selecting columns is one of the most common data frame operations, and we can use select together with expr to do it. Before using the expr function, we need to import it.
df_csv = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("data/flights.csv")

# selecting columns
from pyspark.sql.functions import expr

df_csv.select(expr("count")).show(2)
A more interesting use case for expr is to perform different operations on column data. We can use it to compute the length of a column, extract data, or do anything else we could do in SQL.
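For example, here is a minimal sketch of calling SQL string functions inside expr (length and upper are standard Spark SQL functions; the name_length and name_upper aliases are just illustrative):

df_csv.select(
    expr("DEST_COUNTRY_NAME"),
    expr("length(DEST_COUNTRY_NAME)").alias("name_length"),  # string length via SQL
    expr("upper(DEST_COUNTRY_NAME)").alias("name_upper")     # upper-casing via SQL
).show(2)

As another example, we can evaluate a boolean expression on the count column: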
df_csv.select(expr("count"), expr("count > 10")).show(2)

+-----+------------+
|count|(count > 10)|
+-----+------------+
|   15|        true|
|    1|       false|
+-----+------------+
In the above code, we print whether the value in the count column is greater than 10 or not. You can see that the generated column name is not very user friendly. Just like in SQL, we can give it a usable name.
df_csv.select(expr("count"), expr("count > 10 as if_greater_than_10")).show(2)
We can also use the alias function to give user-friendly names to columns.
df_csv.select(expr("count"), expr("count > 10").alias("if_greater_than_10")).show(2)
We might need to select other columns from the data frame along with the newly created expression column. This is easy: we can give a comma-separated list of columns, or use "*" to select all columns from the data frame.
df_csv.select("*", expr("count > 10").alias("if_greater_than_10")).show(2)
We can use expr to rename columns in a data frame. Note that this actually creates a new column instead of renaming the old one, as the output below shows.
df_csv.select("*", expr("DEST_COUNTRY_NAME").alias("dest")).show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|         dest|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|United States|
|    United States|            Croatia|    1|United States|
+-----------------+-------------------+-----+-------------+
Select with expr is used so widely when working with Spark data frames that the Spark team has provided a shorthand for it: the selectExpr function.
df_csv.selectExpr("count", "count > 10 as if_greater_than_10").show(2)
df_csv.selectExpr("*", "DEST_COUNTRY_NAME as dest").show(2)
I hope you found this useful. See you in the next blog.