Add, Rename, Drop Columns in Spark Dataframe
In this blog, we will go through some of the most common operations performed on columns of a Spark dataframe. We will start with how to select columns from a dataframe. After that, we will go through how to add, rename, and drop columns from a Spark dataframe. Let us get started.
Selecting Columns from Spark Dataframe
There are multiple ways to select columns from a dataframe. One of the easiest is to pass column names as strings to the select function of the dataframe.
df_csv = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("data/flights.csv")

df_csv.select("DEST_COUNTRY_NAME").show(2)
Spark also provides a few built-in functions to work with columns. Before we can use them, we need to import them.
from pyspark.sql.functions import col, column

df_csv.select(col("DEST_COUNTRY_NAME"), column("count")).show(2)

+-----------------+-----+
|DEST_COUNTRY_NAME|count|
+-----------------+-----+
|    United States|   15|
|    United States|    1|
+-----------------+-----+
only showing top 2 rows
There is another popular function, expr, which can be used both to select columns and to perform operations on them.
from pyspark.sql.functions import expr

df_csv.select(expr("DEST_COUNTRY_NAME as destination")).show(2)

# shorthand for select and "expr"
df_csv.selectExpr("DEST_COUNTRY_NAME as destination").show(2)
Listing Columns
Spark dataframes have a simple attribute, columns, which lists all the columns of a dataframe.
df_csv.columns
['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']
Adding Columns to Dataframe
Spark dataframes are immutable. That means you cannot change them once they are created. If you want to change a dataframe in any way, you need to create a new one.
In all of the following operations (adding, renaming, and dropping columns), I have not assigned the result to a new dataframe but only used it to print results. If you want to persist these changes, save the result to a new dataframe.
We can easily add a column using the withColumn function. We can use the expr function to compute the value of the new column.
df_csv.withColumn("is_international_flights", \
    expr("DEST_COUNTRY_NAME != ORIGIN_COUNTRY_NAME")) \
    .show(2)
Renaming Columns
We can use withColumn to rename a column of a dataframe.
df_csv.withColumn("destination", expr("DEST_COUNTRY_NAME")).show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|  destination|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|United States|
|    United States|            Croatia|    1|United States|
+-----------------+-------------------+-----+-------------+
You can see that this actually adds a new column with the new name to the dataframe. We could use select to remove the old column, but that is an extra step. Spark has another function, withColumnRenamed, which renames an existing column directly.
df_csv.withColumnRenamed("DEST_COUNTRY_NAME", "destination").show(2) |
Dropping Columns
Spark provides a simple function, drop, to remove columns from a dataframe.
df_csv.drop("count").show(2) |
Conclusion
We have gone through some basic operations to handle columns in a Spark dataframe. These will come in handy when analyzing data. Hope this helps. See you in the next blog.