Add, Rename, Drop Columns in Spark Dataframe

Updated On September 13, 2020 | By Mahesh Mogal

In this blog, we will go through some of the most commonly used column operations on a Spark dataframe. We will start with how to select columns from a dataframe. After that, we will go through how to add, rename, and drop columns. Let us get started.

Selecting Columns from Spark Dataframe

There are multiple ways to select columns from a dataframe. One of the easiest is to pass the column names as strings to the dataframe's select function.

df_csv = spark.read.format("csv") \
        .option("inferSchema", "true") \
        .option("header","true") \
        .load("data/flights.csv")

df_csv.select("DEST_COUNTRY_NAME").show(2)

Spark also provides a few built-in functions to work with columns. Before we can use them, we need to import them.

from pyspark.sql.functions import col, column

df_csv.select(col("DEST_COUNTRY_NAME"), column("count")).show(2)

+-----------------+-----+
|DEST_COUNTRY_NAME|count|
+-----------------+-----+
|    United States|   15|
|    United States|    1|
+-----------------+-----+
only showing top 2 rows

There is another popular function, "expr", which can be used to select columns and perform operations on them.

from pyspark.sql.functions import expr

df_csv.select(expr("DEST_COUNTRY_NAME as destination")).show(2)

# shorthand for select and "expr"
df_csv.selectExpr("DEST_COUNTRY_NAME as destination").show(2)

Listing Columns

Spark dataframes have a simple attribute, columns, which you can use to list all the columns of a dataframe.

df_csv.columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']

Adding Columns to Dataframe

Spark dataframes are immutable. That means you cannot change them once they are created. If you want to change a dataframe in any way, you need to create a new one.

In all of the following operations (adding, renaming, and dropping columns), I have not saved the result to a new dataframe but have just printed it. If you want to keep these changes, assign the result to a new dataframe.

We can easily add a column using the withColumn function. We can use the "expr" function to compute the value of the new column.

df_csv.withColumn("is_international_flights", \
    expr("DEST_COUNTRY_NAME != ORIGIN_COUNTRY_NAME")) \
    .show(2)

(Image: Adding New Column to Spark Dataframe)

Renaming Columns

One way to rename a column is to use withColumn with the new name.

df_csv.withColumn("destination", expr("DEST_COUNTRY_NAME")).show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|  destination|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|United States|
|    United States|            Croatia|    1|United States|
+-----------------+-------------------+-----+-------------+

You can see that this actually adds a new column with the new name to the dataframe, while the old column is still there. We can use select to remove the old column, but that is an extra step. Spark has another function that renames an existing column directly.

df_csv.withColumnRenamed("DEST_COUNTRY_NAME", "destination").show(2)

(Image: Renaming Column in Spark Dataframe)

Dropping Column

Spark provides a simple function to drop columns from a dataframe.

df_csv.drop("count").show(2)

(Image: Dropping Column From Spark Dataframe)

Conclusion

We have gone through some basic operations for handling columns in a Spark dataframe. These will be useful when analyzing data. Hope this helps. See you in the next blog.
