Add, Rename, Drop Columns in Spark Dataframe
In this blog, we will go through some of the most common operations performed on columns of a Spark dataframe. We will start with how to select columns from a dataframe. After that, we will go through how to add, rename, and drop columns from a Spark dataframe. Let us get started.
Selecting Columns from Spark Dataframe
There are multiple ways to select columns from a dataframe. One of the easiest is to pass column names as strings to the select function of the dataframe.
df_csv = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("data/flights.csv")

df_csv.select("DEST_COUNTRY_NAME").show(2)
Spark also provides a few built-in functions to work with columns. Before we can use them, we need to import them.
from pyspark.sql.functions import col, column

df_csv.select(col("DEST_COUNTRY_NAME"), column("count")).show(2)

+-----------------+-----+
|DEST_COUNTRY_NAME|count|
+-----------------+-----+
|    United States|   15|
|    United States|    1|
+-----------------+-----+
only showing top 2 rows
There is another popular function, expr, which can be used both to select columns and to perform operations on them.
from pyspark.sql.functions import expr

df_csv.select(expr("DEST_COUNTRY_NAME as destination")).show(2)

# shorthand for select and "expr"
df_csv.selectExpr("DEST_COUNTRY_NAME as destination").show(2)
Listing Columns
Spark dataframes have a simple attribute, columns, which lists all the columns of a dataframe.
df_csv.columns
['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']
Adding Columns to Dataframe
Spark dataframes are immutable. That means you cannot change them once they are created. If you want to change a dataframe in any way, you need to create a new one.
In all of the following operations (adding, renaming, and dropping columns), I have not assigned the result to a new dataframe but only used it to print results. If you want to persist these changes, save the result to a new dataframe.
We can easily add a column using the withColumn function. We can use the expr function to compute the value of the new column.
df_csv.withColumn("is_international_flights", \
    expr("DEST_COUNTRY_NAME != ORIGIN_COUNTRY_NAME")) \
    .show(2)
Renaming Columns
We can use withColumn to rename a column of a dataframe.
df_csv.withColumn("destination", expr("DEST_COUNTRY_NAME")).show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|  destination|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|United States|
|    United States|            Croatia|    1|United States|
+-----------------+-------------------+-----+-------------+
You can see that this actually adds a new column with the new name to the dataframe. We could use select to remove the old column, but that is an extra step. Spark has another function, withColumnRenamed, which renames an existing column directly.
df_csv.withColumnRenamed("DEST_COUNTRY_NAME", "destination").show(2) |
Dropping Columns
Spark provides a simple function, drop, to remove columns from a dataframe.
df_csv.drop("count").show(2) |
Conclusion
We have gone through some basic operations to handle columns in a Spark dataframe. These will come in handy when analyzing data. Hope this helps. See you in the next blog.