Adding Custom Schema to Spark Dataframe

Updated On September 13, 2020 | By Mahesh Mogal

In the last blog, we loaded our data into a Spark DataFrame. We also used the "inferSchema" option to let Spark figure out the schema of the DataFrame on its own. But in many cases, you will want to specify a schema for the DataFrame yourself. This gives you much better control over column names and, especially, data types. Let us see how we can add a custom schema while reading data in Spark.
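
For comparison, this is roughly how the same file would be loaded with "inferSchema" as in the last blog (a sketch; the exact options used there may differ):

# Let Spark guess column types from the data instead of specifying them
df_inferred = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("data/flights.csv")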

Adding Custom Schema

In Spark, a schema is a StructType, which holds an array of StructField objects. Each StructField takes 4 parameters.

  • Column Name
  • Data type of that column
  • Boolean value indicating whether values in this column can be null
  • Metadata - an optional parameter that can be used to add additional information about the column

Let us write our custom schema for our flights data.

from pyspark.sql.types import StructField, StructType, StringType, LongType

custom_schema = StructType([
    StructField("destination", StringType(), True),
    StructField("source", StringType(), True),
    StructField("total_flights", LongType(), True),
])

df = spark.read.format("csv") \
    .schema(custom_schema) \
    .option("header", True) \
    .load("data/flights.csv")

df.show(2)
+-------------+-------+-------------+
|  destination| source|total_flights|
+-------------+-------+-------------+
|United States|Romania|           15|
|United States|Croatia|            1|
+-------------+-------+-------------+
only showing top 2 rows

Now our DataFrame has the column names and data types we specified. If you want to print the schema of any DataFrame, you can use the function below.

df.printSchema()
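
For the custom schema above, the output looks like this:

root
 |-- destination: string (nullable = true)
 |-- source: string (nullable = true)
 |-- total_flights: long (nullable = true)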

Using Metadata With Custom Schema

We can add extra information about columns using the metadata field. This field takes key-value pairs, and we can choose any number of keys and values depending on our needs.

custom_schema_with_metadata = StructType([
    StructField("destination", StringType(), True, metadata={"desc": "destination country for flight"}),
    StructField("source", StringType(), True,  metadata={"desc": "source country of flight"}),
    StructField("total_flights", LongType(), True, metadata={"desc": "Number of flights"}),
])

df = spark.read.format("csv") \
    .schema(custom_schema_with_metadata) \
    .option("header", True) \
    .load("data/flights.csv")

We can check our DataFrame and its schema now.

[Figure: Custom schema with metadata]
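
In code, the check is just the same calls we used earlier:

df.show(2)
df.printSchema()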

If you want to check the schema along with its metadata, use the following code. You can read the whole schema this way, or read the metadata of a single column.

# Full schema, including metadata, as a JSON string
df.schema.json()

# Metadata of the first field in the schema; returns "destination country for flight"
df.schema.fields[0].metadata["desc"]
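
You can also look a field up by column name instead of position; a small sketch using the schema we defined above:

# Same lookup, but by column name; prints 'Number of flights'
print(df.schema["total_flights"].metadata["desc"])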

This is how we can add a custom schema to our DataFrames. I hope this helps. See you in the next blog.
