Reading JSON data in Spark
JSON (JavaScript Object Notation) is one of the most common file formats, and Spark supports reading JSON data out of the box. In this blog, we are going to learn how to read JSON data from files and folders, along with the different options provided by Spark.
Reading JSON data
We can read JSON data in multiple ways. We can either use the format method or call the json method directly on the Spark read API. In the end, we get a DataFrame from our data.
```python
df = spark.read.json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2010-summary.json")

df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)

df.show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
+-----------------+-------------------+-----+
only showing top 2 rows
```
We can observe that Spark has inferred our schema and data types correctly when reading data from the JSON file. Below are a few variations we can use to read JSON data.
```python
df2 = spark.read\
    .format("json")\
    .load("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2011-summary.json")

df2.show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|       Saint Martin|    2|
|    United States|             Guinea|    2|
+-----------------+-------------------+-----+
only showing top 2 rows
```
```python
df3 = spark.read\
    .format("json")\
    .option("inferSchema", "true")\
    .option("path", "D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2012-summary.json")\
    .load()

df3.show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|            Ireland|  252|
+-----------------+-------------------+-----+
only showing top 2 rows
```
Reading Multi-line JSON data
If we check our current data, we can see that it is line-delimited. It means each line contains one complete record of data.
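For reference, a line-delimited (JSON Lines) file looks something like this, with each line holding one self-contained JSON object (sample values taken from the flight data above):

```json
{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":1}
{"ORIGIN_COUNTRY_NAME":"Ireland","DEST_COUNTRY_NAME":"United States","count":264}
```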
Sometimes, we may have one record spanning multiple lines. If you are familiar with JSON already, you might have written JSON data like below.
[ { "ORIGIN_COUNTRY_NAME":"Romania", "DEST_COUNTRY_NAME":"United States", "count":1 }, { "ORIGIN_COUNTRY_NAME":"Ireland", "DEST_COUNTRY_NAME":"United States", "count":264 } ] |
To read data like this, which is split across multiple lines, we have to set the multiLine option to true.
```python
df = spark.read\
    .option("multiLine", "true")\
    .json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\sample.json")

df.show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
+-----------------+-------------------+-----+
```
Reading Multiple JSON files at Once
We can pass the path of a directory/folder to Spark, and it will read all JSON files in that location.
```python
df = spark.read\
    .json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json")

df.count()

1514
```
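Spark can also take an explicit list of file paths or a glob pattern, which is handy when we only want a subset of the files in a folder. A minimal sketch, reusing the file names from the examples above:

```python
# Reading an explicit list of JSON files into one DataFrame
df = spark.read.json([
    "D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2010-summary.json",
    "D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2011-summary.json"
])

# Glob patterns work as well, e.g. all matching summary files in the folder
df_all = spark.read.json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\201*-summary.json")
```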
Using Custom Schema with JSON files
Though Spark can infer the correct schema from JSON data, it is recommended to provide a custom schema for your data, especially in production loads. We can easily pass a custom schema while reading JSON data in Spark.
```python
from pyspark.sql.types import StructField, StructType, StringType, LongType

custom_schema = StructType([
    StructField("ORIGIN_COUNTRY_NAME", StringType(), False),
    StructField("DEST_COUNTRY_NAME", StringType(), False),
    StructField("count", LongType(), False)
])

df = spark.read\
    .schema(custom_schema)\
    .json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2010-summary.json")

df.show(2)

+-------------------+-----------------+-----+
|ORIGIN_COUNTRY_NAME|DEST_COUNTRY_NAME|count|
+-------------------+-----------------+-----+
|            Romania|    United States|    1|
|            Ireland|    United States|  264|
+-------------------+-----------------+-----+
only showing top 2 rows
```
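As a side note, Spark (2.3 and later) also accepts the schema as a DDL-formatted string, a compact alternative to building a StructType:

```python
# Same schema expressed as a DDL string instead of a StructType
df = spark.read\
    .schema("ORIGIN_COUNTRY_NAME STRING, DEST_COUNTRY_NAME STRING, count LONG")\
    .json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2010-summary.json")
```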
If you want to learn more about custom schemas, you can read Adding Custom Schema to Spark Data frame.
When providing a custom schema for a JSON file, make sure that your column names match the property names in the JSON data. For example, if you have "ORIGIN_COUNTRY_NAME" as a property in the JSON data, then your column name should be the same. If you specify any other column name, Spark will try to find a property with that name and end up putting a null value, as it won't find that property in the data.
If you are a little confused, let's look at an example where I specify "ct" instead of "count" and check what we get in the DataFrame.
```python
from pyspark.sql.types import StructField, StructType, StringType, LongType

custom_schema = StructType([
    StructField("ORIGIN_COUNTRY_NAME", StringType(), False),
    StructField("DEST_COUNTRY_NAME", StringType(), False),
    StructField("ct", LongType(), False)
])

df = spark.read\
    .schema(custom_schema)\
    .json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2010-summary.json")

df.show(2)

+-------------------+-----------------+----+
|ORIGIN_COUNTRY_NAME|DEST_COUNTRY_NAME|  ct|
+-------------------+-----------------+----+
|            Romania|    United States|null|
|            Ireland|    United States|null|
+-------------------+-----------------+----+
only showing top 2 rows
```
We are getting null in the "ct" column because there is no field (property) named "ct" in our JSON data. So always keep the column names the same as in your JSON file/data when providing a custom schema to the Spark read command.
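If we actually do want a different column name like "ct", one simple approach is to read with the original property names and rename afterwards. A sketch using withColumnRenamed and the first custom_schema above (the one with "count"):

```python
# Read with the property name that actually exists in the JSON data,
# then rename the column on the resulting DataFrame
df = spark.read\
    .schema(custom_schema)\
    .json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2010-summary.json")\
    .withColumnRenamed("count", "ct")
```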
More Options While Reading JSON Data
We have covered the most used Spark options when working with JSON data. There are a few more options which can be useful depending on your use case; a combined example follows the table below. If you need a detailed explanation of any of them, let me know and I will create a new blog for them as well.
| JSON Option | Acceptable Values | Purpose |
|---|---|---|
| dateFormat | String in Java's SimpleDateFormat (e.g. yyyy-MM-dd) | Date format in data |
| timestampFormat | Timestamp string in Java's SimpleDateFormat | Timestamp format in data |
| maxColumns | Any integer | Maximum number of columns to be read from the file |
| allowComments | true or false | Allows comments in JSON data |
| allowSingleQuotes | true or false | Allows reading JSON with single quotes |
| allowUnquotedFieldNames | true or false | Allows reading JSON field names (properties) without quotes |
| multiLine | true or false | Reads JSON data split across multiple lines |
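As a combined illustration, several of these options can be chained on a single read. The file path here is hypothetical, not part of the flight data used above:

```python
# Hypothetical example chaining multiple JSON options on one read
df = spark.read\
    .option("multiLine", "true")\
    .option("allowComments", "true")\
    .option("allowSingleQuotes", "true")\
    .option("dateFormat", "yyyy-MM-dd")\
    .json("D:\\code\\spark\\spark-basics\\data\\some-other-data.json")  # hypothetical path
```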
Conclusion
In this blog, we have learned how to read JSON data in Spark. We have also gone through the most used options provided by Spark when dealing with JSON data. You can find the code from this blog in the git repo. I hope you found this useful. See you in the next blog.