Reading JSON data in Spark

JSON (Javascript Object Notation) is one of common file formats and there is out of box supports reading JSON data in Spark. In this blog, we are going to learn how to read JSON data from files, folders and different options provided by Spark.

Reading JSON data

We can read JSON data in multiple ways. We can either use format command for directly use JSON option with spark read function. In end, we will get data frame from our data.

df = spark.read.json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2010-summary.json")

df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)

df.show(2)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
+-----------------+-------------------+-----+
only showing top 2 rows

df = spark.read.json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2010-summary.json")

df.printSchema()

root

|-- DEST_COUNTRY_NAME: string (nullable = true)

|-- ORIGIN_COUNTRY_NAME: string (nullable = true)

|-- count: long (nullable = true)

df.show(2)

+-----------------+-------------------+-----+

|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|

+-----------------+-------------------+-----+

| United States| Romania| 1|

| United States| Ireland| 264|

+-----------------+-------------------+-----+

only showing top 2 rows

We can observe that spark has picked our schema and data types correctly when reading data from JSON file. Below are few variations we can use to read JSON data.

df2 = spark.read\
    .format("json")\
    .load("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2011-summary.json")
df2.show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|       Saint Martin|    2|
|    United States|             Guinea|    2|
+-----------------+-------------------+-----+
only showing top 2 rows

df2 = spark.read\

.format("json")\

.load("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2011-summary.json")

df2.show(2)

+-----------------+-------------------+-----+

|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|

+-----------------+-------------------+-----+

| United States| Saint Martin| 2|

| United States| Guinea| 2|

+-----------------+-------------------+-----+

only showing top 2 rows

df3 = spark.read\
    .format("json")\
    .option("inferschema", "true")\
    .option("path", "D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2012-summary.json")\
    .load()
df3.show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|            Ireland|  252|
+-----------------+-------------------+-----+
only showing top 2 rows

df3 = spark.read\

.format("json")\

.option("inferschema", "true")\

.option("path", "D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2012-summary.json")\

.load()

df3.show(2)

+-----------------+-------------------+-----+

|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|

+-----------------+-------------------+-----+

| United States| Croatia| 1|

| United States| Ireland| 252|

+-----------------+-------------------+-----+

only showing top 2 rows

Reading Multi-line JSON data

If we check our current data, we can see that it is line delimited. It means each row contains one record of data.

JSON data example — Line delimited JSON data

Sometimes, we may have one record spanning over multiple lines. If you are familiar with JSON already, you might have written JON data like below.

[
   {
      "ORIGIN_COUNTRY_NAME":"Romania",
      "DEST_COUNTRY_NAME":"United States",
      "count":1
   },
   {
      "ORIGIN_COUNTRY_NAME":"Ireland",
      "DEST_COUNTRY_NAME":"United States",
      "count":264
   }
]

[

{

"ORIGIN_COUNTRY_NAME":"Romania",

"DEST_COUNTRY_NAME":"United States",

"count":1

{

"ORIGIN_COUNTRY_NAME":"Ireland",

"DEST_COUNTRY_NAME":"United States",

"count":264

}

]

To read data like this, which is split on multiple lines, we have to pass multi line option as true.

df = spark.read\
    .option("multiline","true")\
    .json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\sample.json")
df.show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
+-----------------+-------------------+-----+

df = spark.read\

.option("multiline","true")\

.json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\sample.json")

df.show()

+-----------------+-------------------+-----+

|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|

+-----------------+-------------------+-----+

| United States| Romania| 1|

| United States| Ireland| 264|

+-----------------+-------------------+-----+

Reading Multiple JSON files at Once

We can pass path of directory / folder to Spark and it will read all JSON files in that location.

df = spark.read\
    .json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json")
df.count()

1514

df = spark.read\

.json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json")

df.count()

1514

Using Custom Schema with JSON files

Though spark can detect correct schema from JSON data, it is recommended to provide a custom schema for your data, especially in production loads. We can pass custom schema easily while reading JSON data in Spark.

from pyspark.sql.types import StructField, StructType, StringType,LongType
custom_schema = StructType([
    StructField("ORIGIN_COUNTRY_NAME",StringType(),False),
    StructField("DEST_COUNTRY_NAME",StringType(),False),
    StructField("count",LongType(),False)
])
df = spark.read\
    .schema(custom_schema)\
    .json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2010-summary.json")
df.show(2)

+-------------------+-----------------+-----+
|ORIGIN_COUNTRY_NAME|DEST_COUNTRY_NAME|count|
+-------------------+-----------------+-----+
|            Romania|    United States|    1|
|            Ireland|    United States|  264|
+-------------------+-----------------+-----+
only showing top 2 rows

from pyspark.sql.types import StructField, StructType, StringType,LongType

custom_schema = StructType([

StructField("ORIGIN_COUNTRY_NAME",StringType(),False),

StructField("DEST_COUNTRY_NAME",StringType(),False),

StructField("count",LongType(),False)

])

df = spark.read\

.schema(custom_schema)\

.json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2010-summary.json")

df.show(2)

+-------------------+-----------------+-----+

|ORIGIN_COUNTRY_NAME|DEST_COUNTRY_NAME|count|

+-------------------+-----------------+-----+

| Romania| United States| 1|

| Ireland| United States| 264|

+-------------------+-----------------+-----+

only showing top 2 rows

If you want to learn more about custom schema, then you can go read Adding Custom Schema to Spark Data frame.

When providing custom schema for JSON file, make sure that you provide same column names as of property names in JSON data. For example, if you have “ORIGIN_COUNTRY_NAME” as property is JSON data, then your column name should be same. If you specify any other column name, Spark will try to find out property value with that name and eventually put null value as it won’t find that property in data.

If you are little bit confused, lets look at example where i specify “ct” instead of “count” and check what we will get in data frame.

from pyspark.sql.types import StructField, StructType, StringType,LongType
custom_schema = StructType([
    StructField("ORIGIN_COUNTRY_NAME",StringType(),False),
    StructField("DEST_COUNTRY_NAME",StringType(),False),
    StructField("ct",LongType(),False)
])
df = spark.read\
    .schema(custom_schema)\
    .json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2010-summary.json")
df.show(2)

+-------------------+-----------------+----+
|ORIGIN_COUNTRY_NAME|DEST_COUNTRY_NAME|  ct|
+-------------------+-----------------+----+
|            Romania|    United States|null|
|            Ireland|    United States|null|
+-------------------+-----------------+----+
only showing top 2 rows

from pyspark.sql.types import StructField, StructType, StringType,LongType

custom_schema = StructType([

StructField("ORIGIN_COUNTRY_NAME",StringType(),False),

StructField("DEST_COUNTRY_NAME",StringType(),False),

StructField("ct",LongType(),False)

])

df = spark.read\

.schema(custom_schema)\

.json("D:\\code\\spark\\spark-basics\\data\\flight-data\\json\\2010-summary.json")

df.show(2)

+-------------------+-----------------+----+

|ORIGIN_COUNTRY_NAME|DEST_COUNTRY_NAME| ct|

+-------------------+-----------------+----+

| Romania| United States|null|

| Ireland| United States|null|

+-------------------+-----------------+----+

only showing top 2 rows

We are getting null in “ct” column as there is no field (property) named “ct” in our JSON data. So always have same column names from your JSON file/data when providing custom schema to Spark read command.

JSON Option	Acceptable Values	Purpose
dateFormat	String in Java's simpleDateFormat (yyyy-mm-dd)	Date format in data.
timestampFormat	Time stamp string in Java's simpleDateFormat	Time stamp format in data
maxColumns	Any integer	Maximum number of columns to be read from file
allowComments	true or false	Allowing comments in JSON data
allowSingleQuotes	true or false	reading JSON with single quotes
allowUnquotedFieldNames	true or false	reading JSON filed names(properties) without any quotes
multiline	true or false	Reading JSON data split on multiple lines

Conclusion

In this blog, we have learned how to read JSON data from Spark. We have also gone through most used options provided by spark when dealing with JSON data. You can find code in this blog at git repo. I hope you found this useful. See you in next Blog.

Reading JSON data in Spark