Reading JSON data in Spark

JSON (JavaScript Object Notation) is one of the most common file formats, and Spark supports reading JSON data out of the box. In this blog, we are going to learn how to read JSON data from files and folders, and look at the different options Spark provides for it.

Reading JSON data

We can read JSON data in multiple ways. We can either use the format command or directly use the json option with the Spark read function. In the end, we get a data frame from our data.
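Here is a minimal PySpark sketch of both approaches. The function names and the file path in the usage comment are my own placeholders, and `spark` is assumed to be an existing SparkSession.

```python
def read_with_format(spark, path):
    """Generic reader: name the format first, then load the path."""
    return spark.read.format("json").load(path)


def read_with_json(spark, path):
    """Shortcut method that gives the same result as the generic reader."""
    return spark.read.json(path)


# Usage, assuming pyspark is installed and the (hypothetical) file exists:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("read-json").getOrCreate()
#   df = read_with_json(spark, "data/flights.json")
#   df.printSchema()
#   df.show(5)
```

Both functions return a data frame, so which one you pick is mostly a matter of taste.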

We can observe that Spark has picked up our schema and data types correctly when reading data from the JSON file. Below are a few variations we can use to read JSON data.

Reading Multi-line JSON data

If we check our current data, we can see that it is line delimited. It means each line contains exactly one record.

Example: line delimited JSON data
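To make the idea concrete, here is a small sketch in plain Python that produces line delimited JSON. The two records are made-up sample values; only the property names "ORIGIN_COUNTRY_NAME" and "count" come from the data used in this blog.

```python
import json

# Made-up sample records, for illustration only.
records = [
    {"ORIGIN_COUNTRY_NAME": "India", "count": 62},
    {"ORIGIN_COUNTRY_NAME": "Ireland", "count": 344},
]


def to_json_lines(rows):
    """Serialise each row as one JSON object on its own line."""
    return "\n".join(json.dumps(row) for row in rows)


print(to_json_lines(records))
# {"ORIGIN_COUNTRY_NAME": "India", "count": 62}
# {"ORIGIN_COUNTRY_NAME": "Ireland", "count": 344}
```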

Sometimes, we may have one record spanning multiple lines. If you are familiar with JSON already, you might have written JSON data like this, with each property on its own line.

To read data like this, which is split across multiple lines, we have to set the multiLine option to true.
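A rough sketch of both sides: what a multi-line record looks like, and how the option could be passed in PySpark. The record values and function names are placeholders of mine, and `spark` is assumed to be an existing SparkSession.

```python
import json

# Made-up sample record, for illustration only.
record = {"ORIGIN_COUNTRY_NAME": "India", "count": 62}


def to_multiline_json(row):
    """Pretty-print one record so that it spans several lines."""
    return json.dumps(row, indent=2)


def read_multiline_json(spark, path):
    """Read JSON where a single record may span multiple lines."""
    return spark.read.option("multiLine", True).json(path)


print(to_multiline_json(record))
```

Without the option, Spark treats every line as a separate JSON document, so a pretty-printed file like the one printed here would not parse as a single record.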

Reading Multiple JSON files at Once

We can pass the path of a directory/folder to Spark, and it will read all the JSON files in that location into a single data frame.
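As a sketch, the helper below writes a few line delimited files into one folder, and the reader function then points Spark at the folder instead of a single file. The file naming (`part-0.json`, ...) and function names are assumptions made for this example.

```python
import json
import os


def write_json_files(folder, batches):
    """Write each batch of rows to its own line delimited JSON file."""
    os.makedirs(folder, exist_ok=True)
    paths = []
    for i, rows in enumerate(batches):
        path = os.path.join(folder, f"part-{i}.json")
        with open(path, "w", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row) + "\n")
        paths.append(path)
    return paths


def read_json_folder(spark, folder):
    """Pass the folder itself; Spark reads the JSON files inside it."""
    return spark.read.json(folder)
```

In PySpark, `spark.read.json` also accepts a list of paths, which can help when the files you need live in different folders.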

Using Custom Schema with JSON files

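Though Spark can usually infer the correct schema from JSON data, it is recommended to provide a custom schema for your data, especially for production loads. An explicit schema saves Spark from scanning the data to infer types and keeps column types stable from run to run. We can pass a custom schema easily while reading JSON data in Spark.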
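One way to do this in PySpark is sketched below. I am using a DDL-style schema string to keep the example short; a `StructType` built from `pyspark.sql.types` works as well. The column types and the function name are my assumptions, and `spark` is assumed to be an existing SparkSession.

```python
# Column names must match the JSON property names; the types are assumed.
FLIGHT_SCHEMA = "ORIGIN_COUNTRY_NAME STRING, count LONG"


def read_json_with_schema(spark, path, schema=FLIGHT_SCHEMA):
    """Read JSON with an explicit schema instead of letting Spark infer it."""
    return spark.read.schema(schema).json(path)


# Usage, assuming pyspark is installed and the (hypothetical) file exists:
#   df = read_json_with_schema(spark, "data/flights.json")
#   df.printSchema()
```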

If you want to learn more about custom schemas, you can read Adding Custom Schema to Spark Data frame.

When providing a custom schema for a JSON file, make sure that your column names match the property names in the JSON data. For example, if you have “ORIGIN_COUNTRY_NAME” as a property in the JSON data, then your column name should be the same. If you specify any other column name, Spark will look for a property with that name and, as it won’t find one in the data, fill the column with null values.

If you are a little bit confused, let's look at an example where I specify “ct” instead of “count” and check what we get in the data frame.
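The sketch below shows the mismatched schema next to a plain-Python imitation of the name lookup. The imitation is only there to show the idea, it is not how Spark is implemented, and the sample record is made up.

```python
# Schema with the wrong column name: "ct" instead of "count".
WRONG_SCHEMA = "ORIGIN_COUNTRY_NAME STRING, ct LONG"


def read_json_with_schema(spark, path, schema):
    """Read JSON with the given schema (spark is an existing SparkSession)."""
    return spark.read.schema(schema).json(path)


def project(record, columns):
    """Imitate the lookup by name: a missing property becomes None (null)."""
    return {col: record.get(col) for col in columns}


row = {"ORIGIN_COUNTRY_NAME": "India", "count": 62}  # made-up sample record
print(project(row, ["ORIGIN_COUNTRY_NAME", "ct"]))
# {'ORIGIN_COUNTRY_NAME': 'India', 'ct': None}
```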

We are getting null in the “ct” column as there is no field (property) named “ct” in our JSON data. So always use the same column names as in your JSON file/data when providing a custom schema to the Spark read command.

More Options While Reading JSON Data

We have covered the most used Spark options for working with JSON data. There are a few more options which can be useful depending on your use case. If you need a detailed explanation of any of them, let me know and I will write a new blog on them as well.

JSON Option | Acceptable Values | Purpose
dateFormat | Pattern string in Java's SimpleDateFormat style (e.g. yyyy-MM-dd) | Date format in data
timestampFormat | Pattern string in Java's SimpleDateFormat style | Timestamp format in data
maxColumns | Any integer | Maximum number of columns to be read from file (Spark documents this as a CSV option, so verify it before relying on it for JSON)
allowComments | true or false | Allow comments in JSON data
allowSingleQuotes | true or false | Read JSON with single quotes
allowUnquotedFieldNames | true or false | Read JSON field names (properties) without any quotes
multiLine | true or false | Read JSON data split across multiple lines
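Several of these options can be set in one go. Below is a small sketch using the reader's `options` method; the particular combination of options and the function name are just my example, and `spark` is assumed to be an existing SparkSession.

```python
# Options for reading slightly "relaxed" JSON; this selection is illustrative.
RELAXED_JSON_OPTIONS = {
    "allowComments": True,
    "allowSingleQuotes": True,
    "allowUnquotedFieldNames": True,
    "dateFormat": "yyyy-MM-dd",
}


def read_relaxed_json(spark, path, options=RELAXED_JSON_OPTIONS):
    """Apply several JSON reader options at once, then read the path."""
    return spark.read.options(**options).json(path)
```

Keeping the options in a dictionary makes it easy to reuse the same settings across several reads.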

Conclusion

In this blog, we have learned how to read JSON data in Spark. We have also gone through the most used options Spark provides for dealing with JSON data. You can find the code from this blog in the git repo. I hope you found this useful. See you in the next blog.
