Reading data from a file in Spark

Updated On September 13, 2020 | By Mahesh Mogal

Spark works with many file formats and data types. Before we can use Spark to process data, we need to load that data into Spark DataFrames. In this blog, we will learn how to load data from files into a Spark DataFrame.

In this blog series, we are mostly going to use Python with Spark (PySpark), but the same code translates directly to Scala, Java, or R. You can also download the data used in the examples at the GitHub link.

Reading JSON Files in Spark

We can use the following simple command to read data from a JSON file in Spark.

df = spark.read.format("json").load("data/flights.json")
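
These examples assume an active SparkSession named spark, which the pyspark shell and most notebooks provide automatically. In a standalone script you would create one yourself first; a minimal sketch (the app name is arbitrary):

# Create (or reuse) the SparkSession that all the read commands below go through
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("reading-files") \
    .getOrCreate()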

As JSON is self-describing, Spark can easily infer the schema from this file and show proper column names.

df.show(2)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 2 rows

df.printSchema()
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)

There is also a shorthand way to load JSON data in Spark.

df2 = spark.read.json("data/flights.json")
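
One thing to keep in mind: by default, Spark expects line-delimited JSON, with one record per line, which is what our flights file uses. If your records span multiple lines, you can set the "multiLine" option (the file name below is just a placeholder for illustration):

# Read JSON where a single record spans multiple lines
df_multi = spark.read.option("multiLine", "true").json("data/flights_multiline.json")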

Reading CSV files in Spark

We can use the same read command with the format set to "csv" to read CSV files in Spark. But you will see one problem with CSV files.

df_csv = spark.read.format("csv").load("data/flights.csv")

df_csv.show(2)
+-----------------+-------------------+-----+
|              _c0|                _c1|  _c2|
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
|    United States|            Romania|   15|
+-----------------+-------------------+-----+
only showing top 2 rows

Here we can see that Spark did not pick up the column names from the CSV file. Unlike JSON, a CSV file carries no schema information by itself, so Spark assigned default names (_c0, _c1, _c2) and loaded the header row as ordinary data. But there is an easy fix for this.

df_csv = spark.read.format("csv") \
             .option("inferSchema", "true") \
             .option("header", "true") \
             .load("data/flights.csv")

df_csv.show(2)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+

By setting "header" to true, we let Spark know that the first row of the file is a header rather than data, and setting "inferSchema" to true tells Spark to use the data to guess a type for each column. By default, both of these options are false.

There are many options we can set while reading data. If you do not want to write multiple "option" calls, there is also a shorthand way we can use.

df_csv1 = spark.read.csv("data/flights.csv", inferSchema="true", header="true")
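
Any read option can be passed as a keyword argument in the same way. For example, the "mode" option controls what Spark does with malformed rows while reading; here is a small sketch using the same flights file:

# mode="FAILFAST" raises an error on the first corrupt row;
# "DROPMALFORMED" drops such rows, and "PERMISSIVE" (the default) keeps them
df_strict = spark.read.csv("data/flights.csv", inferSchema="true", header="true", mode="FAILFAST")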

Reading TSV (tab-separated) or "|" (pipe-separated) data

What if our data is neither JSON nor CSV? Data can be tab-separated, pipe-separated, or delimited by any other character. We can read such files using the same CSV read command; we only need to specify the field separator character.

df_tsv = spark.read.csv("data/flights.tsv", sep="\t", inferSchema="true", header="true")

df_tsv.show(2)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 2 rows

In our read command, notice that we have specified the "sep" option. It tells Spark which character separates fields in the file. You can use the same command with "|" (pipe) or any other delimiter to read data in Spark.

df_pipe = spark.read \
              .format("csv") \
              .option("sep", "|") \
              .option("inferSchema", "true") \
              .option("header", "true") \
              .load("data/flights_pipe.txt")
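
The same read also works in the shorthand form we used earlier:

# Shorthand: "sep" is passed as a keyword argument to spark.read.csv
df_pipe = spark.read.csv("data/flights_pipe.txt", sep="|", inferSchema="true", header="true")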

Conclusion

In this blog, we have learned how to load data from files into Spark DataFrames. We have also given Spark a hint that it should try to infer the schema from our files. In the next blog, we will learn how to specify our own schema and use it when reading data.
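
As a small preview of that (a minimal sketch, using the column names we saw above), an explicit schema is built with StructType and passed to the reader instead of being inferred:

# Declare column names and types up front instead of asking Spark to guess
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("DEST_COUNTRY_NAME", StringType(), True),
    StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
    StructField("count", LongType(), True),
])

df = spark.read.csv("data/flights.csv", header="true", schema=schema)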
