Reading data from a file in Spark

Updated On September 13, 2020 | By Mahesh Mogal

Spark works with many file formats and data types. Before we can use Spark to process data, we need to load that data into Spark DataFrames. In this blog, we will learn how to load data from files into a Spark DataFrame.

In this blog series, we are mostly going to use Python with Spark (PySpark), but the same code translates directly to Scala, Java, or R. You can also download the data used in the examples at the GitHub link.

Reading JSON Files in Spark

We can use the following simple command to read data from a JSON file in Spark.

df = spark.read.format("json").load("data/flights.json")
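
These examples assume an active SparkSession named spark, which the pyspark shell and most notebooks provide automatically. In a standalone script you would create one yourself first; a minimal sketch (the app name is arbitrary):

# Create (or reuse) the SparkSession that all the read commands below go through
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("reading-files") \
    .getOrCreate()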

As JSON is self-describing, Spark can easily infer the schema from this file and show proper column names.

df.show(2)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 2 rows

df.printSchema()
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)

There is also a shorthand way to load JSON data in Spark.

df2 = spark.read.json("data/flights.json")
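
One thing to keep in mind: by default, Spark expects line-delimited JSON, with one record per line, which is what our flights file uses. If your records span multiple lines, you can set the "multiLine" option (the file name below is just a placeholder for illustration):

# Read JSON where a single record spans multiple lines
df_multi = spark.read.option("multiLine", "true").json("data/flights_multiline.json")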

Reading CSV files in Spark

We can use the same read command with the format set to "csv" to read CSV files in Spark. But you will see one problem with CSV files.

df_csv = spark.read.format("csv").load("data/flights.csv")

df_csv.show(2)
+-----------------+-------------------+-----+
|              _c0|                _c1|  _c2|
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
|    United States|            Romania|   15|
+-----------------+-------------------+-----+
only showing top 2 rows

Here we can see that Spark did not pick up the column names from the CSV file. Unlike JSON, a CSV file carries no schema information by itself, so Spark assigned default names (_c0, _c1, _c2) and loaded the header row as ordinary data. But there is an easy fix for this.

df_csv = spark.read.format("csv") \
             .option("inferSchema", "true") \
             .option("header", "true") \
             .load("data/flights.csv")

df_csv.show(2)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+

By setting "header" to true, we let Spark know that the first row of the file is a header rather than data, and setting "inferSchema" to true tells Spark to use the data to guess a type for each column. By default, both of these options are false.

There are many options we can set while reading data. If you do not want to write multiple "option" calls, there is also a shorthand way we can use.

df_csv1 = spark.read.csv("data/flights.csv", inferSchema="true", header="true")
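
Any read option can be passed as a keyword argument in the same way. For example, the "mode" option controls what Spark does with malformed rows while reading; here is a small sketch using the same flights file:

# mode="FAILFAST" raises an error on the first corrupt row;
# "DROPMALFORMED" drops such rows, and "PERMISSIVE" (the default) keeps them
df_strict = spark.read.csv("data/flights.csv", inferSchema="true", header="true", mode="FAILFAST")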

Reading TSV (tab-separated) or "|" (pipe-separated) data

What if our data is neither JSON nor CSV? Data can be tab-separated, pipe-separated, or delimited by any other character. We can read such files using the same CSV read command; we only need to specify the field separator character.

df_tsv = spark.read.csv("data/flights.tsv", sep="\t", inferSchema="true", header="true")

df_tsv.show(2)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 2 rows

In our read command, notice that we have specified the "sep" option. It tells Spark which character separates fields in the file. You can use the same command with "|" (pipe) or any other delimiter to read data in Spark.

df_pipe = spark.read \
              .format("csv") \
              .option("sep", "|") \
              .option("inferSchema", "true") \
              .option("header", "true") \
              .load("data/flights_pipe.txt")
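
The same read also works in the shorthand form we used earlier:

# Shorthand: "sep" is passed as a keyword argument to spark.read.csv
df_pipe = spark.read.csv("data/flights_pipe.txt", sep="|", inferSchema="true", header="true")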

Conclusion

In this blog, we have learned how to load data from files into Spark DataFrames. We have also given Spark a hint that it should try to infer the schema from our files. In the next blog, we will learn how to specify our own schema and use it when reading data.
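
As a small preview of that (a minimal sketch, using the column names we saw above), an explicit schema is built with StructType and passed to the reader instead of being inferred:

# Declare column names and types up front instead of asking Spark to guess
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("DEST_COUNTRY_NAME", StringType(), True),
    StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
    StructField("count", LongType(), True),
])

df = spark.read.csv("data/flights.csv", header="true", schema=schema)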
