Reading data from a file in Spark
Spark works with many file formats and data types. Before we can use Spark to process data, we need to load that data into Spark DataFrames. In this blog, we will learn how to load data into a Spark DataFrame.
In this blog series, we are mostly going to use Python with Spark (PySpark), but the same code translates easily to Scala, Java, or R. You can also download the data used in these examples from the GitHub link.
Reading JSON File in Spark
We can use the following simple command to read data from a JSON file in Spark.
df = spark.read.format("json").load("data/flights.json")
Because JSON is structured data, Spark can easily infer the schema from this file and show the proper column names.
df.show(2)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 2 rows

df.printSchema()
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)
There is also a shorthand way to load JSON data in Spark.
df2 = spark.read.json("data/flights.json")
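By default, Spark expects line-delimited JSON, i.e. one complete JSON object per line. If your file is instead a single pretty-printed document or a top-level JSON array, the reader's "multiLine" option handles it. A minimal sketch (the file name here is only an illustration, not part of the example data set):

# By default Spark reads JSON Lines (one object per line).
# For a pretty-printed file or a top-level JSON array,
# enable multiLine so Spark parses the whole file as one document.
df_multi = spark.read.option("multiLine", "true").json("data/flights_multiline.json")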
If you want to learn more about handling JSON data in Spark, you can read about it here: Reading JSON data in Spark.
Reading CSV files in Spark
We can use the same read command, with the format set to "csv", to read CSV files in Spark. But you will see one problem with CSV files.
df_csv = spark.read.format("csv").load("data/flights.csv")

df_csv.show(2)
+-----------------+-------------------+-----+
|              _c0|                _c1|  _c2|
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
|    United States|            Romania|   15|
+-----------------+-------------------+-----+
only showing top 2 rows
Here we can see that Spark cannot read the columns properly from the CSV file: the header row is treated as a regular data row. But there is an easy workaround for this.
df_csv = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("data/flights.csv")

df_csv.show(2)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
By setting "inferSchema" and "header" to true, we let Spark know that the first row of the data is a header and that it should scan the data to infer a schema for this DataFrame. By default, both of these options are false.
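To see why both options matter, here is a quick sketch (using the same flights.csv) with only "header" set: Spark picks up the column names, but without "inferSchema" every column is read as a string.

# Header only, no schema inference: column names are correct,
# but every column defaults to the string type.
df_header_only = spark.read.format("csv") \
    .option("header", "true") \
    .load("data/flights.csv")

df_header_only.printSchema()
# root
#  |-- DEST_COUNTRY_NAME: string (nullable = true)
#  |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
#  |-- count: string (nullable = true)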
There are many options that we can set while reading data. If you do not want to chain multiple "option" calls, there is also a shorthand way that we can use.
df_csv1 = spark.read.csv("data/flights.csv", inferSchema="true", header="true")
If you want to learn more about reading CSV data in Spark, you can read about it at Read CSV Data in Spark.
Reading TSV (tab-separated) or "|" (pipe-separated) data
What if our data is neither JSON nor CSV? Our data could be tab-separated, pipe ("|") separated, or use any other character. We can read such files using the same CSV read command; we only need to specify our field separator character.
df_tsv = spark.read.csv("data/flights.tsv", sep="\t", inferSchema="true", header="true")

df_tsv.show(2)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 2 rows
In our read command, you can see that we have specified the "sep" option. This lets Spark know which separator is used between fields in the file. You can use the same command with "|" (pipe) or any other separator character to read such data in Spark.
df_pipe = spark.read \
    .format("csv") \
    .option("sep", "|") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("data/flights_pipe.txt")
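If you would rather keep all of these settings in one place, the reader also accepts them as keyword arguments through "options". A minimal sketch, reading the same pipe-separated file:

# Collect the reader options in a dict and unpack them in one call.
csv_opts = {"sep": "|", "inferSchema": "true", "header": "true"}

df_pipe2 = spark.read.format("csv") \
    .options(**csv_opts) \
    .load("data/flights_pipe.txt")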
Conclusion
In this blog, we have learned how to load data from files into Spark DataFrames. We have also hinted to Spark that it should try to infer the schema from our files. In the next blog, we will learn how to specify our own schema and use it when reading data.