Reading data from a file in Spark

Updated On February 12, 2021 | By Mahesh Mogal

Spark works with multiple file formats and data types. Before we can use Spark to process data, we need to load it into a Spark DataFrame. In this blog, we will learn how to load data from files into Spark DataFrames.

In this blog series, we are mostly going to use Python with Spark (PySpark), but you can adapt the same code to Scala, Java, or R. You can also download the data used in the examples at the GitHub link.

Reading JSON File in Spark

We can use the following simple command to read data from a JSON file in Spark.

Because every JSON record carries its own field names, Spark can infer the schema from the file and show proper column names.

There is also a shorthand way to load JSON data in Spark.

Reading CSV files in Spark

We can use the same read command with the format set to "csv" to read CSV files in Spark. But you will see one problem with CSV files.

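Without further options, Spark does not pick up the column names from the CSV file: it treats the header row as data and assigns default column names such as _c0 and _c1. But there is an easy workaround for this.

By setting "header" to true, we let Spark know that the first row of the data is a header. By setting "inferSchema" to true, we ask Spark to infer the column types from the data. By default both values are false, in which case every column is read as a string.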

There are many options that we can set while reading data. If you do not want to write multiple "option" statements, there is also a shorthand that we can use.

Reading TSV (Tab-separated) or "|" pipe separated data

What if our data is neither JSON nor CSV? The fields may be separated by tabs, by | (pipe), or by any other character. We can read such files using the same CSV read command. We only need to specify our field separator character.

In our read command, notice that we have specified the "sep" option. It tells Spark which separator is used between fields in the file. You can use the same command with "|" (pipe) or any other separator character to read such data in Spark.

Conclusion

In this blog, we learned how to load data from files into Spark DataFrames. We also told Spark to try to infer the schema from our file. In the next blog, we will learn how to specify our own schema and use it when reading data.

One comment on “Reading data from a file in Spark”

  1. Hi,
    Very good explanation of the topic.
    Could you help me read a text file while discarding the comment lines that start with '#'? Below is an excerpt of the text file:
    ------------------------------------------------------------------------------------------------------------------
    #Software: Microsoft Internet Information Services 6.0
    #Version: 1.0
    #Date: 2018-01-01 09:15:23
    #Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referrer)
    2018-01-01 00:01:03 198.51.100.2 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100 Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) http://www.bing.com
    2018-01-01 00:01:09 198.51.100.2 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100 Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) http://www.bing.com
    -------------------------------------------------------------------------------------------------------------------
