Reading Parquet and ORC data in Spark

Updated On March 25, 2021 | By Mahesh Mogal

Parquet and ORC are columnar data formats that provide storage optimizations, such as compression and encoding, and faster processing, especially for analytical queries that read only a subset of columns. Spark's default file format is Parquet, and Spark also works well with ORC. In this blog, we are going to learn about reading Parquet and ORC data in Spark.

Reading Parquet data

Parquet stores the schema alongside the data, in each file's metadata. So reading Parquet data in Spark is very easy, and we do not have to provide many options to get the desired result.
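As a minimal sketch of such a read, assuming PySpark: the helper name `read_parquet`, the `demo` function, and the path `data/users.parquet` are hypothetical and only here for illustration.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only, not needed at runtime
    from pyspark.sql import DataFrame, SparkSession


def read_parquet(spark: "SparkSession", path: str) -> "DataFrame":
    """Read a Parquet file or directory; no schema is passed in,
    Spark takes it from the Parquet file metadata."""
    return spark.read.parquet(path)


def demo(path: str = "data/users.parquet") -> None:
    """Hypothetical path; needs pyspark and a local Spark runtime."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("read-parquet")
        .getOrCreate()
    )
    df = read_parquet(spark, path)
    df.printSchema()  # column names and types come from the file itself
    df.show(5)
    spark.stop()
```

Calling `demo()` with a path to real Parquet data should print the schema and the first few rows without any schema or parsing options being set.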

Apart from the basic read, Spark offers a few slight variations for reading Parquet data as well.
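A few such variations, sketched here under the assumption that we are using PySpark, are the generic `format(...).load(...)` form, a plain `load(...)` that relies on Parquet being the default source, and passing several paths at once. The function names and any paths passed to them are hypothetical.

```python
def read_with_format(spark, path):
    """Name the format explicitly, then load the path."""
    return spark.read.format("parquet").load(path)


def read_with_default_source(spark, path):
    """Rely on spark.sql.sources.default, which is "parquet"
    unless the session has been configured otherwise."""
    return spark.read.load(path)


def read_many(spark, *paths):
    """Read several Parquet paths into a single DataFrame."""
    return spark.read.parquet(*paths)
```

Each function takes an existing `SparkSession` as `spark`. All three return a regular DataFrame, so the choice is mostly a matter of style or of how the format name is supplied (for example, from configuration).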

Specifying Compression Type

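We can also specify the compression type used for our data. Compression is chosen when writing; when reading, Spark detects the codec from the file metadata, so no option is needed. For Parquet, the default is "snappy". Below are the acceptable values for Parquet (bzip2 and deflate are accepted for text-based sources such as CSV and JSON, not for Parquet).

  • uncompressed
  • snappy
  • gzip
  • lzo
  • brotli
  • lz4
  • zstd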
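To make this concrete, here is a small sketch, assuming PySpark, that sets the codec through the writer's `compression` option and then reads the data back without naming the codec. The helper `write_parquet`, the `demo` function, the sample rows, and the output path are all hypothetical.

```python
def write_parquet(df, path, codec):
    """Write df as Parquet using the given compression codec."""
    (
        df.write
        .mode("overwrite")              # replace any existing output at path
        .option("compression", codec)   # e.g. "gzip" or "snappy"
        .parquet(path)
    )


def demo(path="out/users_gzip"):
    """Hypothetical path and data; needs pyspark and a local Spark runtime."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("parquet-compression")
        .getOrCreate()
    )
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    write_parquet(df, path, "gzip")
    # No compression option on read: the codec is recorded in the files.
    spark.read.parquet(path).show()
    spark.stop()
```

The same codec can also be set once per session through the `spark.sql.parquet.compression.codec` configuration instead of on every write.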

Reading ORC files in Spark

Though Spark is more optimized to work with the Parquet file format, it also understands the ORC file format well. ORC also stores schema information within the file, so reading ORC data is as easy as reading Parquet in Spark.
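A matching sketch for ORC, again assuming PySpark, with hypothetical helper names and a hypothetical path `data/users.orc`:

```python
def read_orc(spark, path):
    """Read an ORC file or directory; the schema comes from the file."""
    return spark.read.orc(path)


def read_orc_with_format(spark, path):
    """Equivalent read using the generic format/load form."""
    return spark.read.format("orc").load(path)


def demo(path="data/users.orc"):
    """Hypothetical path; needs pyspark and a local Spark runtime."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("read-orc")
        .getOrCreate()
    )
    df = read_orc(spark, path)
    df.printSchema()
    df.show(5)
    spark.stop()
```

As with Parquet, no schema has to be supplied, and both forms return an ordinary DataFrame.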

Conclusion

In this blog, we have learned to read the Parquet and ORC file formats in Spark. You can find the code written in this blog on git as well. I hope you have found this useful.
