Reading Parquet and ORC data in Spark

Parquet and ORC are columnar file formats that offer storage optimizations, such as compression and encoding, and faster processing for analytical workloads. Parquet is Spark’s default file format, and Spark also works well with ORC. In this blog, we are going to learn about reading Parquet and ORC data in Spark.

Reading Parquet data

Parquet stores the schema along with the data. So reading Parquet data in Spark is very easy, and we do not have to provide many options to get the desired result.

df = spark.read\
    .parquet("D:\\code\\spark\\spark-basics\\data\\flight-data\\parquet\\2010-summary.parquet")
df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)

df.show(3)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
+-----------------+-------------------+-----+
only showing top 3 rows

Apart from this, there are a few slight variations for reading Parquet data as well.

df2 = spark.read\
    .format("parquet")\
    .load("D:\\code\\spark\\spark-basics\\data\\flight-data\\parquet\\2010-summary.parquet")
df2.count()

df3 = spark.read\
    .format("parquet")\
    .option("path","D:\\code\\spark\\spark-basics\\data\\flight-data\\parquet\\2010-summary.parquet")\
    .load()
df3.count()

Specifying Compression Type

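We can also specify the compression type for our data. Compression is chosen when the data is written; on read, Spark detects the codec from the Parquet file metadata, so the “compression” option in the read below does not change how the file is decoded. For Parquet files written by Spark, the default codec is “snappy” (set by spark.sql.parquet.compression.codec), not “gzip”. In recent Spark versions the Parquet writer accepts the values below; “bzip2” and “deflate” apply to text-based sources such as CSV and JSON, not to Parquet.

none
uncompressed
snappy
gzip
lzo
brotli
lz4
zstd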

df = spark.read\
    .option("compression", "gzip")\
    .parquet("D:\\code\\spark\\spark-basics\\data\\flight-data\\parquet\\2010-summary.parquet")
df.printSchema()


root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)

Reading ORC files in Spark

Although Spark is more optimized for the Parquet file format, it also handles the ORC file format well. ORC also stores schema information in the file, so reading ORC data is as easy as reading Parquet in Spark.

orc_df = spark.read\
    .orc("D:\\code\\spark\\spark-basics\\data\\flight-data\\orc\\2010-summary.orc")

orc_df2 = spark.read\
    .format("orc")\
    .option("path","D:\\code\\spark\\spark-basics\\data\\flight-data\\orc\\2010-summary.orc")\
    .load()
orc_df2.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)

Conclusion

In this blog, we have learned how to read Parquet and ORC files in Spark. The code from this blog is also available on git. I hope you have found this useful.
