Running SQL queries on Spark DataFrames
SQL (Structured Query Language) is one of the most popular ways to process and analyze data among developers and analysts. Because of its popularity, Spark supports SQL out of the box when working with DataFrames. We do not have to do anything different to use the power and familiarity of SQL while working with Spark. In this article, we will learn how to run SQL queries on Spark DataFrames and how to create a DataFrame from a SQL query result.
Creating a Table From a DataFrame
Before we can run queries on a DataFrame, we need to register it as a temporary view in our Spark session. These views are defined for the current session only and are dropped when the Spark session ends.
```python
df = spark.read\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .csv("D:\\code\\spark\\spark-basics\\data\\flight-data\\csv\\2010-summary.csv")

df.count()
# 255

df.createOrReplaceTempView("flights")
```
Now that we have created a table for our DataFrame, we can run any SQL query on it. This is really powerful, as you can drop down to SQL whenever you are more comfortable with it.
```python
spark.sql("select count(*) from flights").show()

+--------+
|count(1)|
+--------+
|     255|
+--------+
```
Running more SQL queries on Spark DataFrames
Once we have created the table, we can run many of the same queries on our DataFrame that we could run on any SQL table. Below are some examples of queries on our DataFrame.
```python
# Select query
spark.sql("select * from flights limit 5").show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
```
```python
# Adding a new column with CASE WHEN
spark.sql("""select *,
    CASE WHEN DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME
         THEN true ELSE false END as same_country
    from flights limit 5""").show()

+-----------------+-------------------+-----+------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|same_country|
+-----------------+-------------------+-----+------------+
|    United States|            Romania|    1|       false|
|    United States|            Ireland|  264|       false|
|    United States|              India|   69|       false|
|            Egypt|      United States|   24|       false|
|Equatorial Guinea|      United States|    1|       false|
+-----------------+-------------------+-----+------------+
```
```python
# Running subqueries
spark.sql("""select * from flights
    where count = (select max(count) from flights)""").show()

+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|348113|
+-----------------+-------------------+------+
```
```python
# Average number of flights going from USA to other countries
spark.sql("""select avg(count) as avg_flights_from_usa from flights
    where DEST_COUNTRY_NAME != 'United States'
      and ORIGIN_COUNTRY_NAME = 'United States'""").show()

+--------------------+
|avg_flights_from_usa|
+--------------------+
|  301.10483870967744|
+--------------------+
```
You can see how powerful this is. With SQL, we can easily run complex analytics queries on DataFrames.
Converting SQL Results to DataFrames
You have converted your DataFrame to a table and run your queries. Now you want to come back to the DataFrame world. You will need to do this when you want to save your results, or when you want to use operations that are only available on DataFrames (such as repartition and coalesce). We can easily convert our query result to a Spark DataFrame.
```python
df2 = spark.sql("""select *,
    CASE WHEN DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME
         THEN true ELSE false END as same_country
    from flights""")

df2.show(3)

+-----------------+-------------------+-----+------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|same_country|
+-----------------+-------------------+-----+------------+
|    United States|            Romania|    1|       false|
|    United States|            Ireland|  264|       false|
|    United States|              India|   69|       false|
+-----------------+-------------------+-----+------------+
```
We can see that we have got a DataFrame back, and we can perform any DataFrame operation on top of it. This is the power of Spark: you can use either DataFrames or SQL queries to get your job done, and you can switch between the two with no issues.
Conclusion
In this article, we have learned how to run SQL queries on Spark DataFrames. This adds the flexibility to use either DataFrame functions or SQL queries to process data. You can find the code written in this blog at GitHub. If you have any questions, let me know. I hope you have found this useful. See you in the next article. Until then, keep learning.