Running SQL queries on Spark DataFrames
SQL (Structured Query Language) is one of the most popular ways to process and analyze data among developers and analysts. Because of its popularity, Spark supports SQL out of the box when working with DataFrames. We do not have to do anything different to use the power and familiarity of SQL while working with Spark. In this article, we will learn how to run SQL queries on Spark DataFrames and how to create a DataFrame from a SQL query result.
Creating a Table From a DataFrame
Before we can run queries on a DataFrame, we need to register it as a temporary view in our Spark session. These views are defined for the current session only and are dropped when the Spark session ends.
```python
df = spark.read\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .csv("D:\\code\\spark\\spark-basics\\data\\flight-data\\csv\\2010-summary.csv")

df.count()
# 255

df.createOrReplaceTempView("flights")
```
Now that we have created a table for our DataFrame, we can run any SQL query on it. This is really powerful, as you can drop down to SQL whenever you are more comfortable with it.
```python
spark.sql("select count(*) from flights").show()

+--------+
|count(1)|
+--------+
|     255|
+--------+
```
Running more SQL queries on Spark DataFrames
Once we have created the table, we can run many of the same queries on our DataFrame that we could run on any SQL table. Below are some examples of queries on our DataFrame.
```python
# Select query
spark.sql("select * from flights limit 5").show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
```
```python
# Adding a new column with CASE WHEN
spark.sql("""select *,
    CASE WHEN DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME
         THEN true ELSE false END as same_country
    from flights limit 5""").show()

+-----------------+-------------------+-----+------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|same_country|
+-----------------+-------------------+-----+------------+
|    United States|            Romania|    1|       false|
|    United States|            Ireland|  264|       false|
|    United States|              India|   69|       false|
|            Egypt|      United States|   24|       false|
|Equatorial Guinea|      United States|    1|       false|
+-----------------+-------------------+-----+------------+
```
```python
# Running subqueries
spark.sql("""select * from flights
    where count = (select max(count) from flights)""").show()

+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|348113|
+-----------------+-------------------+------+
```
```python
# Average number of flights going from USA to other countries
spark.sql("""select avg(count) as avg_flights_from_usa from flights
    where DEST_COUNTRY_NAME != 'United States'
      and ORIGIN_COUNTRY_NAME = 'United States'""").show()

+--------------------+
|avg_flights_from_usa|
+--------------------+
|  301.10483870967744|
+--------------------+
```
You can see how powerful this is. With SQL, we can easily run complex analytics queries on DataFrames.
Converting SQL Results to DataFrames
You have converted your DataFrame to a table and run your queries. Now you want to come back to the DataFrame world. You will need to do this when you want to save your results, or when you want to use operations that are only available on DataFrames (such as repartition and coalesce). We can easily convert our query result to a Spark DataFrame.
```python
df2 = spark.sql("""select *,
    CASE WHEN DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME
         THEN true ELSE false END as same_country
    from flights""")

df2.show(3)

+-----------------+-------------------+-----+------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|same_country|
+-----------------+-------------------+-----+------------+
|    United States|            Romania|    1|       false|
|    United States|            Ireland|  264|       false|
|    United States|              India|   69|       false|
+-----------------+-------------------+-----+------------+
```
We can see that we have got a DataFrame back, and we can perform any DataFrame operation on top of it. This is the power of Spark: you can use either DataFrames or SQL queries to get your job done, and you can switch between the two with no issues.
Conclusion
In this article, we have learned how to run SQL queries on Spark DataFrames. This adds the flexibility to use either DataFrame functions or SQL queries to process data. You can find the code written in this blog at GitHub. If you have any questions, let me know. I hope you have found this useful. See you in the next article. Until then, keep learning.