Date Difference Functions in Spark
We may have a use case where we need to find the difference between two dates, or to find the date a certain number of days ("x" days) before or after a given date. For such use cases, Spark provides a rich set of functions that solve these problems easily. We will go over these functions in this blog.
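All of the snippets below run against a DataFrame df that already has a current_date column. The post does not show how that DataFrame was built, but a minimal setup sketch (assuming the pyspark shell, where a SparkSession named spark is already available, and using the built-in current_date() function) could look like this:
>>> from pyspark.sql.functions import current_date
>>> # two rows, each carrying today's date in a column named "current_date";
>>> # the 2020-08-19 values in the outputs below reflect when the post was written
>>> df = spark.range(2).select(current_date().alias("current_date"))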
date_add – Date After x Days
Say we want to get the date 7 days after today. We can get it using the Spark function date_add.
>>> from pyspark.sql.functions import date_add, date_sub, col
>>> df.select("current_date", \
...     date_add(col("current_date"), 7) \
...         .alias("date_after_7_days") \
... ).show()
+------------+-----------------+
|current_date|date_after_7_days|
+------------+-----------------+
|  2020-08-19|       2020-08-26|
|  2020-08-19|       2020-08-26|
+------------+-----------------+
date_sub – Date Before x Days
Similarly, we can use the date_sub function to find the date x days before a given date.
df.select("current_date", \ ... date_sub(col("current_date"), 7) \ ... .alias("date_before_7_days") \ ... ).show() +------------+------------------+ |current_date|date_before_7_days| +------------+------------------+ | 2020-08-19| 2020-08-12| | 2020-08-19| 2020-08-12| +------------+------------------+ |
add_months – Adding Months to Date
We can add months to a date using the add_months function.
>>> from pyspark.sql.functions import add_months
>>> df.select("current_date", \
...     add_months(col("current_date"), 3) \
... ).show()
+------------+---------------------------+
|current_date|add_months(current_date, 3)|
+------------+---------------------------+
|  2020-08-19|                 2020-11-19|
|  2020-08-19|                 2020-11-19|
+------------+---------------------------+
We can see that this function has added 3 months to our date and shown us the final result.
datediff – Finding Difference Between Dates in Days
We may need to find the difference between two dates. For such a use case, we can use the datediff function, which accepts two arguments and returns the difference in days between the first date and the second date.
>>> df_new = df.select("current_date", \
...     date_add(col("current_date"), 7) \
...         .alias("date_after_7_days"), \
...     date_sub(col("current_date"), 7) \
...         .alias("date_before_7_days") \
... )
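The snippet above only prepares df_new with the shifted date columns and does not actually compute a difference. A minimal sketch of the datediff call itself, reusing the df_new defined above, could look like this:
>>> from pyspark.sql.functions import datediff
>>> # datediff(end, start) returns the number of days from start to end
>>> df_new.select("current_date", \
...     datediff(col("date_after_7_days"), col("current_date")) \
...         .alias("diff_in_days") \
... ).show()
Since date_after_7_days is exactly 7 days after current_date, the diff_in_days column should come out as 7 for every row.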
months_between – Months between Dates
Though in most cases we need the difference between dates in days, Spark also provides the months_between function to find the difference between two dates in months.
>>> from pyspark.sql.functions import months_between, to_date, lit
>>> df_new.select( \
...     months_between(col("date_after_7_days"), to_date(lit("2020-09-30"))) \
... ).show()
+--------------------------------------------------------+
|months_between(date_after_7_days, to_date('2020-09-30'))|
+--------------------------------------------------------+
|                                             -1.12903226|
|                                             -1.12903226|
+--------------------------------------------------------+
Conclusion
We have gone through several useful functions for finding the difference between dates and for finding a date a certain number of days before or after a given date. I hope you found this useful.