Working With Timestamps in Spark
Working with timestamps while processing data can be a headache sometimes. Luckily Spark has some in-built functions to make our life easier when working with timestamps. Let us go over these functions.
current_timestamp – Getting Current Timestamp
We can get current timestamp using current_timestamp function.
1 2 3 4 5 6 7 8 9 10 11 12 |
from pyspark.sql.functions import current_date,current_timestamp >>> >>> df = spark.range(2) \ ... .withColumn("current_date", current_date()) \ ... .withColumn("current_timestamp", current_timestamp()) >>> df.show() +---+------------+--------------------+ | id|current_date| current_timestamp| +---+------------+--------------------+ | 0| 2020-08-19|2020-08-19 11:21:...| | 1| 2020-08-19|2020-08-19 11:21:...| +---+------------+--------------------+ |
date_trunc – Truncated Date From Timestamp
We can get truncated date from timestamp using code below.
1 2 3 4 5 6 |
from pyspark.sql.functions import date_trunc df.select("current_timestamp", \ date_trunc("year", col("current_timestamp")), \ date_trunc("yyyy", col("current_timestamp")), \ date_trunc("month", col("current_timestamp")), \ ).show() |
Converting Timestamp to Unix Timestamp
Spark has provided function to convert timestamp to Unix Timestamp.
1 2 3 4 5 6 7 8 9 |
>>> from pyspark.sql.functions import unix_timestamp >>> df.select("current_timestamp", \ ... unix_timestamp(col("current_timestamp")) \ ... ).show() +--------------------+------------------------------------------------------+ | current_timestamp|unix_timestamp(current_timestamp, yyyy-MM-dd HH:mm:ss)| +--------------------+------------------------------------------------------+ |2020-08-19 11:27:...| 1597850861| |2020-08-19 11:27:...| 1597850861| |
Get Timestamp from Unix Timestamp
1 2 3 4 5 6 7 8 9 10 11 12 |
>>> df_unix = df.withColumn("unix_timestamp", unix_timestamp(col("current_timestamp"))) >>> >>> from pyspark.sql.functions import from_unixtime >>> df_unix.select("unix_timestamp", \ ... from_unixtime(col("unix_timestamp")) \ ... ).show() +--------------+--------------------------------------------------+ |unix_timestamp|from_unixtime(unix_timestamp, yyyy-MM-dd HH:mm:ss)| +--------------+--------------------------------------------------+ | 1597850920| 2020-08-19 11:28:40| | 1597850920| 2020-08-19 11:28:40| +--------------+--------------------------------------------------+ |
Get Timestamp from Date
We can also get from timestamp using to_timestamp function
1 2 3 4 5 6 7 8 9 10 |
>>> from pyspark.sql.functions import to_timestamp,col >>> df.select("current_date", \ ... to_timestamp(col("current_date"), "yyyy-MM-dd") \ ... ).show() +------------+------------------------------------------+ |current_date|to_timestamp(`current_date`, 'yyyy-MM-dd')| +------------+------------------------------------------+ | 2020-08-19| 2020-08-19 00:00:00| | 2020-08-19| 2020-08-19 00:00:00| +------------+------------------------------------------+ |
I hope you found this helpful. See you in in next blog.