Working With Timestamps in Spark
Working with timestamps while processing data can be a headache sometimes. Luckily Spark has some in-built functions to make our life easier when working with timestamps. Let us go over these functions.
current_timestamp – Getting Current Timestamp
We can get current timestamp using current_timestamp function.
1
2
3
4
5
6
7
8
9
10
11
12
|
from pyspark.sql.functions import current_date,current_timestamp
>>>
>>> df = spark.range(2) \
... .withColumn("current_date", current_date()) \
... .withColumn("current_timestamp", current_timestamp())
>>> df.show()
+---+------------+--------------------+
| id|current_date| current_timestamp|
+---+------------+--------------------+
| 0| 2020-08-19|2020-08-19 11:21:...|
| 1| 2020-08-19|2020-08-19 11:21:...|
+---+------------+--------------------+
|
date_trunc – Truncated Date From Timestamp
We can get truncated date from timestamp using code below.
1
2
3
4
5
6
|
from pyspark.sql.functions import date_trunc
df.select("current_timestamp", \
date_trunc("year", col("current_timestamp")), \
date_trunc("yyyy", col("current_timestamp")), \
date_trunc("month", col("current_timestamp")), \
).show()
|
Converting Timestamp to Unix Timestamp
Spark has provided function to convert timestamp to Unix Timestamp.
1
2
3
4
5
6
7
8
9
|
>>> from pyspark.sql.functions import unix_timestamp
>>> df.select("current_timestamp", \
... unix_timestamp(col("current_timestamp")) \
... ).show()
+--------------------+------------------------------------------------------+
| current_timestamp|unix_timestamp(current_timestamp, yyyy-MM-dd HH:mm:ss)|
+--------------------+------------------------------------------------------+
|2020-08-19 11:27:...| 1597850861|
|2020-08-19 11:27:...| 1597850861|
|
Get Timestamp from Unix Timestamp
1
2
3
4
5
6
7
8
9
10
11
12
|
>>> df_unix = df.withColumn("unix_timestamp", unix_timestamp(col("current_timestamp")))
>>>
>>> from pyspark.sql.functions import from_unixtime
>>> df_unix.select("unix_timestamp", \
... from_unixtime(col("unix_timestamp")) \
... ).show()
+--------------+--------------------------------------------------+
|unix_timestamp|from_unixtime(unix_timestamp, yyyy-MM-dd HH:mm:ss)|
+--------------+--------------------------------------------------+
| 1597850920| 2020-08-19 11:28:40|
| 1597850920| 2020-08-19 11:28:40|
+--------------+--------------------------------------------------+
|
Get Timestamp from Date
We can also get from timestamp using to_timestamp function
1
2
3
4
5
6
7
8
9
10
|
>>> from pyspark.sql.functions import to_timestamp,col
>>> df.select("current_date", \
... to_timestamp(col("current_date"), "yyyy-MM-dd") \
... ).show()
+------------+------------------------------------------+
|current_date|to_timestamp(`current_date`, 'yyyy-MM-dd')|
+------------+------------------------------------------+
| 2020-08-19| 2020-08-19 00:00:00|
| 2020-08-19| 2020-08-19 00:00:00|
+------------+------------------------------------------+
|
I hope you found this helpful. See you in in next blog.