Working with microsecond timestamps in PySpark

I've found a workaround for this using the to_utc_timestamp function in PySpark. I'm not entirely sure it is the most efficient, but it works fine on about 100 million rows of data. You can avoid the regexp_replace step if your timestamp string already looks like this: 1997-02-28 10:30:40.897748

from pyspark.sql.functions import regexp_replace, to_utc_timestamp

df = spark.createDataFrame([('19970228-10:30:40.897748',)], ['new_t'])
# Rewrite '19970228-...' into '1997-02-28 ...' so Spark can cast it to a timestamp
df = df.withColumn('t', regexp_replace('new_t', '^(.{4})(.{2})(.{2})-', '$1-$2-$3 '))
df = df.withColumn("time", to_utc_timestamp(df.t, "UTC").alias('t'))
df.show(5, False)
print(df.dtypes)
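
As noted above, the regexp_replace step can be skipped when the strings already look like 1997-02-28 10:30:40.897748; a minimal sketch of that simpler case (df2 is just an illustrative name):

from pyspark.sql.functions import to_utc_timestamp

# The implicit string -> timestamp cast keeps the fractional part,
# so no reformatting of the string is needed here.
df2 = spark.createDataFrame([('1997-02-28 10:30:40.897748',)], ['t'])
df2 = df2.withColumn("time", to_utc_timestamp(df2.t, "UTC"))
df2.show(truncate=False)
print(df2.dtypes)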

Converts a time string with the given pattern ('yyyy-MM-dd HH:mm:ss' by default) to a Unix timestamp (in seconds), using the default timezone and the default locale; returns null if parsing fails.

If `timestamp` is None, it returns the current timestamp.

   >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
   >>> time_df = spark.createDataFrame([('2015-04-08',)], ['dt'])
   >>> time_df.select(unix_timestamp('dt', 'yyyy-MM-dd').alias('unix_time')).collect()
   [Row(unix_time=1428476400)]
   >>> spark.conf.unset("spark.sql.session.timeZone")


A usage example:

import pyspark.sql.functions as F

res = df.withColumn(colName, F.unix_timestamp(F.col(colName),
   format='yyyy-MM-dd HH:mm:ss.000').alias(colName))
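
For context on why the fraction is handled separately in these answers: unix_timestamp() returns whole seconds as a LongType, so the microseconds are dropped. A quick illustrative check (df_check and epoch_s are made-up names):

import pyspark.sql.functions as F

df_check = spark.createDataFrame([('1997-02-28 10:30:40.897748',)], ['t'])
# The result column is a plain long count of seconds, with no fractional part.
df_check.select(F.unix_timestamp('t', 'yyyy-MM-dd HH:mm:ss').alias('epoch_s')).printSchema()
# root
#  |-- epoch_s: long (nullable = true)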

Normally timestamp granularity is in seconds, so I do not think there is a direct method to keep millisecond granularity. In your example the problem is that the time is of type string, so first you need to convert it to a timestamp type; this can be done with:

from pyspark.sql.functions import to_timestamp

res = time_df.withColumn("new_col", to_timestamp("dt", "yyyyMMdd-HH:mm:ss"))

Finally, to create a column with the fractional seconds (the microseconds):

res2 = res.withColumn("ms", F.split(res['dt'], '[.]').getItem(1))
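
Putting the two steps together, a minimal end-to-end sketch, assuming the input looks like the question's 20190111-08:15:45.275753 strings (the q_df, ts and micros names are just illustrative):

import pyspark.sql.functions as F

q_df = spark.createDataFrame([('20190111-08:15:45.275753',)], ['dt'])
q_res = (q_df
         # item 0 = date and time down to the second, item 1 = the fraction
         .withColumn("micros", F.split(F.col("dt"), '[.]').getItem(1))
         .withColumn("ts", F.to_timestamp(F.split(F.col("dt"), '[.]').getItem(0),
                                          "yyyyMMdd-HH:mm:ss")))
q_res.show(truncate=False)

On Spark 3.x it may also be possible to keep the microseconds directly with a fractional pattern such as yyyyMMdd-HH:mm:ss.SSSSSS, since TimestampType stores microseconds internally; treat that as something to verify on your Spark version.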

Suggestion : 2

In PySpark SQL, unix_timestamp() is used to get the current time and to convert a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the current timezone of the system. It returns null if the input is a string that cannot be cast to a Date or Timestamp. PySpark SQL provides several Date & Timestamp functions, so keep an eye on and understand these. There are two ways to get the current date in PySpark: only the date, or the date together with the time.
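
A small sketch of both uses described above; the ts_str column name and the example value are just illustrative:

from pyspark.sql.functions import unix_timestamp, from_unixtime, col

df = spark.createDataFrame([("2019-01-11 08:15:45",)], ["ts_str"])
df = (df
      .withColumn("now_epoch", unix_timestamp())                                    # current time, in seconds
      .withColumn("epoch", unix_timestamp(col("ts_str"), "yyyy-MM-dd HH:mm:ss"))    # string -> epoch seconds
      .withColumn("readable", from_unixtime(col("epoch"), "yyyy-MM-dd HH:mm:ss")))  # epoch seconds -> string
df.show(truncate=False)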



Suggestion : 3

I have a PySpark dataframe with the following time format: 20190111-08:15:45.275753. I want to convert this to timestamp format, keeping the microsecond granularity. However, it appears to be difficult to keep the microseconds, as all time conversions in PySpark produce seconds.

Do you have a clue on how this can be done? Note that converting it to pandas etc. will not work, as the dataset is huge, so I need an efficient way of doing this. An example of how I am doing this is below:

from pyspark.sql.functions import unix_timestamp, col

time_df = spark.createDataFrame([('20150408-01:12:04.275753',)], ['dt'])
res = time_df.withColumn("time", unix_timestamp(col("dt"),
   format='yyyyMMdd-HH:mm:ss.000').alias("time"))
res.show(5, False)


Suggestion : 4

This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE: it takes a timestamp which is timezone-agnostic, interprets it as a timestamp in the given timezone, and renders that timestamp as a timestamp in UTC. However, timestamp in Spark represents the number of microseconds from the Unix epoch, which is not timezone-agnostic, so in Spark this function just shifts the timestamp value from the given timezone to UTC. The function may return a confusing result if the input is a string with a timezone, e.g. '2018-03-13T06:18:23+00:00'. The reason is that Spark first casts the string to a timestamp according to the timezone in the string, and finally displays the result by converting the timestamp to a string according to the session local timezone.

   >>> df = spark.createDataFrame([('1997-02-28 10:30:00', 'JST')], ['ts', 'tz'])
   >>> df.select(to_utc_timestamp(df.ts, "PST").alias('utc_time')).collect()
   [Row(utc_time=datetime.datetime(1997, 2, 28, 18, 30))]
   >>> df.select(to_utc_timestamp(df.ts, df.tz).alias('utc_time')).collect()
   [Row(utc_time=datetime.datetime(1997, 2, 28, 1, 30))]
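
For comparison, the companion function from_utc_timestamp goes the other way: it interprets the timestamp as UTC and renders it in the given zone. A brief sketch, assuming the same '1997-02-28 10:30:00' input:

   >>> df = spark.createDataFrame([('1997-02-28 10:30:00', 'JST')], ['ts', 'tz'])
   >>> df.select(from_utc_timestamp(df.ts, "PST").alias('local_time')).collect()
   [Row(local_time=datetime.datetime(1997, 2, 28, 2, 30))]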

Suggestion : 5

In this post, I've consolidated the complete list of Date and Timestamp functions, with a description and example of some commonly used ones; you can find the complete list on the following blog. PySpark SQL provides several Date & Timestamp functions, so keep an eye on and understand these. Always choose these functions instead of writing your own functions (UDFs), as the built-ins are compile-time safe, handle null, and perform better than PySpark UDFs. If your PySpark application is performance-critical, avoid custom UDFs at all costs, as they give no performance guarantees. These functions return null if the input is a string that cannot be cast to a Date or Timestamp; to_timestamp(), for example, converts a string timestamp to the Timestamp type (a brief sketch of it appears after the datediff example below).

Before you use any of the examples below, make sure you create a PySpark SparkSession and import the SQL functions.

from pyspark.sql.functions import *

Following are the most used PySpark SQL Date and Timestamp functions, with examples; you can use these on DataFrames and in SQL expressions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Create SparkSession
spark = SparkSession.builder \
   .appName('SparkByExamples.com') \
   .getOrCreate()

data = [
   ["1", "2020-02-01"],
   ["2", "2019-03-01"],
   ["3", "2021-03-01"]
]
df = spark.createDataFrame(data, ["id", "input"])
df.show()

#Result
+---+----------+
| id|     input|
+---+----------+
|  1|2020-02-01|
|  2|2019-03-01|
|  3|2021-03-01|
+---+----------+

Use current_date() to get the current system date. By default, the date is returned in yyyy-MM-dd format.

#current_date()
df.select(current_date().alias("current_date")).show(1)

#Result
+------------+
|current_date|
+------------+
|  2021-02-22|
+------------+

The below example converts a string in the date format yyyy-MM-dd to DateType using to_date(). You can also use this to convert from any specific input format; PySpark supports all patterns supported by Java DateTimeFormatter.

#to_date()
df.select(col("input"),
   to_date(col("input"), "yyyy-MM-dd").alias("to_date")
).show()

#Result
+----------+----------+
|     input|   to_date|
+----------+----------+
|2020-02-01|2020-02-01|
|2019-03-01|2019-03-01|
|2021-03-01|2021-03-01|
+----------+----------+

The below example returns the difference between two dates using datediff().

#datediff()
df.select(col("input"),
   datediff(current_date(), col("input")).alias("datediff")
).show()

#Result
+----------+--------+
|     input|datediff|
+----------+--------+
|2020-02-01|     387|
|2019-03-01|     724|
|2021-03-01|      -7|
+----------+--------+
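
The intro above also mentions to_timestamp(), which converts a string timestamp to the Timestamp type; a brief sketch on the same df (the date-only pattern is assumed here, so the times default to midnight):

#to_timestamp()
df.select(col("input"),
   to_timestamp(col("input"), "yyyy-MM-dd").alias("to_timestamp")
).printSchema()

#Result
#root
# |-- input: string (nullable = true)
# |-- to_timestamp: timestamp (nullable = true)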

Suggestion : 6

The hour, minute, and second fields have standard ranges: 0–23 for hours and 0–59 for minutes and seconds. Spark supports fractional seconds with up to microsecond precision; the valid range for fractions is from 0 to 999,999 microseconds. Spark also lets you create Datasets from existing collections of external objects on the driver side, with columns of the corresponding types; instances of external types are converted to semantically equivalent internal representations, so you can, for example, create a Dataset with DATE and TIMESTAMP columns from Python collections. Similarly, you can construct timestamp values using the MAKE_TIMESTAMP function. Like MAKE_DATE, it validates the date fields and additionally accepts the time fields HOUR (0-23), MINUTE (0-59) and SECOND (0-60). SECOND has the type Decimal(precision = 8, scale = 6) because seconds can be passed with a fractional part up to microsecond precision.
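
A short sketch of MAKE_TIMESTAMP as described above, run through spark.sql(); this assumes Spark 3.0 or later, where make_timestamp is available, and the literal values are just illustrative:

# SECOND accepts a fractional part up to microsecond precision,
# so the .275753 below survives into the resulting timestamp.
spark.sql("SELECT make_timestamp(2019, 1, 11, 8, 15, 45.275753) AS ts").show(truncate=False)
# +--------------------------+
# |ts                        |
# +--------------------------+
# |2019-01-11 08:15:45.275753|
# +--------------------------+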

java.time.ZoneId.systemDefault
res0: java.time.ZoneId = America/Los_Angeles

java.sql.Timestamp.valueOf("1883-11-10 00:00:00").getTimezoneOffset / 60.0
res1: Double = 8.0

java.time.ZoneId.of("America/Los_Angeles").getRules.getOffset(java.time.LocalDateTime.parse("1883-11-10T00:00:00"))
res2: java.time.ZoneOffset = -07:52:58