tecton.declarative.DatetimePartitionColumn
class tecton.declarative.DatetimePartitionColumn(column_name, datepart, zero_padded=False, format_string=None)
Helper class to tell Tecton how underlying flat files are date/time partitioned for Hive/Glue data sources. This can translate into a significant performance increase.
You will generally include an object of this class in the datetime_partition_columns option in a HiveConfig object.
Methods
__init__(column_name, datepart, zero_padded=False, format_string=None)
Instantiates a new DatetimePartitionColumn configuration object.
- Parameters
column_name – The name of the column in the Glue/Hive schema that corresponds to the underlying date/time partition folder. Note that if you do not explicitly specify a name in your partition folders, Glue names the column in the form partition_0.
datepart (str) – The part of the date that this column specifies. Can be one of “year”, “month”, “day”, “hour”, or the full “date”. If used with format_string, this should be the size of the partition being represented, e.g. datepart="month" for format_string="%Y-%m".
zero_padded (bool) – Whether the datepart has a leading zero if less than two digits. This must be set to True if datepart="date". (Should not be set if format_string is set.)
format_string (Optional[str]) – A datetime.strftime format string override for “non-default” partition column formats. E.g. "%Y%m%d" for datepart="date" instead of the Tecton default "%Y-%m-%d", or "%Y-%m" for datepart="month" instead of the Tecton default "%m".
IMPORTANT: This format string must convert Python datetimes (via datetime.strftime(format)) to strings that are sortable in time order. For example, "%m-%Y" would be an invalid format string because "09-2019" > "05-2020".
See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes for format codes.
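The sortability requirement on format_string can be checked with the Python standard library alone. The sketch below is not part of Tecton: the helper name is_time_sortable is hypothetical, and it only tests a format against a handful of sample dates rather than proving the property in general.

```python
from datetime import datetime


def is_time_sortable(format_string, samples):
    """Return True if formatting the samples preserves their time order.

    Hypothetical helper for illustration only; not part of the Tecton API.
    """
    ordered = sorted(samples)
    formatted = [dt.strftime(format_string) for dt in ordered]
    # If string order matches time order, the formatted list is already sorted.
    return formatted == sorted(formatted)


samples = [datetime(2019, 9, 1), datetime(2020, 5, 1), datetime(2020, 11, 1)]

print(is_time_sortable("%Y-%m", samples))  # year first: time order is preserved
print(is_time_sortable("%m-%Y", samples))  # month first: "09-2019" > "05-2020"
```

Formats that put the largest unit first and zero-pad every field tend to pass this check; formats that lead with a smaller unit generally do not.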
Example definitions:
Assume you have an S3 bucket with parquet files stored in the following structure: s3://mybucket/2022/05/04/<multiple parquet files>, where 2022 is the year, 05 is the month, and 04 is the day of the month. In this scenario, you could use the following definition:

    datetime_partition_columns = [
        DatetimePartitionColumn(column_name="partition_0", datepart="year", zero_padded=True),
        DatetimePartitionColumn(column_name="partition_1", datepart="month", zero_padded=True),
        DatetimePartitionColumn(column_name="partition_2", datepart="day", zero_padded=True),
    ]
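To see how the year, month, and day columns line up with the folder layout, the sketch below derives the partition values for a single timestamp. It is an illustration only: partition_value is a hypothetical helper, and the formatting rule (two-digit padding when zero_padded=True, plain integers otherwise) is an assumption based on the parameter descriptions, not Tecton's actual implementation.

```python
from datetime import datetime


def partition_value(dt, datepart, zero_padded):
    """Format one date part of dt as a partition folder name.

    Hypothetical helper; assumes two-digit padding when zero_padded is True.
    """
    value = {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
    }[datepart]
    return f"{value:02d}" if zero_padded else str(value)


dt = datetime(2022, 5, 4)
parts = [partition_value(dt, p, zero_padded=True) for p in ("year", "month", "day")]
print("s3://mybucket/" + "/".join(parts))  # s3://mybucket/2022/05/04
```

With zero_padded=False the same timestamp would yield 5 and 4 for the month and day, which would not match folders named 05 and 04.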
Example using the format_string parameter. Assume your data is partitioned by "YYYY-MM", e.g. s3://mybucket/2022-05/<multiple parquet files>. Tecton’s default month format is "%m", which would produce strings that cannot be compared with the values in your table’s partition column, so the definition needs to specify an override:

    datetime_partition_columns = [
        DatetimePartitionColumn(column_name="partition_1", datepart="month", format_string="%Y-%m"),
    ]
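The mismatch between the default month format and a "YYYY-MM" partition can be seen directly with datetime.strftime. This is a short standard-library sketch; the variable names are illustrative and the folder name is taken from the example layout.

```python
from datetime import datetime

folder_name = "2022-05"  # partition folder from the example layout
dt = datetime(2022, 5, 4)

default_month = dt.strftime("%m")  # the default month format described above
overridden = dt.strftime("%Y-%m")  # the format_string override

print(default_month == folder_name)  # False: "05" does not match "2022-05"
print(overridden == folder_name)     # True
```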
- Returns
A DatetimePartitionColumn instance.