tecton.declarative.DatetimePartitionColumn
class tecton.declarative.DatetimePartitionColumn(column_name, datepart, zero_padded=False, format_string=None)
Helper class to tell Tecton how underlying flat files are date/time partitioned for Hive/Glue data sources. This can translate into a significant performance increase.
You will generally include an object of this class in the datetime_partition_columns option in a HiveConfig object.
Methods
__init__(column_name, datepart, zero_padded=False, format_string=None)
Instantiates a new DatetimePartitionColumn configuration object.
- Parameters
  column_name – The name of the column in the Glue/Hive schema that corresponds to the underlying date/time partition folder. Note that if you do not explicitly specify a name in your partition folders, Glue names the column in the form partition_0.
  datepart (str) – The part of the date that this column specifies. Can be one of “year”, “month”, “day”, “hour”, or the full “date”. If used with format_string, this should be the size of the partition being represented, e.g. datepart="month" for format_string="%Y-%m".
  zero_padded (bool) – Whether the datepart has a leading zero if less than two digits. This must be set to True if datepart="date". (Should not be set if format_string is set.)
  format_string (Optional[str]) – A datetime.strftime format string override for “non-default” partition column formats. E.g. "%Y%m%d" for datepart="date" instead of the Tecton default "%Y-%m-%d", or "%Y-%m" for datepart="month" instead of the Tecton default "%m".
  IMPORTANT: This format string must convert Python datetimes (via datetime.strftime(format)) to strings that are sortable in time order. For example, "%m-%Y" would be an invalid format string because "09-2019" > "05-2020".
  See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes for format codes.
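The sortability requirement on format_string can be checked by hand before writing a definition. The following is a minimal sketch using only the Python standard library; the helper name is_time_sortable and the sample dates are illustrative assumptions and are not part of the Tecton API. It only tests the samples you pass in, so it is a sanity check rather than a proof.

```python
from datetime import datetime


def is_time_sortable(format_string, datetimes):
    """Hypothetical helper (not a Tecton function).

    Returns True if, for the given sample datetimes, string order of the
    formatted values matches chronological order.
    """
    ordered = sorted(datetimes)
    formatted = [dt.strftime(format_string) for dt in ordered]
    return formatted == sorted(formatted)


# Sample dates taken from the example in the parameter description.
samples = [datetime(2019, 9, 1), datetime(2020, 5, 1)]

print(is_time_sortable("%Y-%m", samples))  # True: "2019-09" < "2020-05"
print(is_time_sortable("%m-%Y", samples))  # False: "09-2019" > "05-2020"
```

Formats that put the largest unit first (year, then month, then day) with fixed-width fields pass this check for the samples above.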
Example definitions:
Assume you have an S3 bucket with parquet files stored in the following structure: s3://mybucket/2022/05/04/<multiple parquet files>, where 2022 is the year, 05 is the month, and 04 is the day of the month. In this scenario, you could use the following definition:

    datetime_partition_columns = [
        DatetimePartitionColumn(column_name="partition_0", datepart="year", zero_padded=True),
        DatetimePartitionColumn(column_name="partition_1", datepart="month", zero_padded=True),
        DatetimePartitionColumn(column_name="partition_2", datepart="day", zero_padded=True),
    ]
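To make the role of zero_padded concrete, the sketch below shows what the year/month/day folder values look like for one timestamp, with and without leading zeros. The function partition_path is a hypothetical illustration written for this page; it is not how Tecton builds partition filters internally, which this reference does not describe.

```python
from datetime import datetime


def partition_path(dt, zero_padded=True):
    """Hypothetical helper: render a datetime as a year/month/day folder path."""
    if zero_padded:
        # Two-digit month and day, matching folders such as 2022/05/04.
        return f"{dt.year:04d}/{dt.month:02d}/{dt.day:02d}"
    # No leading zeros, matching folders such as 2022/5/4.
    return f"{dt.year}/{dt.month}/{dt.day}"


dt = datetime(2022, 5, 4)
print(partition_path(dt))                     # 2022/05/04
print(partition_path(dt, zero_padded=False))  # 2022/5/4
```

The bucket layout in the example above corresponds to the first form, which is why each column sets zero_padded=True.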
Example using the format_string parameter. Assume your data is partitioned by "YYYY-MM", e.g. s3://mybucket/2022-05/<multiple parquet files>. Tecton’s default month format is "%m", which would not produce strings that are comparable to the values in your table’s partition column, so the definition needs to specify an override:

    datetime_partition_columns = [
        DatetimePartitionColumn(column_name="partition_1", datepart="month", format_string="%Y-%m"),
    ]
- Returns
DatetimePartitionColumn instantiation