tecton.DatetimePartitionColumn
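Summary
Helper class that tells Tecton how the underlying flat files of a Hive/Glue data source are partitioned by date/time. Declaring the partitioning can translate into a significant performance increase, because a read for a given time range can then be limited to the matching partitions.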
You will generally include an object of this class in the
datetime_partition_columns option in a
HiveConfig object.
Examples
Example 1
Assume you have an S3 bucket with parquet files stored in the following
structure: s3://mybucket/2022/05/04/<multiple parquet files> , where 2022 is
the year, 05 is the month, and 04 is the day of the month. In this scenario,
you could use the following definition:
from tecton import DatetimePartitionColumn, HiveConfig

# One entry per partition folder level: year, then month, then day.
datetime_partition_columns = [
    DatetimePartitionColumn(column_name="partition_0", datepart="year", zero_padded=True),
    DatetimePartitionColumn(column_name="partition_1", datepart="month", zero_padded=True),
    DatetimePartitionColumn(column_name="partition_2", datepart="day", zero_padded=True),
]

batch_config = HiveConfig(
    database="my_db",
    table="my_table",
    timestamp_field="timestamp",
    datetime_partition_columns=datetime_partition_columns,
)
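To see what zero_padded=True corresponds to in the folder layout above, here is a minimal sketch using only the Python standard library. It is an illustration, not Tecton's implementation; the helper names padded_folder and unpadded_folder are hypothetical.

```python
from datetime import datetime


def padded_folder(ts: datetime) -> str:
    """Hypothetical helper: year/month/day folder with zero-padded month and day."""
    return ts.strftime("%Y/%m/%d")


def unpadded_folder(ts: datetime) -> str:
    """Hypothetical helper: the same folder without zero padding."""
    return f"{ts.year}/{ts.month}/{ts.day}"


ts = datetime(2022, 5, 4)
print(padded_folder(ts))    # 2022/05/04
print(unpadded_folder(ts))  # 2022/5/4
```

The bucket in Example 1 uses the first form (05 and 04), which is why each column there is declared with zero_padded=True.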
Example 2
Example using the format_string parameter. Assume your data is partitioned by
"YYYY-MM", e.g. s3://mybucket/2022-05/<multiple parquet files>. Tecton’s
default month format is "%m", which produces strings such as "05" that cannot
be compared against partition values such as "2022-05", so the definition
needs to specify an override.
datetime_partition_columns = [
    DatetimePartitionColumn(column_name="partition_1", datepart="month", format_string="%Y-%m"),
]
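The following sketch illustrates why the formatted strings need to be comparable to the partition values. It assumes, purely for illustration, that partitions for a time range are selected by string comparison against formatted range bounds; this is a conceptual model rather than Tecton's actual query logic, and partition_values and partitions_in_range are hypothetical names.

```python
from datetime import datetime

# Hypothetical partition values for folders such as s3://mybucket/2022-05/.
partition_values = ["2022-03", "2022-04", "2022-05", "2022-06"]


def partitions_in_range(values, start, end, format_string):
    """Keep the values that fall between the formatted bounds (inclusive)."""
    lower = start.strftime(format_string)
    upper = end.strftime(format_string)
    return [v for v in values if lower <= v <= upper]


start, end = datetime(2022, 4, 10), datetime(2022, 5, 20)

# With the override, the bounds are "2022-04" and "2022-05".
with_override = partitions_in_range(partition_values, start, end, "%Y-%m")
print(with_override)  # ['2022-04', '2022-05']

# With "%m", the bounds are "04" and "05", which match none of the values.
with_default = partitions_in_range(partition_values, start, end, "%m")
print(with_default)  # []
```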
Attributes
The attributes are the same as the __init__ method parameters. See below.
Methods
__init__(...)
Method generated by attrs for class DatetimePartitionColumn.
Parameters
column_name (str) – The name of the column in the Glue/Hive schema that corresponds to the underlying date/time partition folder. Note that if you do not explicitly specify a name in your partition folders, Glue will name the column of the form partition_0.
datepart (str) – The part of the date that this column specifies. Can be one of “year”, “month”, “day”, “hour”, or the full “date”. If used with format_string, this should be the size of partition being represented, e.g. datepart="month" for format_string="%Y-%m".
zero_padded (bool) – Whether the datepart has a leading zero if less than two digits. This must be set to True if datepart="date". Should not be set if format_string is set. (Default: False)
format_string (Optional[str]) – A datetime.strftime format string override for “non-default” partition columns formats. E.g. "%Y%m%d" for datepart="date" instead of the Tecton default "%Y-%m-%d", or "%Y-%m" for datepart="month" instead of the Tecton default "%m". (Default: None)
This format string must convert Python datetimes (via
datetime.strftime(format)) to strings that sort in time order. For
example, "%m-%Y" would be an invalid format string because
"09-2019" > "05-2020" as strings, even though September 2019 precedes May 2020.
See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes for format codes.
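A candidate format string can be sanity-checked against the sortability requirement before it is used. The helper below is a hypothetical, standard-library-only sketch (is_time_sortable is not part of the Tecton SDK), and it only checks the sample datetimes it is given, so a passing result is indicative rather than a proof.

```python
from datetime import datetime


def is_time_sortable(format_string, samples):
    """Return True if formatting the samples preserves their time order."""
    ordered = sorted(samples)
    formatted = [ts.strftime(format_string) for ts in ordered]
    return formatted == sorted(formatted)


samples = [datetime(2019, 9, 1), datetime(2020, 5, 1)]

print(is_time_sortable("%Y-%m", samples))  # True
print(is_time_sortable("%m-%Y", samples))  # False
```

Choosing samples that span a year boundary, as here, is what exposes formats that put the month before the year.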