Overview
Tecton 0.4 was released in June 2022 and includes the following framework improvements and changes:
- Snowflake support
- API simplification & improvements
- Materialization info diffs
Snowflake Support
Tecton 0.4 includes compatibility with Snowflake for processing and storing features. Once connected to a Snowflake warehouse, users can define features in Snowflake SQL or Snowpark.
```python
@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="snowflake_sql",
    aggregation_interval=timedelta(days=1),
    aggregations=[
        Aggregation(column="TRANSACTION", function="sum", time_window=timedelta(days=1)),
        Aggregation(column="TRANSACTION", function="sum", time_window=timedelta(days=7)),
        Aggregation(column="TRANSACTION", function="sum", time_window=timedelta(days=40)),
        Aggregation(column="AMT", function="mean", time_window=timedelta(days=1)),
        Aggregation(column="AMT", function="mean", time_window=timedelta(days=7)),
        Aggregation(column="AMT", function="mean", time_window=timedelta(days=40)),
    ],
    online=True,
    feature_start_time=datetime(2020, 10, 10),
    description="User transaction totals over a series of time windows, updated daily.",
)
def user_transaction_metrics(transactions):
    return f"""
        SELECT
            USER_ID,
            1 as TRANSACTION,
            AMT,
            TIMESTAMP
        FROM
            {transactions}
        """
```
API Simplification and Improvements
0.4 includes a large set of changes to simplify and improve Tecton’s
declarative Feature Repository API.
SDK 0.4 maintains backwards compatibility through the `tecton.compat` submodule.
Users can upgrade from 0.3 to 0.4 without changing their Feature Repo by
importing Tecton objects from `tecton.compat` instead of `tecton`.
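For example, a repo that still uses 0.3-era names can run on the 0.4 SDK by changing only the import line (a sketch; the data source shown and its config values are illustrative, not part of these release notes):

```python
# Before (0.3):
# from tecton import BatchDataSource, HiveDSConfig

# After (0.4 SDK, keeping the 0.3-style definitions unchanged):
from tecton.compat import BatchDataSource, HiveDSConfig

# Illustrative data source -- database/table names are placeholders.
clicks = BatchDataSource(
    name="clicks",
    batch_ds_config=HiveDSConfig(
        database="demo",
        table="clicks",
        timestamp_column_name="timestamp",
    ),
)
```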
Functional Changes
- Removed `batch_window_aggregate_feature_view` and `stream_window_aggregate_feature_view` types.
  - `batch_feature_view` and `stream_feature_view` now support Tecton window aggregations.
  - Rationale: These object types overlapped significantly and unnecessarily increased the number of concepts that new users had to learn.
 
- Changes to materialization timestamp filtering.
  - During materialization, the output of Feature Views will now be automatically filtered to the materialization period (i.e. the window of time that is being backfilled or updated incrementally at steady state).
  - Data Sources no longer require a timestamp column to be defined because the time filter is now applied to the output of the Feature View.
  - Users have two options for optimizing query performance by pushing down timestamp filtering:
    - Handle time filtering with custom logic using the `materialization_context`.
    - Use `FilteredSource` to have Tecton automatically filter the Data Source to the correct period before the Feature View transformation is applied.
  - Rationale: Tecton's previous timestamp filtering logic worked well when a Feature View had exactly one Data Source and that Data Source had a timestamp column that was used directly as the Feature View feature time. Outside of that case, Tecton's timestamp filtering logic was unintuitive and a frequent source of bugs. The new logic should be simpler for most users while simultaneously providing more flexibility for power users.
  - See this batch feature view overview for more information.
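Conceptually, the automatic output filter keeps only rows whose feature timestamp falls inside the materialization period. A minimal pure-Python sketch of that behavior (illustrative helper, not a Tecton API):

```python
from datetime import datetime, timedelta

def filter_to_materialization_period(rows, window_start, window_end, ts_field="timestamp"):
    """Keep only rows whose timestamp falls in [window_start, window_end),
    mirroring the filter Tecton applies to Feature View output."""
    return [r for r in rows if window_start <= r[ts_field] < window_end]

rows = [
    {"user_id": "u1", "amt": 10.0, "timestamp": datetime(2022, 6, 1, 5)},
    {"user_id": "u2", "amt": 25.0, "timestamp": datetime(2022, 6, 2, 9)},  # outside the period
]
start = datetime(2022, 6, 1)
filtered = filter_to_materialization_period(rows, start, start + timedelta(days=1))
print([r["user_id"] for r in filtered])  # ['u1']
```

`FilteredSource` applies the same window to the Data Source *before* the transformation runs, which is what enables partition pruning.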
 
- Introduce “Incremental Backfilling” to Batch Feature Views.
  - `incremental_backfills` is a new parameter for Batch Feature Views that changes how Tecton backfills the feature view. If set to `True`, Tecton will backfill every period in the backfill window in its own job. In some cases (e.g. custom aggregations), this can lead to much simpler query definitions.
  - Rationale: Provide a means for users to easily and correctly implement Feature Views with custom aggregations.
  - See this guide for more info.
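To illustrate how `incremental_backfills=True` changes job planning, here is a rough sketch (plain Python, not Tecton internals) of splitting a backfill range into one job per `batch_schedule` period:

```python
from datetime import datetime, timedelta

def plan_backfill_jobs(feature_start_time, now, batch_schedule):
    """One job per schedule period, as with incremental_backfills=True."""
    jobs = []
    start = feature_start_time
    while start < now:
        end = min(start + batch_schedule, now)
        jobs.append((start, end))
        start = end
    return jobs

jobs = plan_backfill_jobs(datetime(2022, 6, 1), datetime(2022, 6, 8), timedelta(days=1))
print(len(jobs))  # 7 daily jobs, one per period
```

Because each job sees exactly one period, a query like "sum of transactions per user per day" can be written without window functions or period bookkeeping.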
 
- Configurable `data_delay` on Data Sources.
  - Replaces `schedule_offset`, a Feature View parameter.
  - By default, incremental (i.e. non-backfill) materialization jobs run immediately at the end of the batch schedule period. `data_delay` configures how long materialization jobs should wait to run after the end of a period, typically to ensure that all data has landed. For example, if a feature view has a `batch_schedule` of 1 day and one of its data source inputs has a `data_delay` of 1 hour, then incremental materialization jobs will run at 01:00 UTC (one hour after the period has ended).
  - Rationale: This parameter delays materialization due to upstream data delays, which logically fits as a Data Source property. Feature Views now inherit data delays from all dependent Data Sources.
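The scheduling rule from the example above can be sketched in plain Python (illustrative helper, not a Tecton API; we assume the largest delay across a Feature View's sources is what applies):

```python
from datetime import datetime, timedelta

def next_job_run_time(period_end, source_data_delays):
    """Incremental jobs run at the period end plus the max data_delay
    inherited from the Feature View's Data Sources (assumed semantics)."""
    return period_end + max(source_data_delays, default=timedelta(0))

# batch_schedule of 1 day; one source declares data_delay=1 hour.
period_end = datetime(2022, 6, 7)  # midnight UTC, end of the daily period
run_at = next_job_run_time(period_end, [timedelta(hours=1)])
print(run_at)  # 2022-06-07 01:00:00
```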
- Support custom names for aggregate features.
  - Allow users to set custom names for aggregate features. (Previously, users had to use Tecton auto-generated names like `amount_mean_7d_1d`.)
  - Example:

```python
@batch_feature_view(
    # ...
    aggregations=[
        Aggregation(
            name="transaction_amount_daily_avg",
            column="amount",
            function="mean",
            time_window=timedelta(days=1),
        ),
        Aggregation(
            name="transaction_amount_weekly_avg",
            column="amount",
            function="mean",
            time_window=timedelta(days=7),
        ),
    ],
)
def user_transaction_counts(transactions):
    return f"""
        SELECT
            user_id,
            timestamp,
            amount
        FROM {transactions}
        """
```
Non-functional Changes
- Tecton data types
  - Tecton now uses `tecton.types` when defining Feature View schemas and Request Data Sources.
  - Example:

```python
from tecton import on_demand_feature_view, RequestSource
from tecton.types import Int64, Bool, Field

transaction_request = RequestSource(schema=[Field("amount", Int64)])

@on_demand_feature_view(
    sources=[transaction_request],
    mode="python",
    schema=[Field("transaction_amount_is_high", Bool)],
)
def transaction_amount_is_high(transaction_request):
    return {"transaction_amount_is_high": transaction_request["amount"] >= 10000}
```

  - Rationale: Previously, Tecton used PySpark data types to define all schemas, which made PySpark a required dependency of the Tecton SDK; with Snowflake support, Tecton can now be used without Spark. Tecton will continue to use native data types (PySpark, Snowflake, etc.) in data-platform-specific contexts, e.g. when providing an explicit schema for a Spark Data Source.
- Use `timedelta` for duration parameters instead of pytime strings.
  - E.g. `time_window=timedelta(hours=12)` instead of `time_window="12h"`.
  - Rationale: Consistent with the API's usage of `datetime` objects, removes an API dependency on the pytime implementation, and is less ambiguous.
- Use functional style to define Feature View overrides in Feature Services.
  - Example:

```python
transaction_fraud_service = FeatureService(
    name="transaction_fraud_service",
    features=[
        # Select a subset of features from a feature view.
        transaction_features[["amount"]],
        # Rename a feature view and/or rebind its join keys. In this example, we want user features for both the
        # transaction sender and recipient, so we include the feature view twice and bind it to two different
        # feature service join keys.
        user_features.with_name("sender_features").with_join_key_map({"user_id": "sender_id"}),
        user_features.with_name("recipient_features").with_join_key_map({"user_id": "recipient_id"}),
    ],
)
```
Parameter/Class Changes
Class Renames/Changes
| 0.3 Definition | 0.4 Definition | 
|---|---|
| Data Sources | |
| BatchDataSource | BatchSource | 
| StreamDataSource | StreamSource | 
| FileDSConfig | FileConfig | 
| HiveDSConfig | HiveConfig | 
| KafkaDSConfig | KafkaConfig | 
| KinesisDSConfig | KinesisConfig | 
| RedshiftDSConfig | RedshiftConfig | 
| RequestDataSource | RequestSource | 
| SnowflakeDSConfig | SnowflakeConfig | 
| Feature Views | |
| @batch_window_aggregate_feature_view | @batch_feature_view | 
| @stream_window_aggregate_feature_view | @stream_feature_view | 
| Misc Classes | |
| FeatureAggregation | Aggregation | 
| New Classes | |
| - | AggregationMode | 
| - | KafkaOutputStream | 
| - | KinesisOutputStream | 
| - | FilteredSource | 
| Deprecated Classes in 0.3 | |
| Input | - | 
| BackfillConfig | - | 
| MonitoringConfig | - | 
Feature View/Table Parameter Changes
| 0.3 Definition | 0.4 Definition | 
|---|---|
| inputs | sources | 
| name_override | name | 
| aggregation_slide_period | aggregation_interval | 
| timestamp_key | timestamp_field | 
| batch_cluster_config | batch_compute | 
| stream_cluster_config | stream_compute | 
| online_config | online_store | 
| offline_config | offline_store | 
| output_schema | schema | 
| family | - (removed) | 
| schedule_offset | - (removed, see DataSource data_delay) | 
| monitoring.alert_email (nested) | alert_email | 
| monitoring.monitor_freshness (nested) | monitor_freshness | 
| monitoring.expected_freshness (nested) | expected_freshness | 
Data Source Parameter Changes
| 0.3 Definition | 0.4 Definition | 
|---|---|
| timestamp_column_name | timestamp_field | 
| batch_ds_config | batch_config | 
| stream_ds_config | stream_config | 
| raw_batch_translator | post_processor | 
| default_watermark_delay_threshold | watermark_delay_threshold | 
| default_initial_stream_position | initial_stream_position | 
Materialization info in `tecton plan`
`tecton plan` will now print a summary of the backfill and incremental
materialization jobs that will result from applying a plan. This feature should
help users avoid applying changes that trigger more new jobs than expected.
```
$ tecton apply
...
  + Create FeatureView
    name:            user_transaction_counts
    owner:           matt@tecton.ai
    description:     User transaction totals over a series of time windows, updated daily.
    materialization: 10 backfills, 1 recurring batch job
    > backfill:      9 Backfill jobs 2020-10-03 00:00:00 UTC to 2022-04-14 00:00:00 UTC writing to the Offline Store
                     1 Backfill job 2022-04-14 00:00:00 UTC to 2022-06-06 00:00:00 UTC writing to both the Online and Offline Store
    > incremental:   1 Recurring Batch job scheduled every 1 day writing to both the Online and Offline Store
```