Creating Feature 2
In this topic, you will create and test the second feature, user_transaction_counts. This feature calculates the number of transactions per user over the last 1 day, 30 days, and 90 days.
In your local feature repository, open the file features/batch_features/user_transaction_counts.py. In the file, uncomment the following code, which is the definition of the Feature View:
from tecton import batch_feature_view, FilteredSource, Aggregation
from entities import user
from data_sources.transactions import transactions
from datetime import datetime, timedelta


@batch_feature_view(
    sources=[FilteredSource(transactions)],
    entities=[user],
    mode="spark_sql",
    aggregation_interval=timedelta(days=1),
    aggregations=[
        Aggregation(column="transaction_id", function="count", time_window=timedelta(days=1)),
        Aggregation(column="transaction_id", function="count", time_window=timedelta(days=30)),
        Aggregation(column="transaction_id", function="count", time_window=timedelta(days=90)),
    ],
    online=True,
    offline=True,
    feature_start_time=datetime(2021, 1, 1),
    description="User transaction totals over a series of time windows, updated daily.",
    name="user_transaction_counts",
)
def user_transaction_counts(transactions):
    return f"""
        SELECT
            user_id,
            transaction_id,
            timestamp
        FROM
            {transactions}
        """
In your terminal, run tecton apply to apply this Feature View to your workspace.
The Feature View's transformation
The aggregations parameter
The @batch_feature_view
decorator contains the aggregations
parameter. The
presence of this parameter indicates that a Feature View uses one or more
built-in aggregations. Built-in aggregations are much easier to use than
defining the equivalent aggregations on your own.
The aggregations parameter value specifies three Aggregation objects, which define three built-in aggregations. An Aggregation object takes three inputs: the column to perform the aggregation on, a function to apply to the column, and a time_window, which is the time period that the aggregation runs against.
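For illustration only, the following sketch constructs a standalone Aggregation with a hypothetical 7-day window; it is not part of this tutorial's Feature View, but it shows the three inputs described above.
from datetime import timedelta
from tecton import Aggregation

# Hypothetical example (not used in this tutorial): count transactions
# over a 7-day window.
weekly_count = Aggregation(
    column="transaction_id",        # the column to perform the aggregation on
    function="count",               # the built-in function to apply to the column
    time_window=timedelta(days=7),  # the time period the aggregation runs against
)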
Further reading on aggregations is available in the Tecton documentation.
The transformation function
Unlike the credit_card_issuer
transformation function shown previously, the
user_transaction_counts
transformation function does not implement the
transformation logic because its associated Feature View uses built-in
aggregations.
The columns in the SELECT statement of the user_transaction_counts transformation function specify inputs to send to the Aggregations, as follows:
Column number in the SELECT statement | Description | SELECT column in the user_transaction_counts function |
---|---|---|
1 | The column that the Aggregation groups by. This is also the entity's join key; entities are used as join keys when multiple features are joined together. You will see an example of this in part 2 of the tutorial. | user_id |
2 | The column named in the Aggregation (the column to aggregate). | transaction_id |
3 | The timestamp field from the external data source. | timestamp |
Internally, the built-in Aggregation with time_window=timedelta(days=30) is translated into a SQL statement that is nearly equivalent to:
SELECT
    user_id,
    COUNT(transaction_id),
    timestamp
FROM
    {transactions}
WHERE
    timestamp >= [start timestamp of the current materialization time window] - INTERVAL 30 DAYS
    AND timestamp < [end timestamp of the current materialization time window]
GROUP BY
    user_id
Feature View output
When the Feature View runs, it outputs each aggregation in the following format.
<column name in the Aggregation>_<function name in the Aggregation>_<time_window value in Aggregation>_<aggregation_interval value>
For example, when the user_transaction_counts Feature View runs, the column name for the 30-day aggregation is transaction_id_count_30d_1d. You will see the output for all of the Feature View columns when testing the Feature View in the next section.
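As a quick sketch of this naming convention (plain Python string formatting, not a Tecton API), the expected column names for this Feature View's three aggregations can be assembled as follows.
# Build the expected output column names using the
# <column>_<function>_<time_window>_<aggregation_interval> pattern.
time_window_days = [1, 30, 90]      # the three time_window values
aggregation_interval_days = 1       # aggregation_interval=timedelta(days=1)

for days in time_window_days:
    print(f"transaction_id_count_{days}d_{aggregation_interval_days}d")

# transaction_id_count_1d_1d
# transaction_id_count_30d_1d
# transaction_id_count_90d_1d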
Test the Feature View
To test the Feature View interactively, follow these steps. Note that a unit test is not shown.
In your notebook, get the Feature View from the workspace:
fv = ws.get_feature_view("user_transaction_counts")
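If the ws workspace handle is not already defined in your notebook from an earlier part of the tutorial, you can obtain it with tecton.get_workspace; the workspace name below is a placeholder for your own workspace.
import tecton

# Placeholder workspace name; use the workspace that you applied the Feature View to.
ws = tecton.get_workspace("my-workspace")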
In your notebook, call the run method of the Feature View to get feature data for the time range 2022-01-01 to 2022-04-10, and display the generated feature values.
offline_features = fv.run(datetime(2022, 1, 1), datetime(2022, 4, 10)).to_spark().limit(10)
offline_features.show()
Sample Output:
user_id | timestamp | transaction_id_count_1d_1d | transaction_id_count_30d_1d | transaction_id_count_90d_1d |
---|---|---|---|---|
user_131340471060 | 2022-01-02 00:00:00 | 1 | 1 | 1 |
user_131340471060 | 2022-01-03 00:00:00 | 1 | 2 | 2 |
user_131340471060 | 2022-01-04 00:00:00 | 1 | 3 | 3 |
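As an optional sanity check (a sketch, not a step in the tutorial), you can confirm that the three aggregation columns are present in the result.
# offline_features is the Spark DataFrame created above.
expected_columns = {
    "transaction_id_count_1d_1d",
    "transaction_id_count_30d_1d",
    "transaction_id_count_90d_1d",
}
missing = expected_columns - set(offline_features.columns)
assert not missing, f"Missing expected feature columns: {missing}"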
Materialization scheduling
The aggregation_interval parameter specifies how often to run the materialization jobs for Feature Views that use built-in aggregations.
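To make the schedule concrete, here is the relevant setting from the Feature View above; the comment summarizing the job behavior is a paraphrase of this section, not additional configuration.
from datetime import timedelta

# With aggregation_interval=timedelta(days=1), Tecton runs one materialization
# job per day; each job processes the newly arrived day of transactions and
# updates the rolling 1-day, 30-day, and 90-day counts.
aggregation_interval = timedelta(days=1)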