Reading Feature Data for Training
Reading feature data for training is the first step in training a model.
The <feature service>.get_historical_features() function reads training data
for the features in a Feature Service.
Before writing code to call the function, let's get a conceptual understanding of how the function works.
<feature service>.get_historical_features() concepts
The output of get_historical_features()
<feature service>.get_historical_features() returns a DataFrame containing:
- Columns for all features in <feature service>, in the format <feature view name>__<feature name>.
- For built-in aggregations, columns in the format <feature view name>__<column name in the Aggregation object>_<function name in the Aggregation object>_<time_window value in Aggregation object>_<aggregation_interval value>.
- Additional columns, as requested by the caller of get_historical_features(). This is explained in the next section.
fraud_detection_feature_service.get_historical_features() returns a
DataFrame containing columns for the features in the following format. (Other
columns that are returned are not shown).
| Feature View Name | Columns returned |
|---|---|
| user_credit_card_issuer | user_credit_card_issuer__credit_card_issuer |
| user_transaction_counts | user_transaction_counts__transaction_id_count_1d_1d, user_transaction_counts__transaction_id_count_30d_1d, user_transaction_counts__transaction_id_count_90d_1d |
| user_home_location | user_home_location__lat, user_home_location__long |
| transaction_amount_is_high | transaction_amount_is_high__transaction_amount_is_high |
| transaction_distance_from_home | transaction_distance_from_home__dist_km |
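The naming rules above can be sketched in plain Python. The helpers `feature_column` and `aggregation_column` are hypothetical names used only for illustration and are not part of the Tecton SDK. The `transaction_id` column and `count` function are inferred from the example output at the end of this tutorial.

```python
def feature_column(feature_view: str, feature: str) -> str:
    """Build <feature view name>__<feature name>."""
    return f"{feature_view}__{feature}"


def aggregation_column(
    feature_view: str,
    column: str,
    function: str,
    time_window: str,
    aggregation_interval: str,
) -> str:
    """Build the column name for a built-in aggregation feature."""
    feature = f"{column}_{function}_{time_window}_{aggregation_interval}"
    return feature_column(feature_view, feature)


print(feature_column("user_home_location", "lat"))
# user_home_location__lat
print(aggregation_column("user_transaction_counts", "transaction_id", "count", "30d", "1d"))
# user_transaction_counts__transaction_id_count_30d_1d
```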
The spine
get_historical_features() takes a spine as input. A spine is a DataFrame
(consisting of rows and columns) that identifies the feature data to be read
from the offline store.
get_historical_features() joins the spine to the Feature Views in the
<feature service>.
Each row of a spine is known as a training event. Internally, Tecton will
convert the spine into a relational database table when
get_historical_features() is called.
A spine is created by the user and requires:
- The columns that are needed to join the spine with the Feature Views in the Feature Service. These columns are:
  - The entity columns used in the feature views that the Feature Service contains. For fraud_detection_feature_service:

    | Feature View | Entities Used (Specified in entities) |
    |---|---|
    | user_credit_card_issuer | user_id |
    | user_transaction_counts | user_id |
    | user_home_location | user_id |

    transaction_amount_is_high and transaction_distance_from_home do not specify the entities column, because entities are not used in On-Demand Feature Views.
  - A timestamp key (always required, regardless of the <feature service> that calls get_historical_features()). For fraud_detection_feature_service, this key is timestamp, because the spine is built from the transactions data source, which has a timestamp column.
- Columns for the input(s) for each On-Demand Feature View, if any, in the Feature Service. For fraud_detection_feature_service, these inputs are all found in the transactions data source that the spine is built against.

  | Feature View | Inputs | Required Columns in Spine |
  |---|---|---|
  | transaction_amount_is_high | amt | amt |
  | transaction_distance_from_home | user_home_location (a Feature View), merch_lat, merch_long | merch_lat, merch_long |

  Note: Columns from user_home_location are not required in the spine, because the fraud_detection_feature_service service already uses this feature view.
- Any additional columns you want to include in the spine output and output of get_historical_features(). Here, the is_fraud column is included, because this is the label (the value that is being predicted).
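As a rough illustration of these requirements, the sketch below checks a list of spine column names against the columns needed by fraud_detection_feature_service. The helper `missing_spine_columns` and the constants are hypothetical and not part of the Tecton SDK. The column names are the ones listed above.

```python
# Columns the spine needs for fraud_detection_feature_service, per the lists above.
ENTITY_COLUMNS = {"user_id"}
TIMESTAMP_KEY = "timestamp"
ON_DEMAND_INPUT_COLUMNS = {"amt", "merch_lat", "merch_long"}


def missing_spine_columns(spine_columns):
    """Hypothetical helper: return the required columns that are absent, sorted."""
    required = ENTITY_COLUMNS | {TIMESTAMP_KEY} | ON_DEMAND_INPUT_COLUMNS
    return sorted(required - set(spine_columns))


# is_fraud is an optional extra column (the label); it is not required.
columns = ["user_id", "timestamp", "amt", "merch_lat", "merch_long", "is_fraud"]
print(missing_spine_columns(columns))
# []
print(missing_spine_columns(["user_id", "timestamp", "merch_lat", "merch_long"]))
# ['amt']
```

With a Spark or pandas spine, the same check could be run against the DataFrame's `columns` attribute before calling get_historical_features().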
Joining the spine to the Feature Views
The following diagram shows how
fraud_detection_feature_service.get_historical_features() joins the spine with
the Feature Views in the Feature Service. In the diagram, the user_id and
timestamp values are included for illustration purposes and are not related to
the data used elsewhere in this tutorial.
The On-Demand Feature Views (transaction_amount_is_high and
transaction_distance_from_home) are not included in the diagram, because they
are not joined to the spine.
To save space in the diagram, Feature View names are not included in the feature
name columns of the get_historical_features() output. For example,
user_home_location__lat is shown as __lat.
When the spine is joined to the Feature Views, an AS OF join (also known as a point-in-time join) is used: each spine row receives the feature values that were current as of that row's timestamp. For more information, see this section.
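To make the idea of an AS OF join concrete, here is a minimal sketch in plain Python. It is not how Tecton implements the join; it only shows the lookup rule. The data values, the `feature_rows` structure, and the `as_of_lookup` helper are made up for illustration, and treating a feature value stamped exactly at the event time as visible is an assumption.

```python
from bisect import bisect_right
from datetime import datetime

# Feature values per user as (timestamp, value), sorted by timestamp. Made-up data.
feature_rows = {
    "u1": [(datetime(2023, 6, 1), 10.0), (datetime(2023, 6, 4), 11.0)],
    "u2": [(datetime(2023, 6, 4), 20.0)],
}


def as_of_lookup(user_id, event_time):
    """Return the latest value with timestamp <= event_time, or None if there is none."""
    rows = feature_rows.get(user_id, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, event_time)
    return rows[idx - 1][1] if idx > 0 else None


# Spine: one training event (user_id, timestamp) per row.
spine = [
    ("u1", datetime(2023, 6, 2)),
    ("u1", datetime(2023, 6, 5)),
    ("u2", datetime(2023, 6, 3)),
]

joined = [(user_id, ts, as_of_lookup(user_id, ts)) for user_id, ts in spine]
for row in joined:
    print(row)
```

The first event picks up the value from June 1, the second picks up the value from June 4, and the third gets no value because the only feature row for `u2` is later than the event. This is what prevents future data from leaking into training rows.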
Generating the output of the On-Demand Feature Views
After fraud_detection_feature_service.get_historical_features() joins the
spine to the Feature Views, it creates a result set containing that data. This
result set is incomplete because the output of the On-Demand Feature Views still
needs to be added, as follows.
The transaction_amount_is_high Feature View is run on each row of the
result set, with amt used as input. The
transaction_amount_is_high__transaction_amount_is_high column is then added to
the result set.
The transaction_distance_from_home Feature View is run on each row of the
result set, with user_home_location__lat, user_home_location__long,
merch_lat and merch_long used as inputs. The
transaction_distance_from_home__dist_km column is then added to the result set.
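The row-by-row step described above might look roughly like the following sketch. The function bodies are assumptions made for illustration: the real threshold and distance logic are defined in the On-Demand Feature Views themselves, which are not shown in this section. Here the threshold is a placeholder value and the distance is computed with the haversine formula. The sample row reuses values from the example output in this tutorial.

```python
import math

AMOUNT_THRESHOLD = 100.0  # assumption: placeholder, not the Feature View's real threshold


def transaction_amount_is_high(amt):
    """Assumed logic: flag amounts above a fixed threshold."""
    return amt > AMOUNT_THRESHOLD


def dist_km(lat1, lon1, lat2, lon2):
    """Assumed logic: great-circle (haversine) distance in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# One row of the joined result set, before the On-Demand outputs are added.
row = {
    "amt": 7.68,
    "merch_lat": 40.6828,
    "merch_long": -88.8084,
    "user_home_location__lat": 40.6428,
    "user_home_location__long": -89.5988,
}

# Each On-Demand Feature View reads its inputs from the row and adds one output column.
row["transaction_amount_is_high__transaction_amount_is_high"] = transaction_amount_is_high(row["amt"])
row["transaction_distance_from_home__dist_km"] = dist_km(
    row["user_home_location__lat"],
    row["user_home_location__long"],
    row["merch_lat"],
    row["merch_long"],
)
print(row)
```

Note that the distance function takes two of its inputs from columns produced by the join (user_home_location__lat and user_home_location__long), which is why this step runs after the join rather than before it.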
Calling fraud_detection_feature_service.get_historical_features()
Now that you have a conceptual understanding of how
<feature service>.get_historical_features() works, you will create a spine and
call get_historical_features() with the spine.
Creating the spine
Create the spine by querying the data source, filtering on a time range that is
specified in start_time and end_time. Select the columns as discussed in the
concepts section above:
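```python
from datetime import datetime

# `ws` is assumed to be the workspace object created earlier in the tutorial.
training_events = (
    ws.get_data_source("transactions")
    .get_dataframe(
        start_time=datetime(2023, 6, 20, 10, 26, 41),
        end_time=datetime(2023, 6, 20, 15, 56, 0),
    )
    .to_spark()
    # Entity column, timestamp key, On-Demand Feature View inputs, and the label.
    .select("user_id", "timestamp", "amt", "merch_lat", "merch_long", "is_fraud")
)
training_events.show()
```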
Sample Output:
| user_id | timestamp | amt | merch_lat | merch_long | is_fraud |
|---|---|---|---|---|---|
| user_884240387242 | 2023-06-20 10:26:41 | 68.23 | 42.710006 | -78.338644 | 0 |
| user_268514844966 | 2023-06-20 12:57:20 | 32.98 | 39.153572 | -122.36427 | 0 |
| user_722584453020 | 2023-06-20 14:49:59 | 4.5 | 33.033236 | -105.7457 | 0 |
| user_337750317412 | 2023-06-20 14:50:13 | 7.68 | 40.682842 | -88.808371 | 0 |
| user_934384811883 | 2023-06-20 15:55:09 | 68.97 | 39.144282 | -96.125035 | 1 |
You do not need to run <workspace>.get_data_source().get_dataframe() to create
the spine, but the spine must meet the
requirements explained previously.
Calling fraud_detection_feature_service.get_historical_features() with the spine
In your notebook, run the following code:
```python
fraud_detection_feature_service = ws.get_feature_service("fraud_detection_feature_service")

# Use from_source=True because materialization isn't enabled
training_data = fraud_detection_feature_service.get_historical_features(
    spine=training_events, timestamp_key="timestamp", from_source=True
).to_spark()
training_data.show()
```
The following is example output from a call to
fraud_detection_feature_service.get_historical_features():
| user_id | timestamp | amt | merch_lat | merch_long | is_fraud | user_credit_card_issuer__credit_card_issuer | user_transaction_counts__transaction_id_count_1d_1d | user_transaction_counts__transaction_id_count_30d_1d | user_transaction_counts__transaction_id_count_90d_1d | user_home_location__lat | user_home_location__long | transaction_amount_is_high__transaction_amount_is_high | transaction_distance_from_home__dist_km |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_268514844966 | 2023-06-20 12:57:20 | 32.98 | 39.1536 | -122.364 | 0 | other | 2 | 20 | 51 | 46.0916 | -103.135 | False | 1746.71 |
| user_337750317412 | 2023-06-20 14:50:13 | 7.68 | 40.6828 | -88.8084 | 0 | Visa | 0 | 10 | 55 | 40.6428 | -89.5988 | False | 66.8401 |
| user_722584453020 | 2023-06-20 14:49:59 | 4.5 | 33.0332 | -105.746 | 0 | Discover | 2 | 30 | 95 | 32.4259 | -106.614 | False | 105.633 |
| user_884240387242 | 2023-06-20 10:26:41 | 68.23 | 42.71 | -78.3386 | 0 | other | 0 | 27 | 101 | 38.9021 | -88.6645 | False | 966.133 |