Version: 0.5

Reading Feature Data for Training

Reading feature data for training is the first step in training a model.

The <feature service>.get_historical_features() function reads training data for the features in a Feature Service.

Before writing code to call the function, let's get a conceptual understanding of how the function works.

<feature service>.get_historical_features() concepts

The output of get_historical_features()

<feature service>.get_historical_features() returns a DataFrame containing:

  • Columns for all features in <feature service>, in the format <feature view name>__<feature name>.
  • For built-in aggregations, columns in the format <feature view name>__<column name in the Aggregation object>_<function name in the Aggregation object>_<time_window value in Aggregation object>_<aggregation_interval value>.
  • Any additional columns requested by the caller of get_historical_features(). This is explained in the next section.

For example, fraud_detection_feature_service.get_historical_features() returns a DataFrame whose feature columns are named as follows (other returned columns are not shown):

Feature View Name | Columns returned
--- | ---
user_credit_card_issuer | user_credit_card_issuer__credit_card_issuer
user_transaction_counts | user_transaction_counts__transaction_id_count_1d_1d, user_transaction_counts__transaction_id_count_30d_1d, user_transaction_counts__transaction_id_count_90d_1d
user_home_location | user_home_location__lat, user_home_location__long
transaction_amount_is_high | transaction_amount_is_high__transaction_amount_is_high
transaction_distance_from_home | transaction_distance_from_home__dist_km
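
The naming formats above can be sketched as a small helper. This is only an illustration: feature_column_name is a hypothetical function, not part of the Tecton SDK, and the transaction_id column and count function are inferred from the example output shown later on this page.

```python
from typing import Optional


def feature_column_name(
    feature_view: str,
    column: str,
    function: Optional[str] = None,
    time_window: Optional[str] = None,
    aggregation_interval: Optional[str] = None,
) -> str:
    """Build an output column name following the formats described above.

    Hypothetical helper for illustration only; not a Tecton API.
    """
    if function is None:
        # Plain feature: <feature view name>__<feature name>
        return f"{feature_view}__{column}"
    # Built-in aggregation:
    # <feature view name>__<column>_<function>_<time_window>_<aggregation_interval>
    return f"{feature_view}__{column}_{function}_{time_window}_{aggregation_interval}"


plain_name = feature_column_name("user_home_location", "lat")
aggregation_name = feature_column_name(
    "user_transaction_counts", "transaction_id", "count", "30d", "1d"
)
print(plain_name)
print(aggregation_name)
```

The double underscore separates the Feature View name from the feature name, which makes it easy to tell feature columns apart from the other columns in the output.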

The spine

get_historical_features() takes a spine as input. A spine is a DataFrame (consisting of rows and columns) that identifies the feature data to be read from the offline store.

get_historical_features() joins the spine to the Feature Views in the <feature service>.

Each row of a spine is known as a training event. Internally, Tecton will convert the spine into a relational database table when get_historical_features() is called.

A spine is created by the user and requires:

  • The columns that are needed to join the spine with the Feature Views in the Feature Service. These columns are:

    • The entity columns used in the feature views that the Feature Service contains. For fraud_detection_feature_service:

      Feature View | Entities Used (Specified in entities)
      --- | ---
      user_credit_card_issuer | user_id
      user_transaction_counts | user_id
      user_home_location | user_id

      transaction_amount_is_high and transaction_distance_from_home do not specify entities, because entities are not used in On-Demand Feature Views.

    • A timestamp key (always required, regardless of the <feature service> that calls get_historical_features()). For fraud_detection_feature_service, this key is timestamp, because the spine is built from the transactions data source, which has a timestamp column.

  • Columns for the input(s) of each On-Demand Feature View, if any, in the Feature Service. For fraud_detection_feature_service, these inputs are all found in the transactions data source, from which the spine is built.

    Feature View | Inputs | Required Columns in Spine
    --- | --- | ---
    transaction_amount_is_high | amt | amt
    transaction_distance_from_home | user_home_location (a Feature View), merch_lat, merch_long | merch_lat, merch_long

    Note: Columns from user_home_location are not required in the spine, because fraud_detection_feature_service already uses this Feature View.
  • Any additional columns you want to carry through from the spine to the output of get_historical_features(). Here, the is_fraud column is included because it is the label (the value that is being predicted).
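
The requirements listed above can be checked before calling get_historical_features(). The following is a minimal sketch that works on plain column names. missing_spine_columns is a hypothetical helper, not a Tecton API, and the column lists are the ones described above for fraud_detection_feature_service.

```python
from typing import Iterable, List


def missing_spine_columns(
    spine_columns: Iterable[str],
    entity_keys: Iterable[str],
    timestamp_key: str,
    on_demand_inputs: Iterable[str],
) -> List[str]:
    """Return the required columns that are absent from the spine, in order.

    Hypothetical helper for illustration only; not a Tecton API.
    """
    required = [*entity_keys, timestamp_key, *on_demand_inputs]
    present = set(spine_columns)
    return [name for name in required if name not in present]


# Columns of the example spine, including the is_fraud label.
spine_columns = ["user_id", "timestamp", "amt", "merch_lat", "merch_long", "is_fraud"]

missing = missing_spine_columns(
    spine_columns,
    entity_keys=["user_id"],
    timestamp_key="timestamp",
    on_demand_inputs=["amt", "merch_lat", "merch_long"],
)
print(missing)  # an empty list means every required column is present
```

Extra columns such as is_fraud are not checked, because they are optional and are simply passed through to the output.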

Joining the spine to the Feature Views

The following diagram shows how fraud_detection_feature_service.get_historical_features() joins the spine with the Feature Views in the Feature Service. In the diagram, the user_id and timestamp values are included for illustration purposes and are not related to the data used elsewhere in this tutorial.

The On-Demand Feature Views (transaction_amount_is_high and transaction_distance_from_home) are not included in the diagram, because they are not joined to the spine.

To save space in the diagram, Feature View names are not included in the feature name columns of the get_historical_features() output. For example, user_home_location__lat is shown as __lat.

[Diagram: get_historical_features()]

Note

The spine is joined to the Feature Views using an AS OF join (also known as a point-in-time join). For more information, see this section.
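
To illustrate the idea of an AS OF join, here is a minimal sketch in plain Python. It is not how Tecton implements the join. The data is made up, as_of_lookup is a hypothetical helper, and the choice to include a feature value whose timestamp exactly equals the event time is an assumption of this sketch.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# Feature values per entity key, sorted by the time each value became available.
# All values here are made up for illustration.
FeatureHistory = Dict[str, List[Tuple[datetime, int]]]


def as_of_lookup(
    history: FeatureHistory, user_id: str, event_time: datetime
) -> Optional[int]:
    """Return the latest feature value at or before event_time, else None."""
    rows = history.get(user_id, [])
    times = [ts for ts, _ in rows]
    # Index of the first row strictly after event_time.
    idx = bisect_right(times, event_time)
    return rows[idx - 1][1] if idx > 0 else None


history: FeatureHistory = {
    "user_1": [
        (datetime(2023, 6, 18), 3),
        (datetime(2023, 6, 19), 5),
        (datetime(2023, 6, 21), 9),  # in the future for the first event below
    ]
}

# Each spine row (training event) has an entity key and a timestamp.
spine = [
    ("user_1", datetime(2023, 6, 20, 10, 0)),
    ("user_1", datetime(2023, 6, 17, 9, 0)),
]

joined = [(uid, ts, as_of_lookup(history, uid, ts)) for uid, ts in spine]
for row in joined:
    print(row)
```

The first training event receives the value recorded on June 19, not the later value from June 21, so no information from the future leaks into the training data. The second event occurs before any feature value exists and therefore gets no value.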

Generating the output of the On-Demand Feature Views

After fraud_detection_feature_service.get_historical_features() joins the spine to the Feature Views, it creates a resultset containing that data. This resultset is incomplete because the output of the On-Demand Feature Views needs to be added (explained below).

The transaction_amount_is_high Feature View is run on each row of the resultset, with amt used as input. The transaction_amount_is_high__transaction_amount_is_high column is then added to the resultset.

The transaction_distance_from_home Feature View is run on each row of the resultset, with user_home_location__lat, user_home_location__long, merch_lat and merch_long used as inputs. The transaction_distance_from_home__dist_km column is then added to the resultset.
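
The row-by-row step described above can be sketched in plain Python. The actual Feature View definitions are not shown on this page, so the threshold of 100 and the haversine distance formula are assumptions made for illustration, and add_on_demand_features is a hypothetical helper. The output column names follow the example output tables on this page.

```python
import math

HIGH_AMOUNT_THRESHOLD = 100.0  # assumed threshold; not stated on this page
EARTH_RADIUS_KM = 6371.0


def transaction_amount_is_high(amt: float) -> bool:
    """Assumed logic: flag amounts above a fixed threshold."""
    return amt > HIGH_AMOUNT_THRESHOLD


def dist_km(home_lat: float, home_long: float, merch_lat: float, merch_long: float) -> float:
    """Great-circle distance in km using the haversine formula (assumed)."""
    phi1, phi2 = math.radians(home_lat), math.radians(merch_lat)
    dphi = phi2 - phi1
    dlambda = math.radians(merch_long - home_long)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def add_on_demand_features(row: dict) -> dict:
    """Return a copy of a joined row with the two on-demand columns added."""
    out = dict(row)
    out["transaction_amount_is_high__transaction_amount_is_high"] = (
        transaction_amount_is_high(row["amt"])
    )
    out["transaction_distance_from_home__dist_km"] = dist_km(
        row["user_home_location__lat"],
        row["user_home_location__long"],
        row["merch_lat"],
        row["merch_long"],
    )
    return out


# One joined row, using values from the example output on this page.
row = {
    "amt": 7.68,
    "merch_lat": 40.6828,
    "merch_long": -88.8084,
    "user_home_location__lat": 40.6428,
    "user_home_location__long": -89.5988,
}
result = add_on_demand_features(row)
print(result)
```

Note that transaction_distance_from_home reads user_home_location__lat and user_home_location__long from the joined row, which is why the join to user_home_location has to happen before the On-Demand Feature Views are run.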

Calling fraud_detection_feature_service.get_historical_features()

Now that you have a conceptual understanding of how <feature service>.get_historical_features() works, you will create a spine and call get_historical_features() with the spine.

Creating the spine

Create the spine by querying the data source, filtering on a time range that is specified in start_time and end_time. Select the columns as discussed in the concepts section above:

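from datetime import datetime

# ws is the Tecton workspace object (assumed to have been created earlier in this tutorial).
training_events = (
    ws.get_data_source("transactions")
    # Read only the events inside the chosen time range.
    .get_dataframe(
        start_time=datetime(2023, 6, 20, 10, 26, 41),
        end_time=datetime(2023, 6, 20, 15, 56, 0),
    )
    .to_spark()
    # Join key, timestamp key, On-Demand Feature View inputs, and the label.
    .select("user_id", "timestamp", "amt", "merch_lat", "merch_long", "is_fraud")
)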

training_events.show()

Sample Output:

user_id | timestamp | amt | merch_lat | merch_long | is_fraud
--- | --- | --- | --- | --- | ---
user_884240387242 | 2023-06-20 10:26:41 | 68.23 | 42.710006 | -78.338644 | 0
user_268514844966 | 2023-06-20 12:57:20 | 32.98 | 39.153572 | -122.36427 | 0
user_722584453020 | 2023-06-20 14:49:59 | 4.5 | 33.033236 | -105.7457 | 0
user_337750317412 | 2023-06-20 14:50:13 | 7.68 | 40.682842 | -88.808371 | 0
user_934384811883 | 2023-06-20 15:55:09 | 68.97 | 39.144282 | -96.125035 | 1

Note

You do not need to run <workspace>.get_data_source().get_dataframe() to create the spine, but the spine must meet the requirements explained previously.

Calling fraud_detection_feature_service.get_historical_features() with the spine

In your notebook, run the following code:

fraud_detection_feature_service = ws.get_feature_service("fraud_detection_feature_service")

# Use from_source=True because materialization isn't enabled.
training_data = fraud_detection_feature_service.get_historical_features(
    spine=training_events,
    timestamp_key="timestamp",
    from_source=True,
).to_spark()

training_data.show()

The following is example output from a call to fraud_detection_feature_service.get_historical_features():

user_id | timestamp | amt | merch_lat | merch_long | is_fraud | user_credit_card_issuer__credit_card_issuer | user_transaction_counts__transaction_id_count_1d_1d | user_transaction_counts__transaction_id_count_30d_1d | user_transaction_counts__transaction_id_count_90d_1d | user_home_location__lat | user_home_location__long | transaction_amount_is_high__transaction_amount_is_high | transaction_distance_from_home__dist_km
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
user_268514844966 | 2023-06-20 12:57:20 | 32.98 | 39.1536 | -122.364 | 0 | other | 2 | 20 | 51 | 46.0916 | -103.135 | False | 1746.71
user_337750317412 | 2023-06-20 14:50:13 | 7.68 | 40.6828 | -88.8084 | 0 | Visa | 0 | 10 | 55 | 40.6428 | -89.5988 | False | 66.8401
user_722584453020 | 2023-06-20 14:49:59 | 4.5 | 33.0332 | -105.746 | 0 | Discover | 2 | 30 | 95 | 32.4259 | -106.614 | False | 105.633
user_884240387242 | 2023-06-20 10:26:41 | 68.23 | 42.71 | -78.3386 | 0 | other | 0 | 27 | 101 | 38.9021 | -88.6645 | False | 966.133
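
Because feature columns contain a double underscore and the columns carried over from the spine do not, the output columns can be grouped by name. The following is a minimal sketch of that idea; group_output_columns is a hypothetical helper, not a Tecton API, and it assumes no spine column name contains a double underscore.

```python
from typing import Dict, List


def group_output_columns(columns: List[str], label: str) -> Dict[str, List[str]]:
    """Group output columns into features, label, and other spine columns.

    Hypothetical helper for illustration only; not a Tecton API.
    """
    groups: Dict[str, List[str]] = {"features": [], "label": [], "spine": []}
    for name in columns:
        if "__" in name:
            # <feature view name>__<feature name>
            groups["features"].append(name)
        elif name == label:
            groups["label"].append(name)
        else:
            groups["spine"].append(name)
    return groups


# A subset of the output columns shown above.
output_columns = [
    "user_id",
    "timestamp",
    "amt",
    "is_fraud",
    "user_credit_card_issuer__credit_card_issuer",
    "user_home_location__lat",
    "transaction_distance_from_home__dist_km",
]

groups = group_output_columns(output_columns, label="is_fraud")
print(groups["features"])
print(groups["label"])
print(groups["spine"])
```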
