FAQ: get_historical_features vs. run
Feature Views expose get_historical_features and run methods.
Method: get_historical_features
get_historical_features should be used to compute or retrieve pre-computed
offline feature data. This method will always produce accurate feature
values for a requested time range or spine. get_historical_features will
selectively retrieve pre-computed features from the offline store or compute
them from raw event data depending on whether offline materialization is
enabled. Passing from_source=True explicitly overrides this behavior and
forces computation from raw event data.
get_historical_features can be used for the following workflows:
- Generating historical training data using
  get_historical_features(spine=training_events), where training_events is a
  dataframe including historical timestamps for specific entities. This
  produces feature values as of a particular time for each requested entity,
  which can be used for model training.
- Generating batch inference data using
  get_historical_features(spine=inference_join_keys), where
  inference_join_keys is a dataframe including entities and the current
  timestamp, which produces the most recent feature data for requested
  entities.
- Inspecting offline data for a time range using
  get_historical_features(start_time=t1, end_time=t2).
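To make the spine workflows concrete, here is a minimal conceptual sketch in plain Python of an "as of" lookup. It is not the SDK's implementation: FEATURES, lookup_as_of, and features_for_spine are hypothetical names, the data is made up for illustration, and the boundary behavior (a feature row is visible at or after its own timestamp) is an assumption that may differ from what the SDK does.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical pre-computed feature rows: entity -> list of (timestamp, value),
# sorted by timestamp. Stands in for data in the offline store.
FEATURES = {
    "user_1": [
        (datetime(2023, 1, 1), 10),
        (datetime(2023, 1, 2), 15),
        (datetime(2023, 1, 3), 21),
    ],
    "user_2": [(datetime(2023, 1, 2), 7)],
}


def lookup_as_of(entity, ts):
    """Return the latest feature value at or before ts, or None if none exists."""
    rows = FEATURES.get(entity, [])
    timestamps = [t for t, _ in rows]
    idx = bisect_right(timestamps, ts)  # number of rows with timestamp <= ts
    return rows[idx - 1][1] if idx > 0 else None


def features_for_spine(spine):
    """Attach an as-of feature value to every (entity, timestamp) spine row."""
    return [(entity, ts, lookup_as_of(entity, ts)) for entity, ts in spine]


# Training: historical timestamps per entity.
training_events = [
    ("user_1", datetime(2023, 1, 2, 12)),
    ("user_2", datetime(2023, 1, 1)),
]
training_rows = features_for_spine(training_events)

# Batch inference: every entity paired with the "current" timestamp.
now = datetime(2023, 1, 10)
inference_join_keys = [("user_1", now), ("user_2", now)]
inference_rows = features_for_spine(inference_join_keys)
```

In this sketch the training spine yields the value known at each historical timestamp (and None where no feature existed yet), while the inference spine yields the most recent value for each entity.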
Method: run
run should only be used when interactively testing or debugging a
Feature View. run quite literally runs a Feature View transformation. run
is based on raw event data, but also provides the option to specify mocked data
sources.
Do not use run to generate training data since it is not guaranteed to
produce accurate feature values.
test_run is nearly identical to run, but is intended for use in unit testing
since it explicitly requires mocked data sources and a local Spark session,
and does not make any network requests. Most of this document will focus on
run, but the concepts extend to test_run.
🔑 Key Concept: get_historical_features one-to-many relationship with run
Here’s another way of considering the differences between the two methods: in
order to materialize offline data for a Feature View, the Feature View pipeline
is run on a scheduled interval (based on batch_schedule or
aggregation_interval) in a materialization job. run mimics the query that
would be run for a single materialization job for some time range. This is
why run requires a start_time and end_time, which should be aligned to 1
scheduled interval (the SDK will emit warnings if a specified time range does
not align with 1 scheduled interval).
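As a rough illustration of what "aligned to 1 scheduled interval" can mean, the sketch below checks a time range against an interval length. The helper is_single_interval is hypothetical, and the assumption that interval boundaries are counted from the Unix epoch is ours; the SDK's actual alignment check and warning logic may differ.

```python
from datetime import datetime, timedelta, timezone

# Assumption for this sketch: scheduled intervals are aligned to the Unix epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def is_single_interval(start_time, end_time, interval):
    """True if [start_time, end_time) covers exactly one epoch-aligned interval."""
    starts_on_boundary = (start_time - EPOCH) % interval == timedelta(0)
    spans_one_interval = (end_time - start_time) == interval
    return starts_on_boundary and spans_one_interval


batch_schedule = timedelta(days=1)  # stand-in for a Feature View's batch_schedule


def utc(*args):
    """Build a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


aligned = is_single_interval(utc(2023, 1, 1), utc(2023, 1, 2), batch_schedule)
shifted = is_single_interval(utc(2023, 1, 1, 6), utc(2023, 1, 2, 6), batch_schedule)
too_long = is_single_interval(utc(2023, 1, 1), utc(2023, 1, 3), batch_schedule)
```

Here only the first range is aligned: the second has the right length but starts mid-interval, and the third spans two intervals.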
Finally, training data produced by get_historical_features is based on the
results of one or more materialization job runs, each of which corresponds
to what a single call to run would compute for its time range.
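The one-to-many relationship can be sketched in plain Python as follows. This is a conceptual model only: run_one_interval and historical_features are hypothetical stand-ins (not SDK functions), the "transformation" is a simple per-entity sum chosen for illustration, and the event data is made up.

```python
from datetime import datetime, timedelta


def run_one_interval(events, start_time, end_time):
    """Stand-in for one materialization job: sum amounts per entity
    for events with start_time <= timestamp < end_time."""
    totals = {}
    for entity, ts, amount in events:
        if start_time <= ts < end_time:
            totals[entity] = totals.get(entity, 0) + amount
    # Each output row is stamped with the end of the interval it summarizes.
    return [(entity, end_time, total) for entity, total in sorted(totals.items())]


def historical_features(events, start_time, end_time, interval):
    """Stand-in for offline data over a range: the combined output of one
    job per scheduled interval in [start_time, end_time)."""
    rows = []
    cursor = start_time
    while cursor < end_time:
        rows.extend(run_one_interval(events, cursor, cursor + interval))
        cursor += interval
    return rows


# Made-up raw events: (entity, timestamp, amount).
events = [
    ("user_1", datetime(2023, 1, 1, 9), 5),
    ("user_1", datetime(2023, 1, 1, 17), 3),
    ("user_2", datetime(2023, 1, 2, 8), 4),
    ("user_1", datetime(2023, 1, 2, 10), 2),
]
day = timedelta(days=1)

# One "run": a single scheduled interval.
single_job = run_one_interval(events, datetime(2023, 1, 1), datetime(2023, 1, 2))

# Two days of offline data: the output of two such jobs combined.
offline_rows = historical_features(events, datetime(2023, 1, 1), datetime(2023, 1, 3), day)
```

The single job covers one interval, while the two-day range is assembled from two jobs, which is why one get_historical_features call can map to many run-style computations.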