Version: Beta 🚧

Splitting and Transforming Feature Data

In this topic, you will split the training data retrieved earlier (using fraud_detection_feature_service.get_historical_features()) into training and testing data sets and then transform the data for use by the model.

note

The code shown on this page is model-related rather than feature-related. Therefore, a notebook, or another location outside of a Tecton feature repository, is the appropriate place to store this code.

In your notebook, run the following code to import the needed modules:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn import metrics
from sklearn.metrics import mean_squared_error

Splitting the feature data

The following code splits the training data retrieved earlier (using fraud_detection_feature_service.get_historical_features()) into training and testing data sets. Each set has an "X" component (the features) and a "y" component (the label):

  • X_train: Contains the training data for the features
  • y_train: Contains the training data for is_fraud (the value that is being predicted)
  • X_test: Contains the testing data for the features
  • y_test: Contains the testing data for is_fraud (the value that is being predicted)

Run this code in your notebook:

training_data_pd = training_data.drop(
    "user_id",
    "merchant",
    "transaction_id",
    "timestamp",
    "amt",
    "merch_lat",
    "merch_long",
).toPandas()

y = training_data_pd["is_fraud"]
x = training_data_pd.drop("is_fraud", axis=1)

X_train, X_test, y_train, y_test = train_test_split(x, y)
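By default, train_test_split holds out 25% of the rows for testing and shuffles them differently on each run. The following is a minimal sketch on a made-up data set (the toy_ names, the column names, and the values are all hypothetical) showing how the test_size, random_state, and stratify parameters could be used to make the split reproducible and to keep the proportion of fraud labels similar in both sets. Whether this is appropriate for your data is a judgment call; the tutorial code above uses the defaults.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 10 of which are labeled as fraud.
toy = pd.DataFrame(
    {
        "feature_a": range(100),
        "feature_b": [i * 0.5 for i in range(100)],
        "is_fraud": [1 if i % 10 == 0 else 0 for i in range(100)],
    }
)

toy_y = toy["is_fraud"]
toy_x = toy.drop("is_fraud", axis=1)

# test_size=0.25 matches the default; random_state makes the split
# repeatable; stratify keeps the fraud rate similar in both sets.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

print(toy_X_train.shape, toy_X_test.shape)
```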

Transforming the feature data

In this section, you will apply a few transformations to the feature data that will be used by the model.

Reordering the columns in the training data set

First, you will reorder the columns in the training data set to match the column order of the inference data set that you will read later.

Feature data that is read for inference is returned with features in the following order:

  • For On-Demand Feature Views, feature ordering is the same as the order of the fields in the output_schema that is defined in the Feature View.
  • For Batch and Stream Feature Views, feature ordering is alphabetical.
  • When feature data is generated by multiple feature views, On-Demand Feature Views are ordered first (in alphabetical order) followed by the others (in alphabetical order).
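The ordering rules above can be sketched as a small helper. This is an illustration only, not a Tecton API: the function expected_feature_order and its input format are hypothetical, the "view__feature" naming is taken from the column names used on this page, and which Feature Views are On-Demand is an assumption made for the example.

```python
def expected_feature_order(feature_views):
    """Return column names in the order described by the rules above.

    feature_views is a list of dicts (a hypothetical format) with keys:
      name       - the Feature View name
      on_demand  - True for an On-Demand Feature View
      features   - feature names; for On-Demand views, in output_schema order
    """
    on_demand = sorted(
        (fv for fv in feature_views if fv["on_demand"]), key=lambda fv: fv["name"]
    )
    others = sorted(
        (fv for fv in feature_views if not fv["on_demand"]), key=lambda fv: fv["name"]
    )

    ordered = []
    # On-Demand views come first and keep their output_schema field order.
    for fv in on_demand:
        ordered += [f"{fv['name']}__{feature}" for feature in fv["features"]]
    # Batch and Stream views follow, with features sorted alphabetically.
    for fv in others:
        ordered += [f"{fv['name']}__{feature}" for feature in sorted(fv["features"])]
    return ordered


# Assumed classification of views, for illustration only.
example_views = [
    {"name": "user_home_location", "on_demand": False, "features": ["long", "lat"]},
    {"name": "transaction_distance_from_home", "on_demand": True, "features": ["dist_km"]},
    {"name": "transaction_amount_is_high", "on_demand": True,
     "features": ["transaction_amount_is_high"]},
]
example_order = expected_feature_order(example_views)
print(example_order)
```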

Reorder the columns in the training and testing data to match those of the inference data (that you will read later) by running the following code in your notebook:

reorder_columns = [
    "transaction_amount_is_high__transaction_amount_is_high",
    "transaction_distance_from_home__dist_km",
    "user_credit_card_issuer__user_credit_card_issuer",
    "user_home_location__lat",
    "user_home_location__long",
    "user_transaction_counts__transaction_id_count_1d_1d",
    "user_transaction_counts__transaction_id_count_30d_1d",
    "user_transaction_counts__transaction_id_count_90d_1d",
]
X_train = X_train.reindex(columns=reorder_columns)
X_test = X_test.reindex(columns=reorder_columns)
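One thing to watch for: DataFrame.reindex does not raise an error when a requested column name is missing. Instead, it adds that column filled with NaN, so a typo in the column list can go unnoticed. The following sketch shows this behavior on a toy data frame and a stricter alternative. The helper reorder_strict is a hypothetical name introduced here, not part of pandas or Tecton.

```python
import pandas as pd


def reorder_strict(df, columns):
    """Reorder columns, raising KeyError if any requested column is absent."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in data frame: {missing}")
    return df.reindex(columns=columns)


toy = pd.DataFrame({"b": [1, 2], "a": [3, 4]})

# reindex quietly creates an all-NaN column for the misspelled name.
silently_padded = toy.reindex(columns=["a", "b", "typo"])
print(silently_padded["typo"].isna().all())

# The strict helper reorders when every name exists...
reordered = reorder_strict(toy, ["a", "b"])
print(list(reordered.columns))

# ...and fails loudly otherwise.
try:
    reorder_strict(toy, ["a", "typo"])
except KeyError as err:
    print("caught:", err)
```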

The remainder of the transformations

The remainder of the transformations are included in the following code. These transformations operate on numeric and categorical values, and are built using sklearn pipelines. Run this code in your notebook.

# Get the names of the numeric columns and of the categorical (string) columns.
num_cols = X_train.select_dtypes(exclude=["object"]).columns.tolist()
cat_cols = X_train.select_dtypes(include=["object"]).columns.tolist()

# Create a pipeline num_pipe to transform numerical values. SimpleImputer
# fills in missing values with the column median and StandardScaler
# standardizes the data.
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Create a pipeline cat_pipe to transform categorical (string) values.
# SimpleImputer fills in missing values with N/A and OneHotEncoder encodes each of
# the categorical columns as binary indicator columns.
# Note: scikit-learn 1.2 renamed OneHotEncoder's sparse parameter to
# sparse_output. On older versions, pass sparse=False instead.
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="N/A"),
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),
)

# Combine the num_pipe and cat_pipe pipelines into one transformer that applies
# each pipeline to its own set of columns.
full_pipe = ColumnTransformer([("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)])
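To see what a transformer like this produces, here is a minimal sketch that rebuilds the same structure on a tiny made-up data frame. The toy_ names, the column names, and the values are hypothetical. To avoid the version-specific OneHotEncoder parameter, the sketch leaves the encoder at its defaults and instead passes sparse_threshold=0 to ColumnTransformer, which asks for a dense result.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy training data with one missing value in each kind of column.
toy_train = pd.DataFrame(
    {
        "count_1d": [1.0, 3.0, np.nan, 5.0],
        "dist_km": [10.0, 20.0, 30.0, 40.0],
        "issuer": pd.Series(["visa", "mc", np.nan, "visa"], dtype=object),
    }
)
toy_num_cols = ["count_1d", "dist_km"]
toy_cat_cols = ["issuer"]

toy_num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
toy_cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="N/A"),
    OneHotEncoder(handle_unknown="ignore"),
)
toy_full_pipe = ColumnTransformer(
    [("num", toy_num_pipe, toy_num_cols), ("cat", toy_cat_pipe, toy_cat_cols)],
    sparse_threshold=0,  # always return a dense array
)

# Two scaled numeric columns plus one indicator column per category seen
# during fitting ("N/A", "mc", "visa").
train_matrix = toy_full_pipe.fit_transform(toy_train)
print(train_matrix.shape)

# A category not seen during fitting is encoded as all zeros because of
# handle_unknown="ignore".
toy_new = pd.DataFrame(
    {
        "count_1d": [2.0],
        "dist_km": [25.0],
        "issuer": pd.Series(["amex"], dtype=object),
    }
)
new_matrix = toy_full_pipe.transform(toy_new)
print(new_matrix[0, 2:])
```

Note that the transformer is fitted on the training rows only and then reused as-is on new rows, so the medians, scaling statistics, and category list all come from the training data.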

On the next page, you will create and train a model using the training data that you transformed here.
