Version: Beta 🚧

Creating Data Sources

Info: Unless noted otherwise, paste the code in this tutorial into a notebook and run it there.

In this topic, you will create two data sources that will be used by the features that you create later.

A data source refers to a Tecton object (e.g. a BatchSource or StreamSource) that references an external data source such as a Hive table, CSV file, or Kinesis stream.

In the general case, a Tecton data source maps to a single external table or stream.

Read data from the external data source

In this tutorial we will create two Batch Sources:

  1. transactions: Contains facts about historical customer transactions
  2. customers: Contains information about customers such as their name, city, and address

These sources will reference parquet files from a public S3 bucket that Tecton manages.

Let's first verify that you can read data from the transactions and customers files from your notebook:

Read from the transactions file

spark.read.parquet("s3://tecton.ai.public/tutorials/fraud_demo/transactions/data.pq").show()

Example output:

| user_id | transaction_id | category | amt | is_fraud | merchant | merch_lat | merch_long | timestamp |
|---|---|---|---|---|---|---|---|---|
| user_884240387242 | 80883eb88afb219c9... | gas_transport | 68.23 | 0 | fraud_Kutch, Herm... | 42.710006 | -78.338644 | 2023-06-20 10:26:41 |
| user_268514844966 | 5fc672e23b9193f97... | misc_pos | 32.98 | 0 | fraud_Lehner, Rei... | 39.153572 | -122.36427 | 2023-06-20 12:57:20 |
| user_722584453020 | 01bddb7a41ce2d16a... | home | 4.5 | 0 | fraud_Koss, Hanse... | 33.033236 | -105.7457 | 2023-06-20 14:49:59 |

Read from the customers file

spark.read.parquet("s3://tecton.ai.public/tutorials/fraud_demo/customers/data.pq").show()

Example output:

| ssn | cc_num | first | last | gender | street | city | state | zip | lat | long | city_pop | job | dob | user_id | signup_timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 647-66-4497 | 4979481248514730 | Amanda | Brown | F | 3071 Barnes Alley | Minneapolis | MN | 55447 | 45.0033 | -93.4875 | 1022298 | Restaurant manage... | 2003-02-27 | user_709462196403 | 2017-04-06 00:50:31 |
| 156-89-3580 | 6011823734714909 | Jessica | Smith | F | 572 Jennifer Manor | Portage | MI | 49002 | 42.1938 | -85.5639 | 47338 | Publishing rights... | 1989-07-30 | user_687958452057 | 2017-05-08 16:07:51 |
| 777-29-0872 | 213115913848502 | Anthony | Bishop | M | 890 James Orchard... | Edgewood | IL | 62426 | 38.9021 | -88.6645 | 1085 | Technical sales e... | 1990-07-30 | user_884240387242 | 2017-06-15 19:33:18 |
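Before defining the sources, it can help to confirm that each file exposes the columns the later steps rely on, in particular the timestamp columns. The following is a minimal sketch in plain Python, not part of Tecton or Spark: `missing_columns` is a hypothetical helper, and the column lists are copied from the example outputs above.

```python
# Column names copied from the example outputs above.
TRANSACTIONS_COLUMNS = [
    "user_id", "transaction_id", "category", "amt", "is_fraud",
    "merchant", "merch_lat", "merch_long", "timestamp",
]
CUSTOMERS_COLUMNS = [
    "ssn", "cc_num", "first", "last", "gender", "street", "city", "state",
    "zip", "lat", "long", "city_pop", "job", "dob", "user_id",
    "signup_timestamp",
]


def missing_columns(actual, expected):
    """Return the expected column names that are absent from `actual`.

    Hypothetical helper for illustration; the result keeps the order
    of `expected`.
    """
    actual_set = set(actual)
    return [name for name in expected if name not in actual_set]


# An empty list means every expected column is present.
print(missing_columns(TRANSACTIONS_COLUMNS, ["user_id", "timestamp"]))
print(missing_columns(["user_id", "amt"], ["user_id", "timestamp"]))
```

In the notebook, `actual` could be the `columns` attribute of the Spark DataFrame returned by `spark.read.parquet(...)` for either file.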

Define a BatchSource for the transactions file

In your local feature repository, open the data_sources/transactions.py file.

Then uncomment the following code and save the file:

from tecton import BatchSource, FileConfig

transactions = BatchSource(
    name="transactions",
    batch_config=FileConfig(
        uri="s3://tecton.ai.public/tutorials/fraud_demo/transactions/data.pq",
        file_format="parquet",
        timestamp_field="timestamp",
    ),
)

Register the transactions data source

In your terminal, run tecton apply to register the new data source with the workspace you created and selected during setup.

tecton apply

You should see the following output:

✅ Collecting local feature declarations
✅ Performing server-side feature validation: : Initializing.
↓↓↓↓↓↓↓↓↓↓↓↓ Plan Start ↓↓↓↓↓↓↓↓↓↓

+ Create BatchDataSource
name: transactions

↑↑↑↑↑↑↑↑↑↑↑↑ Plan End ↑↑↑↑↑↑↑↑↑↑↑↑
Generated plan ID is <plan ID>
View your plan in the Web UI: <Web UI URL>
Are you sure you want to apply this plan to: "<workspace name>"? [y/N]>

Hit y to apply your new data source definition.

If you open the Tecton Web UI and select your new workspace, you will see that the data source has been registered.

Useful CLI Commands

You can view the plan of potential changes to a workspace using tecton plan. tecton apply will first show you a plan before asking if you want to apply the changes.

You can list the workspaces in your account using tecton workspace list.

You can select the workspace to apply to using tecton workspace select <workspace-name>.

Test the transactions data source

Now that the data source has been defined, we want to make sure we can read from it through Tecton.

First get the transactions data source from the workspace and then call the get_dataframe() method to retrieve data.

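from datetime import datetime

# `ws` is assumed to be the workspace object created during setup.
transactions_ds = ws.get_data_source("transactions")
# Restrict the read to the window from 2022-01-01 to 2022-02-01.
transactions_df = transactions_ds.get_dataframe(
    start_time=datetime(2022, 1, 1), end_time=datetime(2022, 2, 1)
).to_spark()
transactions_df.show()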

Example output:

| user_id | transaction_id | category | amt | is_fraud | merchant | merch_lat | merch_long | timestamp |
|---|---|---|---|---|---|---|---|---|
| user_269908169685 | c592b874a917729eb78360be126509b | grocery_pos | 111.22 | 0 | fraud_Kiehn Inc | 33.1858 | -91.5272 | 2022-01-01 01:49:15 |
| user_871233292771 | 934f4ee1c7d43b6ae1b6ce8359207e0c | personal_care | 4.58 | 0 | fraud_Hahn, Bahringer and McLaughlin | 38.5728 | -83.6879 | 2022-01-01 02:08:46 |
| user_650387977076 | d6ad03866e795a8bf9f73a38820a27e1 | grocery_pos | 52.88 | 0 | fraud_Heidenreich PLC | 36.2717 | -122.197 | 2022-01-01 02:27:18 |

get_dataframe() has two optional parameters: start_time and end_time, which are used to filter the data that is read from the source. These parameters depend on the timestamp_field (and partition_columns in the case of a Hive source) in order to do the filtering.

If you omit start_time and end_time, or the data source has no timestamp_field set, the read cannot be filtered by time, which may result in slow queries for large data sources.
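Conceptually, the start_time and end_time filter keeps only the rows whose timestamp_field value falls inside the requested window. The sketch below imitates that on a few in-memory rows using only the standard library. `filter_by_time` is a hypothetical helper, not a Tecton API, and the half-open interval (start inclusive, end exclusive) is an assumption made for the illustration, not a statement about how Tecton treats the boundaries.

```python
from datetime import datetime


def filter_by_time(rows, timestamp_field, start_time=None, end_time=None):
    """Keep rows with start_time <= row[timestamp_field] < end_time.

    Hypothetical illustration only. A bound left as None is not applied,
    so omitting both bounds returns every row (a full scan).
    """
    kept = []
    for row in rows:
        ts = row[timestamp_field]
        if start_time is not None and ts < start_time:
            continue
        if end_time is not None and ts >= end_time:
            continue
        kept.append(row)
    return kept


# Two rows whose user IDs and timestamps are taken from the example outputs.
rows = [
    {"user_id": "user_871233292771", "timestamp": datetime(2022, 1, 1, 2, 8, 46)},
    {"user_id": "user_884240387242", "timestamp": datetime(2023, 6, 20, 10, 26, 41)},
]

january_2022 = filter_by_time(
    rows, "timestamp", start_time=datetime(2022, 1, 1), end_time=datetime(2022, 2, 1)
)
print([row["user_id"] for row in january_2022])  # ['user_871233292771']
```

Without bounds the helper has to return every row, which mirrors why an unfiltered read of a large source can be slow.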

Tecton DataFrames

All Tecton methods that return a DataFrame return a "Tecton DataFrame", which can be converted to a platform-specific DataFrame type using to_pandas() or to_spark().

Define a BatchSource for the customers file

In your local feature repository, open the data_sources/customers.py file.

Then uncomment the following code and save the file:

from tecton import BatchSource, FileConfig

customers = BatchSource(
    name="customers",
    batch_config=FileConfig(
        uri="s3://tecton.ai.public/tutorials/fraud_demo/customers/data.pq",
        file_format="parquet",
        timestamp_field="signup_timestamp",
    ),
)

Register the customers data source

In your terminal, run tecton apply to register the new data source.

tecton apply

Test the customers data source

Use the get_dataframe() method again to retrieve data.

customers_ds = ws.get_data_source("customers")
customers_df = customers_ds.get_dataframe(
    start_time=datetime(2017, 1, 1), end_time=datetime(2022, 1, 1)
).to_spark()
customers_df.show()

Example output:

| ssn | cc_num | first | last | gender | street | city | state | zip | lat | long | city_pop | job | dob | user_id | signup_timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 647-66-4497 | 4979481248514730 | Amanda | Brown | F | 3071 Barnes Alley | Minneapolis | MN | 55447 | 45.0033 | -93.4875 | 1022298 | Restaurant manager, fast food | 2003-02-27 | user_709462196403 | 2017-04-06 00:50:31 |
| 156-89-3580 | 6011823734714909 | Jessica | Smith | F | 572 Jennifer Manor | Portage | MI | 49002 | 42.1938 | -85.5639 | 47338 | Publishing rights manager | 1989-07-30 | user_687958452057 | 2017-05-08 16:07:51 |
| 777-29-0872 | 213115913848502 | Anthony | Bishop | M | 890 James Orchard Suite 993 | Edgewood | IL | 62426 | 38.9021 | -88.6645 | 1085 | Technical sales engineer | 1990-07-30 | user_884240387242 | 2017-06-15 19:33:18 |
