Version: 0.6

Databricks Unity Catalog Data Sources

Prerequisites

  • Tecton SDK 0.6+
  • DBR 11+ (with the Premium plan or above)

Limitations

  • Tecton is currently compatible with the SINGLE USER Databricks cluster access mode, but not yet with the SHARED access mode.
  • For your Tecton notebook to read directly from Unity Catalog Data Sources (e.g. to run FeatureView.get_historical_features(from_source=True)), you must create your notebook cluster with the SINGLE USER access mode. This means each Databricks user needs a separate notebook cluster.

Databricks & AWS Setup

  • Assign your Databricks workspaces used by Tecton to the metastore that you plan to use.
  • Add the Databricks Service Principal used by Tecton as a user of the metastore.
  • For the S3 bucket you configured as the Tecton offline store, make sure all AWS IAM requirements here are also met and that this IAM role ARN is registered as a storage credential in Unity Catalog via Databricks Data Explorer.
  • Create an external location for this S3 bucket with the above storage credential, and grant the Databricks account used by Tecton at least the READ FILES and WRITE FILES permissions. To do this, run the following SQL commands in a notebook or in the Databricks SQL editor, backed by a Unity-enabled cluster or SQL warehouse:
    CREATE EXTERNAL LOCATION [IF NOT EXISTS] <location_name>
    URL 's3://<bucket_path>'
    WITH ([STORAGE] CREDENTIAL <storage_credential_name>)
    [COMMENT <comment_string>];
    GRANT READ FILES ON EXTERNAL LOCATION <location_name> TO <tecton_databricks_account>;
    GRANT WRITE FILES ON EXTERNAL LOCATION <location_name> TO <tecton_databricks_account>;
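
If you manage several buckets or environments, the statements above can be generated rather than typed by hand. The following is a minimal sketch, not part of Tecton or Databricks: the helper name external_location_sql and all example values are hypothetical, and it assumes the principal is quoted with backticks, as Databricks SQL expects for names that contain special characters.

```python
import re
from typing import List

# Plain, unquoted SQL identifier (letters, digits, underscores).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def external_location_sql(location_name: str, bucket_path: str,
                          storage_credential: str, principal: str) -> List[str]:
    """Build the CREATE EXTERNAL LOCATION and GRANT statements shown above.

    Hypothetical helper: it only formats strings and does not talk to Databricks.
    """
    for name in (location_name, storage_credential):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain SQL identifier: {name!r}")
    if "'" in bucket_path or "`" in principal:
        raise ValueError("bucket path or principal contains a quote character")

    create = (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {location_name}\n"
        f"URL 's3://{bucket_path}'\n"
        f"WITH (STORAGE CREDENTIAL {storage_credential})"
    )
    # One GRANT per privilege, matching the two statements in the text.
    grants = [
        f"GRANT {privilege} ON EXTERNAL LOCATION {location_name} TO `{principal}`"
        for privilege in ("READ FILES", "WRITE FILES")
    ]
    return [create] + grants


if __name__ == "__main__":
    # Example values are placeholders, not real resources.
    for statement in external_location_sql(
        "tecton_offline_store", "my-bucket/offline-store",
        "tecton_credential", "tecton-service-principal",
    ):
        print(statement + ";")
```

The printed statements can be pasted into the SQL editor, or each string can be passed to spark.sql(...) from a notebook attached to a Unity-enabled cluster.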

Configuring Tecton Data Sources & Feature Views to work with Unity

  • Please let Tecton know that you plan to use Unity Catalog, so that we can appropriately configure internal Spark clusters used by Tecton's SDK.

  • No changes are needed for Feature Views that don’t use a Unity data source.

  • Please note that changing a Feature View's data source may result in re-materialization.

  • Customers using SDK Version 0.6 can use the existing data source config HiveConfig by setting the database & table params as follows:

    from tecton import BatchSource, HiveConfig

    test_unity_batch_source = BatchSource(
        name="test_unity_batch_source",
        batch_config=HiveConfig(
            database="main.default",  # <catalog_name>.<schema_name>
            table="department",  # <table_name>
        ),
    )
  • For Feature Views that depend on a Unity data source, materialization jobs must run on DBR 11.3+ using the SINGLE USER cluster access mode. Pin the Spark version to 11.3.x-scala2.12 and set data_security_mode to SINGLE_USER via the batch_compute param in the Feature View declaration. This can be configured via DatabricksJsonClusterConfig as shown here:

    json_config = """
    {
        "new_cluster": {
            "num_workers": 0,
            "spark_version": "11.3.x-scala2.12",
            "data_security_mode": "SINGLE_USER",
            "node_type_id": "m5.large",
            "aws_attributes": {
                "ebs_volume_type": "GENERAL_PURPOSE_SSD",
                "ebs_volume_count": 1,
                "ebs_volume_size": 100,
                "first_on_demand": 0,
                "spot_bid_price_percent": 100,
                "instance_profile_arn": "arn:aws:iam::your_account_id:instance-profile/your-role",
                "availability": "SPOT",
                "zone_id": "auto"
            },
            "spark_conf": {
                "spark.databricks.service.server.enabled": "true",
                "spark.hadoop.fs.s3a.acl.default": "BucketOwnerFullControl",
                "spark.sql.sources.partitionOverwriteMode": "dynamic",
                "spark.sql.legacy.parquet.datetimeRebaseModeInRead": "CORRECTED",
                "spark.sql.legacy.parquet.int96RebaseModeInRead": "CORRECTED",
                "spark.sql.legacy.parquet.int96RebaseModeInWrite": "CORRECTED",
                "spark.master": "local[*]"
            }
        }
    }
    """

    The example Feature View is then configured as follows:

    from datetime import datetime, timedelta

    from tecton import DatabricksJsonClusterConfig, batch_feature_view

    @batch_feature_view(
        sources=[test_unity_batch_source],
        mode="spark_sql",
        entities=[entity],
        online=False,
        offline=True,
        batch_compute=DatabricksJsonClusterConfig(
            json=json_config
        ),  # only required if you're using HiveConfig to register your Unity data source
        feature_start_time=datetime(2023, 5, 1),
        batch_schedule=timedelta(days=1),
        ttl=timedelta(days=30),
        description="Test Unity FV",
    )
    def feature_view():
        return ...
  • Tecton SDK Version 0.7+ introduces a new config type, UnityConfig, that automatically configures jobs to work with Unity data sources. To use this config type, upgrade to Version 0.7+ and see the latest documentation.
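
For the SDK 0.6 HiveConfig path described above, a malformed cluster JSON string or a missing catalog prefix is easy to overlook until a materialization job fails. The sketch below shows one way to check both before applying a Feature View. It is an illustration only and not part of the Tecton SDK: the function names are hypothetical, and the rules it enforces (a <catalog_name>.<schema_name> database, DBR 11.3 or later, and data_security_mode set to SINGLE_USER) are taken from this page.

```python
import json
import re
from typing import List


def check_unity_database(database: str) -> List[str]:
    """Return problems with a HiveConfig `database` value for a Unity source."""
    parts = database.split(".")
    if len(parts) != 2 or not all(parts):
        return [f"database {database!r} is not of the form <catalog_name>.<schema_name>"]
    return []


def check_unity_cluster_json(json_config: str) -> List[str]:
    """Return problems with a cluster JSON string used for batch_compute."""
    try:
        config = json.loads(json_config)
    except json.JSONDecodeError as exc:
        # Catches, for example, trailing commas, which JSON does not allow.
        return [f"cluster config is not valid JSON: {exc}"]
    if not isinstance(config, dict) or not isinstance(config.get("new_cluster"), dict):
        return ["cluster config has no 'new_cluster' object"]

    cluster = config["new_cluster"]
    problems = []
    if cluster.get("data_security_mode") != "SINGLE_USER":
        problems.append("data_security_mode must be SINGLE_USER")

    # Expect versions shaped like "11.3.x-scala2.12"; compare major.minor only.
    match = re.match(r"(\d+)\.(\d+)", str(cluster.get("spark_version", "")))
    if match is None:
        problems.append("spark_version is missing or not in <major>.<minor>... form")
    elif (int(match.group(1)), int(match.group(2))) < (11, 3):
        problems.append("spark_version must be DBR 11.3 or later")
    return problems


if __name__ == "__main__":
    sample = '{"new_cluster": {"spark_version": "11.3.x-scala2.12", "data_security_mode": "SINGLE_USER"}}'
    for problem in check_unity_database("main.default") + check_unity_cluster_json(sample):
        print("problem:", problem)
```

Both functions return an empty list when nothing is wrong, so they can be called next to the json_config and BatchSource definitions in a feature repository, or from a unit test.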
