Version: 0.5

Connecting EMR Notebooks

You can use the Tecton SDK in an EMR notebook to explore feature values and create training datasets. The following guide covers how to configure your EMR cluster for use with Tecton. If you haven't already completed your deployment of Tecton with EMR, please see the guide for Configuring EMR.

Supported EMR versions for notebooks​

Tecton supports EMR 6.4 and 6.5 with notebooks. Ensure that your EMR cluster is configured with one of these versions.

Prerequisites​

To set up Tecton with an interactive EMR cluster, you need the following:

  • An AWS account with an IAM role that has access to your data

  • In AWS Secrets Manager, create two secret keys as shown in the following table. <prefix> and <deployment name> are defined below the table.

    Key name: <prefix>/API_SERVICE
    Key value: https://<deployment name>.tecton.ai/api

    Key name: <prefix>/TECTON_API_KEY
    Key value: A Tecton API key, generated by running the CLI command tecton api-key create --description "<description>". For example: tecton api-key create --description "A Tecton key for the EMR notebook cluster"

    <prefix> is:

    • <deployment name>, if your deployment name begins with tecton
    • tecton-<deployment name>, otherwise

    <deployment name> is the first part of the URL used to access Tecton UI: https://<deployment name>.tecton.ai

  • You will need to set the following environment variables:

    • TECTON_CLUSTER_NAME: <deployment name>
    • CLUSTER_REGION: The name of the AWS region the Tecton cluster is deployed in.

    These can be set by configuring environment variables on EMR when creating your cluster, using a configuration like the following:

    [
      {
        "Classification": "spark-env",
        "Properties": {},
        "Configurations": [
          {
            "Classification": "export",
            "Properties": {
              "CLUSTER_REGION": "us-west-2",
              "TECTON_CLUSTER_NAME": "<deployment name>"
            }
          }
        ]
      },
      {
        "Classification": "livy-env",
        "Properties": {},
        "Configurations": [
          {
            "Classification": "export",
            "Properties": {
              "CLUSTER_REGION": "us-west-2",
              "TECTON_CLUSTER_NAME": "<deployment name>"
            }
          }
        ]
      },
      {
        "Classification": "yarn-env",
        "Properties": {},
        "Configurations": [
          {
            "Classification": "export",
            "Properties": {
              "CLUSTER_REGION": "us-west-2",
              "TECTON_CLUSTER_NAME": "<deployment name>"
            }
          }
        ]
      },
      {
        "Classification": "spark-defaults",
        "Properties": {
          "spark.yarn.appMasterEnv.CLUSTER_REGION": "us-west-2",
          "spark.yarn.appMasterEnv.TECTON_CLUSTER_NAME": "<deployment name>"
        }
      }
    ]
  • Depending on the data sources you use, you may also need to create the following optional secrets:

    • tecton-<deployment name>/REDSHIFT_USER
    • tecton-<deployment name>/REDSHIFT_PASSWORD
    • tecton-<deployment name>/SNOWFLAKE_USER
    • tecton-<deployment name>/SNOWFLAKE_PASSWORD
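As an illustrative sketch (not part of the Tecton tooling), the prefix rule and secret creation above can be expressed with boto3; secret_prefix and create_tecton_secrets are hypothetical helper names:

```python
def secret_prefix(deployment_name: str) -> str:
    """Return the Secrets Manager prefix per the rules above:
    the deployment name itself if it begins with "tecton",
    otherwise "tecton-" prepended to it."""
    if deployment_name.startswith("tecton"):
        return deployment_name
    return "tecton-" + deployment_name


def create_tecton_secrets(deployment_name: str, api_key: str) -> None:
    """Sketch only: create the two secrets from the table above.
    Assumes AWS credentials with Secrets Manager write access."""
    import boto3  # local import so secret_prefix works without boto3 installed

    prefix = secret_prefix(deployment_name)
    sm = boto3.client("secretsmanager")
    sm.create_secret(
        Name=f"{prefix}/API_SERVICE",
        SecretString=f"https://{deployment_name}.tecton.ai/api",
    )
    sm.create_secret(Name=f"{prefix}/TECTON_API_KEY", SecretString=api_key)
```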
note

Terminated notebook clusters can be cloned. This is often the easiest way to recreate a cluster.

Tecton creates an EMR cluster intended for notebook use. It is usually named yourco-notebook-cluster and already has the required configuration applied. It can be cloned as needed for notebook users.

To set up a new interactive EMR cluster from scratch, follow these steps.

Install the Tecton SDK​

  1. Create a new EMR cluster
    • Specify your IAM role as the instance profile.
    • Use emr-6.5.0 with Spark 3.1.2, Hive 3.1.2, and Livy 0.7.1.
    • We recommend m5.xlarge EC2 nodes.
  2. Add the following script as a custom bootstrap action to install the Tecton SDK and dependencies.
    • s3://tecton.ai.public/install_scripts/setup_emr_notebook_cluster_v2.sh
    • We recommend passing the Tecton SDK version number as an argument, for example 0.3.2, 0.4.0b12, or 0.3.* to pin the latest patch for a minor version.
  3. If using Kafka, also add this custom bootstrap action to copy the Kafka credentials from S3.
    • s3://tecton.ai.public/install_scripts/setup_emr_notebook_cluster_copy_kafka_credentials.sh
    • The script requires the S3 bucket as an argument, e.g. s3://bucket. Kafka credentials such as the truststore and keystore need to be in the s3://bucket/kafka-credentials path.
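If you create the cluster programmatically, the two bootstrap actions above can be expressed as the BootstrapActions argument to boto3's EMR run_job_flow call. This is a sketch; tecton_bootstrap_actions is a hypothetical helper name, and all other run_job_flow arguments are omitted:

```python
def tecton_bootstrap_actions(sdk_version, kafka_bucket=None):
    """Build the BootstrapActions config for a Tecton notebook cluster.

    sdk_version pins the Tecton SDK, e.g. "0.5.7" or "0.5.*" for the
    latest patch of a minor version. kafka_bucket, when given, adds the
    optional Kafka-credentials bootstrap action.
    """
    actions = [
        {
            "Name": "install-tecton-sdk",
            "ScriptBootstrapAction": {
                "Path": "s3://tecton.ai.public/install_scripts/setup_emr_notebook_cluster_v2.sh",
                "Args": [sdk_version],
            },
        }
    ]
    if kafka_bucket:  # only needed when Kafka credentials live in S3
        actions.append(
            {
                "Name": "copy-kafka-credentials",
                "ScriptBootstrapAction": {
                    "Path": "s3://tecton.ai.public/install_scripts/setup_emr_notebook_cluster_copy_kafka_credentials.sh",
                    "Args": [kafka_bucket],
                },
            }
        )
    return actions
```

The returned list can be passed as BootstrapActions=... alongside the release label, instance configuration, and instance profile described above.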

Optional permissions for cross-account access​

Additionally, if your EMR cluster is in a different AWS account, you must configure access to all of the S3 buckets Tecton uses (these are in the data plane account and are prefixed with tecton-; note that this includes the bucket you created), as well as access to the underlying data sources Tecton reads, in order to have full functionality.

Configure the notebook​

EMR notebooks that interact with Tecton should use the PySpark kernel. Note that the AWS service role used to create the notebook must have permission to access public S3 buckets in order to install the required Tecton JARs. We recommend that all EMR notebooks run the following configuration as the first cell executed in the notebook.

In the following code block, substitute {tecton_version} with the desired Tecton SDK version, e.g. 0.4.0, or 0.4.* to pin the latest patch for a minor version.

note

If your notebook cluster is pinned to a specific Tecton SDK version, substitute {tecton_version} in s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar (located in the code block below) with the following:

  • For a version without * such as 0.5.7: s3://tecton.ai.public/pip-repository/itorgation/tecton/0.5.7/tecton-udfs-spark-3.jar
  • For a version with * such as 0.5.*: s3://tecton.ai.public/pip-repository/itorgation/tecton/'0.5.*'/tecton-udfs-spark-3.jar (Note the single quotes around *)

Alternatively, if your notebook cluster is not pinned to a specific Tecton SDK version, use s3://tecton.ai.public/pip-repository/itorgation/tecton/tecton-udfs-spark-3.jar to have the notebook use the latest beta version of the Tecton SDK.

%%configure -f
{
  "conf": {
    "spark.pyspark.python": "python3.7",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.jars": "s3://tecton.ai.public/jars/delta-core_2.12-1.0.1.jar,s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar"
  }
}
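The substitution rules in the note above can be sketched as a small helper; tecton_jar_url is a hypothetical name, not part of the Tecton SDK:

```python
def tecton_jar_url(version=None):
    """Build the S3 path of the Tecton Spark UDF jar.

    No version -> the unpinned jar (latest beta). A wildcard version
    like "0.5.*" must be wrapped in single quotes in the path, per the
    note above; exact versions like "0.5.7" are used as-is.
    """
    base = "s3://tecton.ai.public/pip-repository/itorgation/tecton"
    if version is None:
        return f"{base}/tecton-udfs-spark-3.jar"
    if "*" in version:
        version = f"'{version}'"  # quote wildcards, e.g. '0.5.*'
    return f"{base}/{version}/tecton-udfs-spark-3.jar"
```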

Other configuration can be added as required when connecting to specific data sources or using specific features. These specific configurations are listed below.

Add your API key to your Tecton workspace​

Follow these steps in the Tecton Web UI:

  1. Locate your workspace by selecting it from the drop-down list at the top.
  2. On the left navigation bar, select Permissions.
  3. Select the Service Accounts tab.
  4. Click Add service account to ...
  5. In the dialog box that appears, search for the service account by typing the --description value from the tecton api-key create command that you ran previously.
  6. When the workspace name appears, click Select on the right.
  7. Select a role. You can select any of these roles: Owner, Editor, or Consumer.
  8. Click Confirm.

Verify the connection​

Create a notebook connected to the cluster, making sure to select the PySpark kernel. Run the following in the notebook. If successful, you should see a list of workspaces, including the "prod" workspace.

import tecton

tecton.list_workspaces()

Updating EMR versions​

Updating from 6.4 to 6.5​

  1. Select your existing Tecton notebook cluster on the EMR clusters tab and click Clone.
  2. On the "Software and Steps" page, change the EMR version dropdown to 6.5.
  3. If your previous cluster was using the log4j mitigation bootstrap script, then on the "General Cluster Settings" page update the bootstrap action to use the script corresponding to EMR version 6.5.
  4. On the last page, click Create cluster.

Additional jars and libraries​

Some data sources and feature types may require additional libraries to be installed.

Data sources​

For data sources, run the following in your notebook's first cell, i.e. the %%configure cell, before running any other commands. If you need to install libraries for multiple data sources (such as Snowflake and Kinesis), you can append the spark.jars and/or spark.jars.packages lines from the relevant examples below into one %%configure cell.
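The appending described above amounts to merging the "conf" dictionaries and comma-joining their jar lists. A sketch, with the hypothetical helper name merge_configure_confs:

```python
def merge_configure_confs(*confs):
    """Combine several %%configure "conf" dicts into one.

    spark.jars and spark.jars.packages values are comma-joined, since
    Spark expects comma-separated lists; every other key keeps the
    value from the last dict that sets it.
    """
    merged = {}
    for conf in confs:
        for key, value in conf.items():
            if key in ("spark.jars", "spark.jars.packages") and key in merged:
                merged[key] = merged[key] + "," + value
            else:
                merged[key] = value
    return merged
```

The result can be pasted as the "conf" object of a single %%configure cell.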

Redshift​

In the following code block, substitute {tecton_version} with the desired Tecton SDK version, e.g. 0.4.0, or 0.4.* to pin the latest patch for a minor version.

note

If your notebook cluster is pinned to a specific Tecton SDK version, substitute {tecton_version} in s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar (located in the code block below) with the following:

  • For a version without * such as 0.5.7: s3://tecton.ai.public/pip-repository/itorgation/tecton/0.5.7/tecton-udfs-spark-3.jar
  • For a version with * such as 0.5.*: s3://tecton.ai.public/pip-repository/itorgation/tecton/'0.5.*'/tecton-udfs-spark-3.jar (Note the single quotes around *)

Alternatively, if your notebook cluster is not pinned to a specific Tecton SDK version, use s3://tecton.ai.public/pip-repository/itorgation/tecton/tecton-udfs-spark-3.jar to have the notebook use the latest beta version of the Tecton SDK.

%%configure -f
{
  "conf": {
    "spark.pyspark.python": "python3.7",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.jars": "s3://tecton.ai.public/jars/delta-core_2.12-1.0.1.jar,s3://tecton.ai.public/jars/spark-redshift_2.12-5.0.3.jar,s3://tecton.ai.public/jars/minimal-json-0.9.5.jar,s3://tecton.ai.public/jars/spark-avro_2.12-3.0.0.jar,s3://tecton.ai.public/jars/redshift-jdbc42-nosdk-2.1.0.1.jar,s3://tecton.ai.public/jars/postgresql-9.4.1212.jar,s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar"
  }
}

Kinesis​

In the following code block, substitute {tecton_version} with the desired Tecton SDK version, e.g. 0.4.0, or 0.4.* to pin the latest patch for a minor version.

note

If your notebook cluster is pinned to a specific Tecton SDK version, substitute {tecton_version} in s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar (located in the code block below) with the following:

  • For a version without * such as 0.5.7: s3://tecton.ai.public/pip-repository/itorgation/tecton/0.5.7/tecton-udfs-spark-3.jar
  • For a version with * such as 0.5.*: s3://tecton.ai.public/pip-repository/itorgation/tecton/'0.5.*'/tecton-udfs-spark-3.jar (Note the single quotes around *)

Alternatively, if your notebook cluster is not pinned to a specific Tecton SDK version, use s3://tecton.ai.public/pip-repository/itorgation/tecton/tecton-udfs-spark-3.jar to have the notebook use the latest beta version of the Tecton SDK.

%%configure -f
{
  "conf": {
    "spark.pyspark.python": "python3.7",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.jars.packages": "com.qubole.spark:spark-sql-kinesis_2.12:1.2.0_spark-3.0",
    "spark.jars": "s3://tecton.ai.public/jars/delta-core_2.12-1.0.1.jar,s3://tecton.ai.public/jars/spark-sql-kinesis_2.12-1.2.0_spark-3.0.jar,s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar"
  }
}

Snowflake​

In the following code block, substitute {tecton_version} with the desired Tecton SDK version, e.g. 0.4.0, or 0.4.* to pin the latest patch for a minor version.

note

If your notebook cluster is pinned to a specific Tecton SDK version, substitute {tecton_version} in s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar (located in the code block below) with the following:

  • For a version without * such as 0.5.7: s3://tecton.ai.public/pip-repository/itorgation/tecton/0.5.7/tecton-udfs-spark-3.jar
  • For a version with * such as 0.5.*: s3://tecton.ai.public/pip-repository/itorgation/tecton/'0.5.*'/tecton-udfs-spark-3.jar (Note the single quotes around *)

Alternatively, if your notebook cluster is not pinned to a specific Tecton SDK version, use s3://tecton.ai.public/pip-repository/itorgation/tecton/tecton-udfs-spark-3.jar to have the notebook use the latest beta version of the Tecton SDK.

%%configure -f
{
  "conf": {
    "spark.pyspark.python": "python3.7",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.jars.packages": "net.snowflake:spark-snowflake_2.12:2.9.1-spark_3.0",
    "spark.jars": "s3://tecton.ai.public/jars/delta-core_2.12-1.0.1.jar,s3://tecton.ai.public/jars/snowflake-jdbc-3.13.6.jar,s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar"
  }
}
info

Make sure that Tecton's Snowflake username and password have access to the warehouse specified in your data sources. Otherwise, you'll get an exception like:

net.snowflake.client.jdbc.SnowflakeSQLException: No active warehouse selected in the current session. Select an active warehouse with the 'use warehouse' command.

Kafka​

note

If your notebook cluster is pinned to a specific Tecton SDK version, substitute {tecton_version} in s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar (located in the code block below) with the following:

  • For a version without * such as 0.5.7: s3://tecton.ai.public/pip-repository/itorgation/tecton/0.5.7/tecton-udfs-spark-3.jar
  • For a version with * such as 0.5.*: s3://tecton.ai.public/pip-repository/itorgation/tecton/'0.5.*'/tecton-udfs-spark-3.jar (Note the single quotes around *)

Alternatively, if your notebook cluster is not pinned to a specific Tecton SDK version, use s3://tecton.ai.public/pip-repository/itorgation/tecton/tecton-udfs-spark-3.jar to have the notebook use the latest beta version of the Tecton SDK.

%%configure -f
{
  "conf": {
    "spark.pyspark.python": "python3.7",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1",
    "spark.jars": "s3://tecton.ai.public/jars/delta-core_2.12-1.0.1.jar,s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar"
  }
}

Data formats​

Avro​

Tecton uses Avro format for Feature Logging datasets.

note

If your notebook cluster is pinned to a specific Tecton SDK version, substitute {tecton_version} in s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar (located in the code block below) with the following:

  • For a version without * such as 0.5.7: s3://tecton.ai.public/pip-repository/itorgation/tecton/0.5.7/tecton-udfs-spark-3.jar
  • For a version with * such as 0.5.*: s3://tecton.ai.public/pip-repository/itorgation/tecton/'0.5.*'/tecton-udfs-spark-3.jar (Note the single quotes around *)

Alternatively, if your notebook cluster is not pinned to a specific Tecton SDK version, use s3://tecton.ai.public/pip-repository/itorgation/tecton/tecton-udfs-spark-3.jar to have the notebook use the latest beta version of the Tecton SDK.

%%configure -f
{
  "conf": {
    "spark.pyspark.python": "python3.7",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.jars.packages": "xerces:xercesImpl:2.8.0",
    "spark.jars": "s3://tecton.ai.public/jars/delta-core_2.12-1.0.1.jar,s3://tecton.ai.public/jars/spark-avro_2.12-3.0.0.jar,s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar"
  }
}

Additional python libraries​

To install libraries from the Python Package Index (PyPI), you can run a command like this at any time after running the initial %%configure command:

sc.install_pypi_package("pandas==1.1.5")

Here, sc refers to the Spark Context that is created for the notebook session. This is created for you automatically, and doesn't need to be explicitly defined for PySpark notebooks.

Additional resources​

Amazon EMR Notebooks Documentation

Install Python libraries on a running cluster with EMR Notebooks
