Connecting EMR Notebooks
You can use the Tecton SDK in an EMR notebook to explore feature values and create training datasets. The following guide covers how to configure your EMR cluster for use with Tecton. If you haven't already completed your deployment of Tecton with EMR, please see the guide for Configuring EMR.
Terminated notebook clusters can be cloned to create new notebook clusters. Cloning a previous notebook cluster is often the easiest way to recreate a cluster. Otherwise, follow the instructions below to create a notebook cluster from scratch.
Tecton creates an EMR cluster intended for notebook use. It is usually named `yourco-notebook-cluster` and already has the required configuration applied. It can be cloned as needed for notebook users.
To set up a new interactive EMR cluster from scratch, follow the instructions below.
Supported EMR versions for notebooks​
Tecton supports using the Tecton SDK with the following EMR versions:
Prerequisites​
To set up Tecton with an interactive EMR cluster, you need the following:
- An AWS account with an IAM role that has access to your data
- A Tecton API key, obtained by creating a Service Account
Creating a Service Account​
tecton service-account create \
--name "notebook-service-account" \
--description "The Service Account for our EMR notebooks"
Output:
Save this API Key - you will not be able to get it again.
API Key: <Your-api-key>
Service Account ID: <Your-Service-Account-Id>
Add your API key to your Tecton workspace​
In order to access objects from a given Tecton workspace, the Service Account used by your notebook needs to have at least the Viewer role. You may want to grant the Consumer role to enable testing Online Feature Retrieval.
To give Consumer access for all workspaces to your service account, run the following command:
tecton access-control assign-role --role consumer \
--service-account <Your-Service-Account-Id>
To give Consumer access for a specific workspace for your Service Account, run the following command:
tecton access-control assign-role --role consumer \
--workspace <Your-workspace> \
--service-account <Your-Service-Account-Id>
Output:
Successfully updated role.
Alternatively, follow these steps in the Tecton Web UI:
- Locate your workspace by selecting it from the drop down list at the top.
- On the left navigation bar, select Permissions.
- Select the Service Accounts tab.
- Click Add service account to ...
- In the dialog box that appears, search for the Service Account name.
- When the service account name appears, click Select on the right.
- Select a role. You can select any of these roles: Owner, Editor, Consumer, or Viewer.
- Click Confirm.
For more on Service Accounts and Access Control, see the access control documentation.
Managing Tecton credentials​
Option 1: Using AWS Secrets Manager​
In AWS Secrets Manager, create two secret keys as shown in the following table. `<prefix>` and `<deployment name>` are defined below the table.

| Key name | Key value |
| --- | --- |
| `<prefix>/API_SERVICE` | `https://<deployment name>.tecton.ai/api` |
| `<prefix>/TECTON_API_KEY` | The `<Tecton API key>` generated with the `tecton service-account` command above |

`<prefix>` is:

- `<deployment name>`, if your deployment name begins with `tecton`
- `tecton-<deployment name>`, otherwise

`<deployment name>` is the first part of the URL used to access the Tecton UI: `https://<deployment name>.tecton.ai`
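As a sketch, the prefix rule can be expressed in code (the deployment names below are hypothetical examples):

```python
def secret_prefix(deployment_name: str) -> str:
    """Derive the AWS Secrets Manager key prefix from a Tecton deployment name.

    The prefix is the deployment name itself if it already begins with
    "tecton"; otherwise "tecton-" is prepended.
    """
    if deployment_name.startswith("tecton"):
        return deployment_name
    return f"tecton-{deployment_name}"


# Hypothetical deployment names for illustration:
print(secret_prefix("acme"))         # -> tecton-acme
print(secret_prefix("tecton-acme"))  # -> tecton-acme
print(f"{secret_prefix('acme')}/API_SERVICE")  # -> tecton-acme/API_SERVICE
```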
Depending on the data sources used, the following additional credentials may also need to be set up:

- `tecton-<deployment name>/REDSHIFT_USER`
- `tecton-<deployment name>/REDSHIFT_PASSWORD`
- `tecton-<deployment name>/SNOWFLAKE_USER`
- `tecton-<deployment name>/SNOWFLAKE_PASSWORD`
Option 2: Using notebook-scoped credentials​
Instead of configuring the `API_SERVICE` and `TECTON_API_KEY` secrets for use in all of your notebooks, Tecton SDK credentials can also be configured within the scope of a Python session using the `tecton.set_credentials()` method.
Using the API key created earlier, run the following in your notebook:
import tecton
tecton.set_credentials(tecton_api_key="<token>", tecton_url="https://<deployment name>.tecton.ai/api")
Credentials configured using `tecton.set_credentials()` are scoped to the notebook session and must be reconfigured whenever the notebook is restarted or its state is cleared. To have SDK credentials read from the environment instead, store the API key in AWS Secrets Manager.
Optional permissions for cross-account access​
Additionally, if your EMR cluster is in a different AWS account, you must configure cross-account access so the cluster can read all of the S3 buckets Tecton uses (these are in the data plane account and are prefixed with `tecton-`; note that this includes the bucket you created), as well as the underlying data sources Tecton reads, in order to have full functionality.
Setting up a Notebook EMR Cluster​
- Create a new EMR cluster.
  - Specify your IAM role as the instance profile.
  - Select release `emr-6.x.x`.
  - Select the following applications: Spark 3.x.x, Hive 3.x.x, Livy 0.7.1, Hadoop 3.x.x, JupyterEnterpriseGateway 2.x.x.
  - We recommend starting with `m5.xlarge` EC2 nodes.
- Add the following Bootstrap actions scripts:
  - Install the Tecton SDK: `s3://tecton.ai.public/install_scripts/setup_emr_notebook_cluster_v2.sh`. We recommend passing the Tecton SDK version number as an argument, for example `0.6.2`, `0.6.0b12`, or `0.6.*` to pin the latest patch for a minor version.
  - Install additional Python libraries: `s3://tecton.ai.public/install_scripts/install_python_libraries_from_pypi.sh`
    - `pyarrow==5.0.0` (required for feature views using Pandas UDFs)
    - `virtualenv` (required for EMR 6.7 and above)
    - any additional libraries needed for your development environment
  - For EMR 6.5 and below, patch the log4j vulnerability: check for the corresponding log4j mitigation bootstrap script.
  - (Optional) If using Kafka, copy the Kafka credentials from S3: `s3://tecton.ai.public/install_scripts/setup_emr_notebook_cluster_copy_kafka_credentials.sh`. The script requires the S3 bucket as an argument, e.g. `s3://bucket`. Kafka credentials such as the truststore and keystore must be in the `s3://bucket/kafka-credentials` path.
- Add `TECTON_CLUSTER_NAME` and `CLUSTER_REGION` environment variables to the cluster Configurations as shown below:
[
{
"Classification": "spark-env",
"Properties": {},
"Configurations": [
{
"Classification": "export",
"Properties": {
"CLUSTER_REGION": "<AWS region>",
"TECTON_CLUSTER_NAME": "<deployment name>"
}
}
]
},
{
"Classification": "livy-env",
"Properties": {},
"Configurations": [
{
"Classification": "export",
"Properties": {
"CLUSTER_REGION": "<AWS region>",
"TECTON_CLUSTER_NAME": "<deployment name>"
}
}
]
},
{
"Classification": "yarn-env",
"Properties": {},
"Configurations": [
{
"Classification": "export",
"Properties": {
"CLUSTER_REGION": "<AWS region>",
"TECTON_CLUSTER_NAME": "<deployment name>"
}
}
]
},
{
"classification": "spark-defaults",
"properties": {
"spark.yarn.appMasterEnv.CLUSTER_REGION": "<AWS region>",
"spark.yarn.appMasterEnv.TECTON_CLUSTER_NAME": "<deployment name>"
}
}
]
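Since all four classifications export the same two values, the Configurations list can also be generated programmatically. A minimal sketch (the region `us-west-2` and deployment name `acme` are placeholder assumptions):

```python
import json


def tecton_emr_configurations(region: str, deployment_name: str) -> list:
    """Build the EMR Configurations list that exports CLUSTER_REGION and
    TECTON_CLUSTER_NAME to the spark, livy, and yarn environments."""
    env = {"CLUSTER_REGION": region, "TECTON_CLUSTER_NAME": deployment_name}
    configs = [
        {
            "Classification": classification,
            "Properties": {},
            "Configurations": [
                {"Classification": "export", "Properties": dict(env)}
            ],
        }
        for classification in ("spark-env", "livy-env", "yarn-env")
    ]
    # spark-defaults passes the same values through the YARN app master env.
    configs.append(
        {
            "Classification": "spark-defaults",
            "Properties": {f"spark.yarn.appMasterEnv.{k}": v for k, v in env.items()},
        }
    )
    return configs


print(json.dumps(tecton_emr_configurations("us-west-2", "acme"), indent=2))
```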
Example aws cli command for cluster creation
aws emr create-cluster \
--name "tecton-<deployment name>-notebook-cluster" \
--log-uri "s3n://<redacted>" \
--release-label "emr-6.9.0" \
--service-role "arn:aws:iam::<redacted>:role/tecton-<deployment name>-emr-master-role" \
--ec2-attributes '{"InstanceProfile":"tecton-<deployment name>-emr-spark-role","EmrManagedMasterSecurityGroup":"<redacted>","EmrManagedSlaveSecurityGroup": "<redacted>","ServiceAccessSecurityGroup": "<redacted>","SubnetId":"<redacted>"}' \
--applications Name=Hadoop Name=Hive Name=JupyterEnterpriseGateway Name=Livy Name=Spark \
--instance-fleets '[{"Name":"","InstanceFleetType":"MASTER","TargetSpotCapacity":0,"TargetOnDemandCapacity":1,"LaunchSpecifications":{},"InstanceTypeConfigs":[{"WeightedCapacity":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32}},{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32}}]},"BidPriceAsPercentageOfOnDemandPrice":100,"InstanceType":"m5.xlarge"}]}]' \
--bootstrap-actions '[{"Args":[],"Name":"tecton_sdk_setup","Path":"s3://tecton.ai.public/install_scripts/setup_emr_notebook_cluster_v2.sh"},{"Args":["pyarrow==5.0.0","virtualenv"],"Name":"additional_dependencies","Path":"s3://tecton.ai.public/install_scripts/install_python_libraries_from_pypi.sh"}]' \
--scale-down-behavior "TERMINATE_AT_TASK_COMPLETION" \
--auto-termination-policy '{"IdleTimeout":3600}' \
--region <aws region>
Configure the notebook​
EMR notebooks that interact with Tecton must use the PySpark kernel.

Note that the AWS Service Role you create the notebook with must have permission to access public S3 buckets in order to install the required Tecton JARs. We recommend that all EMR notebooks use the following configuration as the first cell executed in the notebook.
In the following code block, substitute `{tecton_version}` with the desired Tecton SDK version, e.g. `0.6.0`, or `0.6.*` to pin the latest patch for a minor version.

If your notebook cluster is pinned to a specific Tecton SDK version, substitute `{tecton_version}` in `s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar` (located in the code block below) as follows:

- For a version without `*`, such as `0.6.7`: `s3://tecton.ai.public/pip-repository/itorgation/tecton/0.6.7/tecton-udfs-spark-3.jar`
- For a version with `*`, such as `0.6.*`: `s3://tecton.ai.public/pip-repository/itorgation/tecton/'0.6.*'/tecton-udfs-spark-3.jar` (note the single quotes around `*`)

Alternatively, if your notebook cluster is not pinned to a specific Tecton SDK version, use `s3://tecton.ai.public/pip-repository/itorgation/tecton/tecton-udfs-spark-3.jar` to have the notebook use the latest beta version of the Tecton SDK.
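As a sketch, the substitution rule above can be captured in a small helper (the version strings are examples, and the helper itself is not part of the Tecton SDK):

```python
from typing import Optional


def tecton_jar_path(tecton_version: Optional[str] = None) -> str:
    """Build the S3 path for the Tecton Spark UDF JAR.

    Wildcard versions (e.g. "0.6.*") must be wrapped in single quotes;
    omitting the version selects the latest beta JAR.
    """
    base = "s3://tecton.ai.public/pip-repository/itorgation/tecton"
    if tecton_version is None:
        return f"{base}/tecton-udfs-spark-3.jar"
    if "*" in tecton_version:
        tecton_version = f"'{tecton_version}'"
    return f"{base}/{tecton_version}/tecton-udfs-spark-3.jar"


print(tecton_jar_path("0.6.7"))
print(tecton_jar_path("0.6.*"))
print(tecton_jar_path())
```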
For EMR 6.5:
%%configure -f
{
"conf": {
"spark.pyspark.python": "python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type":"native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
"spark.jars": "s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar"
}
}
For EMR 6.7 and above:
%%configure -f
{
"conf": {
"spark.pyspark.python": "python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path": "/usr/local/bin/virtualenv",
"spark.yarn.appMasterEnv.CLUSTER_REGION": "<region>",
"spark.yarn.appMasterEnv.TECTON_CLUSTER_NAME": "<deployment name>",
"spark.jars": "s3://tecton.ai.public/pip-repository/itorgation/tecton/{tecton_version}/tecton-udfs-spark-3.jar"
}
}
Other configuration can be added as required when connecting to specific data sources or using specific features. These specific configurations are listed below.
Verify the connection​
Create a notebook connected to the cluster, making sure to select the PySpark kernel. Run the following in the notebook. If successful, you should see a list of workspaces.
import tecton
tecton.test_credentials()
Additional jars and libraries​
Some data sources and feature types may require additional libraries to be installed.
Data sources​
For data sources, run the following in your notebook's first cell, i.e. the `%%configure` cell, before running any other commands. If you need to install libraries for multiple data sources (such as Snowflake and Kinesis), you can append the `spark.jars` and/or `spark.jars.packages` lines from the data source examples below into one `%%configure` cell.
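A minimal sketch of that merging rule, assuming each data source contributes a dict of Spark conf entries (the jar paths are abbreviated examples from the sections below):

```python
def merge_spark_conf(*confs: dict) -> dict:
    """Merge %%configure conf dicts, comma-joining spark.jars and
    spark.jars.packages instead of overwriting them."""
    merged: dict = {}
    for conf in confs:
        for key, value in conf.items():
            if key in ("spark.jars", "spark.jars.packages") and key in merged:
                merged[key] = f"{merged[key]},{value}"
            else:
                merged[key] = value
    return merged


snowflake = {"spark.jars": "s3://tecton.ai.public/jars/snowflake-jdbc-3.13.6.jar"}
kinesis = {"spark.jars": "s3://tecton.ai.public/jars/spark-sql-kinesis_2.12-1.2.0_spark-3.0.jar"}
print(merge_spark_conf(snowflake, kinesis)["spark.jars"])
```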
Delta​
A feature view may be configured to use a Delta offline store. In this case, the following JARs must be added to the `spark.jars` (or `spark.jars.packages`) configuration:
For EMR 6.5:
%%configure -f
{
"conf": {
"spark.pyspark.python": "python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type":"native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
"spark.jars": "s3://tecton.ai.public/jars/delta-core_2.12-1.0.1.jar,.."
}
}
For EMR 6.7:
%%configure -f
{
"conf": {
"spark.pyspark.python": "python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path": "/usr/local/bin/virtualenv",
"spark.yarn.appMasterEnv.CLUSTER_REGION": "<region>",
"spark.yarn.appMasterEnv.TECTON_CLUSTER_NAME": "<deployment name>",
"spark.jars.packages": "io.delta:delta-core_2.12:1.2.1,io.delta:delta-storage-s3-dynamodb:1.2.1,..",
}
}
For EMR 6.9:
%%configure -f
{
"conf": {
"spark.pyspark.python": "python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path": "/usr/local/bin/virtualenv",
"spark.yarn.appMasterEnv.CLUSTER_REGION": "<region>",
"spark.yarn.appMasterEnv.TECTON_CLUSTER_NAME": "<deployment name>",
"spark.jars.packages": "io.delta:delta-core_2.12:2.1.1,io.delta:delta-storage-s3-dynamodb:2.1.1,..",
}
}
Redshift​
For EMR 6.5:
%%configure -f
{
"conf": {
"spark.pyspark.python": "python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type":"native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
"spark.jars": "s3://tecton.ai.public/jars/spark-redshift_2.12-5.0.3.jar,s3://tecton.ai.public/jars/redshift-jdbc42-nosdk-2.1.0.1.jar,s3://tecton.ai.public/jars/minimal-json-0.9.5.jar,s3://tecton.ai.public/jars/spark-avro_2.12-3.0.0.jar,s3://tecton.ai.public/jars/postgresql-9.4.1212.jar,.."
}
}
For EMR 6.7 and above:
%%configure -f
{
"conf": {
"spark.pyspark.python": "python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path": "/usr/local/bin/virtualenv",
"spark.yarn.appMasterEnv.CLUSTER_REGION": "<region>",
"spark.yarn.appMasterEnv.TECTON_CLUSTER_NAME": "<deployment name>",
"spark.jars": "s3://tecton.ai.public/jars/spark-redshift_2.12-5.1.0.jar,s3://tecton.ai.public/jars/redshift-jdbc42-nosdk-2.1.0.14.jar,s3://tecton.ai.public/jars/minimal-json-0.9.5.jar,s3://tecton.ai.public/jars/spark-avro_2.12-3.0.0.jar,s3://tecton.ai.public/jars/redshift-jdbc42-nosdk-2.1.0.1.jar,s3://tecton.ai.public/jars/postgresql-9.4.1212.jar,.."
}
}
Kinesis​
%%configure -f
{
"conf": {
...
"spark.jars.packages": "com.qubole.spark:spark-sql-kinesis_2.12:1.2.0_spark-3.0",
"spark.jars": "s3://tecton.ai.public/jars/delta-core_2.12-1.0.1.jar,s3://tecton.ai.public/jars/spark-sql-kinesis_2.12-1.2.0_spark-3.0.jar,.."
}
}
Snowflake​
%%configure -f
{
"conf": {
...
"spark.jars.packages": "net.snowflake:spark-snowflake_2.12:2.9.1-spark_3.0",
"spark.jars": "s3://tecton.ai.public/jars/snowflake-jdbc-3.13.6.jar,.."
}
}
Make sure that Tecton's Snowflake username/password has access to the warehouse specified in your data sources. Otherwise, you'll get an exception like:
net.snowflake.client.jdbc.SnowflakeSQLException: No active warehouse selected in the current session. Select an active warehouse with the 'use warehouse' command.
Kafka​
%%configure -f
{
"conf": {
...
"spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1"
}
}
Data formats​
Avro​
Tecton uses Avro format for Feature Logging datasets.
%%configure -f
{
"conf": {
...
"spark.jars": "local:/usr/lib/spark/external/lib/spark-avro.jar"
}
}
Additional python libraries​
To install libraries from the Python Package Index (PyPI), you can run a command like this at any time after running the initial `%%configure` command:
sc.install_pypi_package("pandas==1.1.5")
Here, `sc` refers to the Spark Context created for the notebook session. It is created automatically and doesn't need to be explicitly defined in PySpark notebooks.
Updating EMR versions​
Updating from 6.4 to 6.5​
- Select your existing Tecton notebook cluster on the EMR clusters tab and click Clone.
- Change the EMR release version dropdown to `emr-6.5.0`.
- If your previous cluster was using the log4j mitigation bootstrap script, update the bootstrap actions to use the script corresponding to EMR version 6.5.
- Click Create cluster.
Updating from 6.5 to 6.7+​
- Select your existing Tecton notebook cluster on the EMR clusters tab and click Clone.
- Change the EMR release version dropdown to `emr-6.7.0`.
- Modify or add the following Bootstrap actions script and argument(s) to install additional dependencies:
  - script: `s3://tecton.ai.public/install_scripts/install_python_libraries_from_pypi.sh`
  - args: `virtualenv`
- If your previous cluster was using the log4j mitigation bootstrap script, remove it; it is no longer needed for EMR 6.7+.
- Click Create cluster.
Additional resources​
Amazon EMR Notebooks Documentation
Install Python libraries on a running cluster with EMR Notebooks