Using Tecton on Spark with Third-Party Notebooks
Third-party notebooks that run outside of EMR and Databricks can use Apache Livy to connect to a Tecton on Spark (EMR or Databricks) cluster.
Livy is a REST API that allows a notebook to send commands to a remote Spark cluster and receive results.
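To make the mechanism concrete, here is a minimal sketch of the raw REST exchange, assuming a reachable Livy endpoint (Livy listens on port 8998 by default; the hostname below is a placeholder). The sparkmagic extension installed later in this guide wraps this protocol for you.

```python
import requests

# Placeholder hostname; point this at the Livy server on your Spark cluster.
LIVY = "http://cluster-master:8998"

# Create an interactive PySpark session on the remote cluster.
session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()

# In practice, poll GET {LIVY}/sessions/{id} until the session state is "idle".
# Then submit a statement; Livy executes it on the cluster and returns the result.
requests.post(
    f"{LIVY}/sessions/{session['id']}/statements",
    json={"code": "print(spark.version)"},
)
```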
Tecton Support does not provide assistance for problems that arise from third-party notebooks connected to a Tecton notebook cluster. If you are comfortable using Apache Livy to connect third-party notebooks, the instructions below are provided for informational purposes.
Tecton recommends using Databricks or EMR notebooks connected to a Tecton notebook cluster rather than third-party notebook environments; in our testing, Databricks and EMR notebooks were more reliable.
Using Livy to connect to Tecton via a third-party notebook
Follow these steps:
1. Permissions
Spark must have the appropriate permissions to access your Tecton data sources. This can be accomplished by either:

- Running the local Spark context on an EC2 instance whose instance profile has access to the necessary data sources.
- Manually providing AWS credentials that grant access to the data sources by setting AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables (see the sketch below).
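For the second option, here is a minimal sketch of setting the credentials from inside the notebook; the values are placeholders, and they must be set before the Spark session is created.

```python
import os

# Placeholder values; use credentials that can read your Tecton data sources.
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"
```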
2. Software
The following software must be installed on the machine that will be running the notebook:
- Java 8 or Java 11
- Python 3.7 or 3.8
- pip
3. Initialize cluster secrets and parameters
The following secrets must be created, either in AWS Secrets Manager or as environment variables:

- As AWS secrets: refer to this docs page for how to initialize them in Secrets Manager.
- As environment variables: create the API_SERVICE and TECTON_API_KEY environment variables, either in your shell or directly in your notebook using os.environ[] (see the sketch below). Refer to the docs page above for the values to put in these variables.
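For example, a minimal sketch of setting both variables from inside the notebook; the values are placeholders, so substitute the ones from the docs page referenced above.

```python
import os

# Placeholder values; see the docs page referenced above for the real ones.
os.environ["API_SERVICE"] = "https://<your-deployment>.tecton.ai/api"
os.environ["TECTON_API_KEY"] = "<your-api-key>"
```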
4. Install Tecton and Spark
Run pip install 'tecton[pyspark]'
5. Install the sparkmagic Jupyter/Jupyter Lab extension
This extension provides additional information when running Spark jobs in a local notebook.
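Installation typically looks like the following; consult the sparkmagic README for the authoritative, up-to-date steps, since the extension-enabling commands vary by Jupyter version.

```
pip install sparkmagic
jupyter nbextension enable --py --sys-prefix widgetsnbextension
```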
6. Initialize the PySpark session
In your notebook, run the following commands. On the builder = builder.config(...) line, add any jars required to interact with AWS; see this page for a list of potential additional jars, depending on your data sources.
```python
from pyspark.sql import SparkSession

builder = SparkSession.builder
# Add the jars needed to interact with AWS (see the list referenced above).
builder = builder.config(
    "spark.jars.packages",
    "org.apache.hadoop:hadoop-aws:3.2.0,com.amazonaws:aws-java-sdk-bundle:1.11.375",
)
# Set the S3 client implementation:
builder = builder.config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark = builder.getOrCreate()
```
7. Run your Tecton commands
Once the Spark session is created, Tecton's SDK will automatically pick up the session and use it. From this point forward, you'll be able to run Tecton SDK commands using your local Spark context.
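For example, a quick check that the SDK is wired up; the workspace and feature view names here are hypothetical, so substitute your own.

```python
import tecton

# List the workspaces visible to your API key as a connectivity check.
print(tecton.list_workspaces())

# Hypothetical workspace and feature view names.
ws = tecton.get_workspace("prod")
fv = ws.get_feature_view("user_transaction_counts")
```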