Notebooks
We currently only support regular Python notebooks and notebooks using PySpark 3.
Config
This section explains all configuration options available for notebooks.
To create notebooks, you can either edit the notebooks.yaml file directly or answer the prompts shown when using the --configure flag.
An example notebook config looks like this:
notebooks:
- name: default
  mode: web-ui
  maxIdleTime: 60
  notebooksDir: notebooks
  srcDir: src
  pythonSpec:
    pythonVersion: "3.12"
    awsRole: sample-python-notebooks-{{ .Env }}
    instanceType: mx.nano
    instanceLifeCycle: spot
In this example we can see that the notebooks.yaml
file consists of a list of notebook templates.
These templates will be used when creating new notebooks and can also be shared with other team members.
The notebook name must be unique for every (project, environment, user) triplet.
The following settings are available:
Parameter | Type | Default | Explanation |
---|---|---|---|
mode | enum | web-ui | How you want to interact with your notebooks (web-ui or ide). |
maxIdleTime | int | 60 | The maximum time a notebook can be idle before it is deleted. |
notebooksDir | str | notebooks | The directory containing the notebook files. |
srcDir | str | src | The directory containing the src files. |
customDockerfile | str | | The location of the custom notebook Dockerfile. |
pythonVersion | str | | The Python version used in your project. |
awsRole | str | | The AWS role used by your notebook. |
azureApplicationClientId | str | | The Azure Service Principal used by the container. |
instanceType | str | mx.small | The Conveyor instance type to use for your notebook. This specifies how much CPU and memory can be used. |
instanceLifeCycle | str | on-demand | The lifecycle of the instance used to run the notebook. Options are on-demand or spot. |
envVariables | map | | Extra environment variables or secrets you want to mount inside the notebook container. |
diskSize | int | 10 | The amount of storage (in GB) reserved for your notebook to store your code and other local files. This cannot be changed after creating the notebook. |
Templating
In the notebook configuration, you can also apply templating. This is useful if you want to change certain settings according to the environment you are deploying to.
We support filling in the environment name by using {{ .Env }}. For example:
notebooks:
- name: default
  pythonSpec:
    awsRole: sample-python-notebooks-{{ .Env }}
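When deploying to an environment named dev (a hypothetical environment name used purely for illustration), this template would render as:

notebooks:
- name: default
  pythonSpec:
    awsRole: sample-python-notebooks-dev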
The underlying templating engine is the Go templating engine; see the Go templating documentation for more details.
Instances
Conveyor supports the following instance types for all jobs:
Instance type | CPU | Total Memory (AWS) | Total Memory (Azure) |
---|---|---|---|
mx.nano | 1* | 0.438 GB | 0.434 GB |
mx.micro | 1* | 0.875 GB | 0.868 GB |
mx.small | 1* | 1.75 GB | 1.736 GB |
mx.medium | 1 | 3.5 GB | 3.47 GB |
mx.large | 2 | 7 GB | 6.94 GB |
mx.xlarge | 4 | 14 GB | 13.89 GB |
mx.2xlarge | 8 | 29 GB | 30.65 GB |
mx.4xlarge | 16 | 59 GB | 64.16 GB |
cx.nano | 1* | 0.219 GB | Not supported |
cx.micro | 1* | 0.438 GB | Not supported |
cx.small | 1* | 0.875 GB | Not supported |
cx.medium | 1 | 1.75 GB | Not supported |
cx.large | 2 | 3.5 GB | Not supported |
cx.xlarge | 4 | 7 GB | Not supported |
cx.2xlarge | 8 | 14 GB | Not supported |
cx.4xlarge | 16 | 29 GB | Not supported |
rx.xlarge | 4 | 28 GB | Not supported |
rx.2xlarge | 8 | 59 GB | Not supported |
rx.4xlarge | 16 | 120 GB | Not supported |
(*) These instance types do not get a guaranteed full CPU but only a slice of one; however, they are allowed to burst up to a full CPU if the cluster allows.
The numbers for AWS and Azure differ because nodes on both clouds run different DaemonSets and have different reservation requirements set by the provider. We aim to minimize the node overhead as much as possible while still obeying the minimum requirements of each cloud provider.
Instance life cycle
On the notebook container you can set an instance life cycle. This determines whether your notebook runs on on-demand or spot instances. Spot instances can result in discounts of up to 90% compared to on-demand prices. The downside is that your container can be canceled when AWS reclaims such a spot instance, which is what we call a spot interrupt. Fortunately, this rarely happens in practice.
For notebooks, two life cycle options are supported:
- on-demand: The container will be run on on-demand instances. This ensures the notebook does not get deleted by AWS, but it takes longer to start up.
- spot: The container will be run on spot instances. This is the most cost-efficient method, but the container can be killed by a spot instance interruption, in which case all your changes will be lost.
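For example, to run a notebook on a larger spot instance you could set both options together. This is a sketch that mirrors the structure of the example at the top of this page; mx.large is just an illustrative choice:

notebooks:
- name: default
  pythonSpec:
    instanceType: mx.large
    instanceLifeCycle: spot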
Env variables
We support adding environment variables to your notebook container. These can be plain values, but also secrets coming from SSM or Secrets Manager. The latter two make it possible to securely expose secrets inside your notebook container.
Specifying environment variables for your notebook is done as follows:
notebooks:
- name: default
  pythonSpec:
    envVariables:
      foo:
        value: bar
      testSSM:
        awsSSMParameterStore:
          name: /conveyor-dp-samples
      testSecretManager:
        awsSecretsManager:
          name: conveyor-dp-samples
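Inside the notebook container, these entries are exposed as regular environment variables. A minimal sketch of reading the foo variable defined above (the secret-backed variables are read the same way, assuming the role permissions described below are in place):

import os

# "foo" was defined with the plain value "bar" in the config above
print(os.environ["foo"])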
In order to mount secrets, the AWS role attached to the notebook container should have permissions to access the secrets. Adding a policy to an AWS role to read SSM parameters/secrets with Terraform is done as follows:
data "aws_iam_policy_document" "allow secrets" {
statement {
actions = [
"ssm:GetParametersByPath",
"ssm:GetParameters",
"ssm:GetParameter",
]
resources = [
"arn:aws:ssm:Region:AccountId:parameter/conveyor-dp-samples/*"
]
effect = "Allow"
}
statement {
actions = [
"secretsmanager:DescribeSecret",
"secretsmanager:List*",
"secretsmanager:GetSecretValue"
]
resources = [
"arn:aws:secretsmanager:Region:AccountId:secret:conveyor-dp-samples"
]
effect = "Allow"
}
}
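The policy document above only defines the permissions. A sketch of attaching it to the notebook role could look like the snippet below, assuming the role is managed in the same Terraform configuration; the aws_iam_role.sample_python_notebooks reference is hypothetical and should point to your own role resource:

resource "aws_iam_role_policy" "allow_secrets" {
  name   = "allow-secrets"
  # Hypothetical reference to the role used by the notebook
  role   = aws_iam_role.sample_python_notebooks.id
  policy = data.aws_iam_policy_document.allow_secrets.json
}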
Stopping your web UI notebook
When using a web UI notebook, it is also possible to stop your notebook and start it again later.
All files stored in your workspace /home/jovyan/work are saved across restarts.
Your virtual environment is also installed there and will be persisted.
To stop a notebook, use the CLI command conveyor notebook stop or the stop button in the UI.
To start a stopped notebook, use conveyor notebook start or the UI.
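For reference, the two CLI commands mentioned above are shown below. Depending on your setup, you may need extra flags to select a specific notebook or environment; check the CLI help for details:

conveyor notebook stop
conveyor notebook start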
Custom Dockerfile
You can use a custom Dockerfile for building notebooks instead of the default template. We recommend starting from the default template, which can be obtained with the following command:
conveyor notebook export -f notebook_Dockerfile
To use this notebook_Dockerfile, you should update the notebooks.yaml file as follows:
notebooks:
- name: default
  customDockerfile: notebook_Dockerfile
The custom Dockerfile should only contain logic to build and work with your src/notebook files or install additional packages required for running your notebooks.
The base image handles everything related to the Jupyter setup and the installed kernels.
The base image starts from an Ubuntu LTS version (currently 24.04).
For common extensions to the default notebook Dockerfile, we created a how-to page.
Base image templating
The FROM statement in your Dockerfile may be templated as follows:
FROM {{ .BaseNotebookImage }}
This will fill in the correct base image according to the specified python version and the current Conveyor version.
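As an illustration, a minimal custom Dockerfile could look like the sketch below. The extra pip packages are hypothetical and only indicate where project-specific installs would go, assuming pip from the base image's virtual environment is on the PATH:

# Use the Conveyor notebook base image matching your configured Python version
FROM {{ .BaseNotebookImage }}

# Install additional packages needed by your notebooks (hypothetical examples)
RUN pip install --no-cache-dir matplotlib seaborn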
Another option is to pin the base image to avoid unintended updates or changes to the Docker image.
In that case, you should specify the full image name:
FROM 776682305951.dkr.ecr.eu-west-1.amazonaws.com/conveyor/data-plane/notebook-python:3.12-{conveyor_version}
Constraints
- Your base image should be a notebook base image provided by Conveyor.
- Make sure that all files you want to use in the UI are under the /home/jovyan/work/{conveyor_project} directory; these files will also be persisted across restarts.
- At the moment, all our notebook base images use virtual environments and thus do not support conda.
Notebook base images
Currently, we only provide Python notebook images. The prefix for all images is:
776682305951.dkr.ecr.eu-west-1.amazonaws.com/conveyor/data-plane/
The list of supported notebook images, together with their respective Python and Spark version, is as follows:
name | Python | Spark |
---|---|---|
notebook-python:3.9 | 3.9 | 3.5.2-hadoop-3.3.6 |
notebook-python:3.10 | 3.10 | 3.5.2-hadoop-3.3.6 |
notebook-python:3.11 | 3.11 | 3.5.2-hadoop-3.3.6 |
notebook-python:3.12 | 3.12 | 3.5.2-hadoop-3.3.6 |
PySpark
We currently only support Spark in client mode for notebooks, which means that Spark runs within your notebook container on a single node. You can scale vertically by changing the instance type to increase the amount of memory/CPU available for Spark.
More details on the supported versions can be found in the section Notebook base images. Similar to Conveyor projects, we provide the necessary libraries to interact with AWS and Azure.
To use Spark in your Jupyter notebook, you only need to create a Spark session as follows:
from pyspark.sql import SparkSession

session = (
    SparkSession.builder.appName("pyspark sample")
    .enableHiveSupport()
    .getOrCreate()
)
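Since Spark runs in client mode inside the container, the session can be used directly in subsequent cells. As a sketch, reading a hypothetical dataset from S3 relies on the AWS libraries mentioned above and on the awsRole attached to the notebook having read access to the bucket:

# Hypothetical S3 location; the notebook's AWS role must be allowed to read it
df = session.read.parquet("s3a://my-bucket/path/to/dataset/")
df.show()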