Notebooks

caution

We currently only support regular Python notebooks and notebooks using PySpark 3.

Config

This section explains all configuration options available for notebooks. To create notebooks, you can edit the notebooks.yml file or answer the prompted questions when using the --configure flag. An example notebook config looks like this:

notebooks:
  - name: default
    mode: web-ui
    maxIdleTime: 60
    notebooksDir: notebooks
    srcDir: src
    pythonSpec:
      pythonVersion: "3.12"
      awsRole: sample-python-notebooks-{{ .Env }}
      instanceType: mx.nano
      instanceLifeCycle: spot

In this example we can see that the notebooks.yaml file consists of a list of notebook templates. These templates will be used when creating new notebooks and can also be shared with other team members.

caution

The notebook name must be unique for every (project, environment, user) triplet.

The following settings are available:

Parameter	Type	Default	Explanation
mode	enum	web-ui	How you want to interact with your notebooks (`web-ui` or `ide`).
maxIdleTime	int	60	The maximum time a notebook can be idle before it gets deleted.
notebooksDir	str	notebooks	The directory containing the notebook files.
srcDir	str	src	The directory containing the src files.
customDockerfile	str		The location of the custom notebook dockerfile.
pythonVersion	str		The python version used in your project.
awsRole	str		The AWS role used by your notebook.
azureApplicationClientId	string		The Azure Service Principal used by the container.
instanceType	str	mx.small	The Conveyor instance type to use for your notebook. This specifies how much CPU and memory can be used.
instanceLifeCycle	string	on-demand	The lifecycle of the instance used to run the notebook. Options are `on-demand`, `spot`.
envVariables	map		Extra environment variables or secrets you want to mount inside the notebook container.
diskSize	int	10	The amount of storage (in GB) reserved for your notebook to store your code and other local files. This can not be changed after creating the notebook.

Templating

In the notebook configuration, you can also apply templating. This is useful if you want to change certain settings according to the environment you are deploying too.

We support filling in the environment name by using {{ .Env }}. For example:

notebooks:
- name: default
  pythonSpec:
    awsRole: sample-python-notebooks-{{ .Env }}

The underlying templating engine used is the golang templating engine, the docs of that can be found here.

Instances

Conveyor supports the following instances types for all jobs:

Instance type	CPU	Total Memory (AWS)	Total Memory (Azure)
mx.nano	1*	0.438 Gb	0.434 Gb
mx.micro	1*	0.875 Gb	0.868 Gb
mx.small	1*	1.75 Gb	1.736 Gb
mx.medium	1	3.5 Gb	3.47 Gb
mx.large	2	7 Gb	6.94 Gb
mx.xlarge	4	14 Gb	13.89 Gb
mx.2xlarge	8	29 Gb	30.65 Gb
mx.4xlarge	16	59 Gb	64.16 Gb
cx.nano	1*	0.219 Gb	Not supported
cx.micro	1*	0.438 Gb	Not supported
cx.small	1*	0.875 Gb	Not supported
cx.medium	1	1.75 Gb	Not supported
cx.large	2	3.5 Gb	Not supported
cx.xlarge	4	7 Gb	Not supported
cx.2xlarge	8	14 Gb	Not supported
cx.4xlarge	16	29 Gb	Not supported
rx.xlarge	4	28 Gb	Not supported
rx.2xlarge	8	59 Gb	Not supported
rx.4xlarge	16	120 Gb	Not supported

info

(*) These instance types don't get a guaranteed full CPU but only a slice of a full CPU, but they are allowed to burst up to a full CPU if the cluster allows.

The numbers for AWS and Azure differ because nodes on both clouds run different DaemonSets and have different reservation requirements set by the provider. We aim to minimize the node overhead as much as possible while still obeying the minimum requirements of each cloud provider.

Instance life cycle

On the notebook container we can set an instance life cycle. This will result in your job running on on-demand or on spot instances. Spot instances can result in discounts of up to 90% compared to on-demand prices. The downside is that your container can be canceled when AWS reclaims such a spot instance, which is what we call a spot interrupt. Luckily this rarely happens in practice.

For notebooks, two life cycle options are supported:

on-demand: The container will be run on on-demand instances. This will ensure the notebook does not get deleted by AWS, but it takes longer to start up.
spot: The container will be run on spot instances. This is the most cost-efficient method, but the container can be killed by a spot instance interruption in which case all your changes will be lost.

Env variables

We support adding environment variables to your notebook container. These can be plain values, but also secrets coming from SSM or Secrets Manager. These last two make it possible to mount and expose secrets securely into your notebook container.

Specifying environment variables for your notebook is done as follows:

notebooks:
- name: default
  pythonSpec:
    envVariables:
      foo:
        value: bar
      testSSM:
        awsSSMParameterStore:
          name: /conveyor-dp-samples
      testSecretManager:
        awsSecretsManager:
          name: conveyor-dp-samples

In order to mount secrets, the AWS role attached to the notebook container should have permissions to access the secrets. Adding a policy to an AWS role to read SSM parameters/secrets with Terraform is done as follows:

data "aws_iam_policy_document" "allow secrets" {
  statement {
    actions = [
      "ssm:GetParametersByPath",
      "ssm:GetParameters",
      "ssm:GetParameter",
    ]
    resources = [
      "arn:aws:ssm:Region:AccountId:parameter/conveyor-dp-samples/*"
    ]
    effect = "Allow"
  }

  statement {
    actions = [
      "secretsmanager:DescribeSecret",
      "secretsmanager:List*",
      "secretsmanager:GetSecretValue"
    ]
    resources = [
      "arn:aws:secretsmanager:Region:AccountId:secret:conveyor-dp-samples"
    ]
    effect = "Allow"
  }
}

Stopping your web UI notebook

When using a Web UI notebook, it's also possible to stop your notebook and start it again later again. All files stored in your workspace /home/jovyan/work are saved across restarts. Your virtual environment is also installed there and will be persisted. To stop a notebook, you can use the CLI: conveyor notebook stop, or use the stop button in the UI. For starting a stopped notebook use conveyor notebook start, or use the UI.

Custom Dockerfile

You are able to use a custom Dockerfile for building notebooks instead of using the default template. We recommend that you start making changes based on the default template, which can be obtained through the following command:

conveyor notebook export -f notebook_Dockerfile

To use this notebook_Dockerfile, you should update the notebooks.yaml file as follows:

notebooks:
- name: default
  customDockerfile: notebook_Dockerfile

The custom Dockerfile should only contain logic to build and work with your src/notebook files or install additional packages required for running your notebooks. The base image handles everything related to the Jupyter setup and the installed kernels.

The base image starts from an Ubuntu LTS version (currently 24.04).

For common extensions to the default notebook Dockerfile, we created a how-to page.

Base image templating

The FROM statement in your Dockerfile, may be templated as follows:

FROM {{ .BaseNotebookImage }}

This will fill in the correct base image according to the specified python version and the current Conveyor version. Another option is to fix the base image, to avoid updates/changes to the docker image. In that case you should specify the full image name: FROM 776682305951.dkr.ecr.eu-west-1.amazonaws.com/conveyor/data-plane/notebook-python:3.12-{conveyor_version}

Constraints

Your base image should be a notebook base image provided by Conveyor.
Make sure that all files you want to use in the UI, are under the /home/jovyan/work/{conveyor_project} directory, these files will also be persisted across restarts.
At the moment, all our notebook base images use virtual environments and thus do not support conda.

Notebook base images

Currently, we only provide python notebook images. The prefix for all images is:

776682305951.dkr.ecr.eu-west-1.amazonaws.com/conveyor/data-plane/

The list of supported notebook images, together with their respective Python and Spark version, is as follows:

name	Python	Spark
notebook-python:3.9	3.9	3.5.2-hadoop-3.3.6
notebook-python:3.10	3.10	3.5.2-hadoop-3.3.6
notebook-python:3.11	3.11	3.5.2-hadoop-3.3.6
notebook-python:3.12	3.12	3.5.2-hadoop-3.3.6

PySpark

We currently only support Spark in client mode for notebooks, which means that Spark runs within your notebook container on a single node. You can scale out vertically by changing the instance type to increase the amount of memory/cpu available for Spark.

More details on the supported versions can be found in the section Notebook base images. Similar as for conveyor projects, we provide the necessary libraries to interact with AWS and Azure.

To use Spark in your Jupyter notebook, you only need to create a Spark session as follows:

from pyspark.sql import SparkSession

session = (
    SparkSession.builder.appName("pyspark sample")
    .enableHiveSupport()
    .getOrCreate()
)

Config​

Templating​

Instances​

Instance life cycle​

Env variables​

Stopping your web UI notebook​

Custom Dockerfile​

Base image templating​

Constraints​

Notebook base images​

PySpark​