
Notebooks

caution

We currently only support regular Python notebooks and notebooks using PySpark 3.

Config

This section explains all configuration options available for notebooks. To create notebooks, you can edit the notebooks.yaml file or answer the prompts when using the --configure flag. An example notebook config looks like this:

notebooks:
  - name: default
    mode: web-ui
    maxIdleTime: 60
    notebooksDir: notebooks
    srcDir: src
    pythonSpec:
      pythonVersion: "3.9"
      awsRole: sample-python-notebooks-{{ .Env }}
      instanceType: mx.nano
      instanceLifeCycle: spot

In this example we can see that the notebooks.yaml file consists of a list of notebook templates. These templates will be used when creating new notebooks and can also be shared with other team members.

caution

The notebook name must be unique for every (project, environment, user) triplet.

The following settings are available:

| Parameter | Type | Default | Explanation |
| --- | --- | --- | --- |
| mode | enum | web-ui | How you want to interact with your notebooks (web-ui or ide). |
| maxIdleTime | int | 60 | The maximum time a notebook can be idle before it gets deleted. |
| notebooksDir | str | notebooks | The directory containing the notebook files. |
| srcDir | str | src | The directory containing the src files. |
| customDockerfile | str | | The location of the custom notebook Dockerfile. |
| pythonVersion | str | | The Python version used in your project. |
| awsRole | str | | The AWS role used by your notebook. |
| azureApplicationClientId | string | | The Azure service principal used by the container. |
| instanceType | str | mx.small | The Conveyor instance type to use for your notebook. This specifies how much CPU and memory can be used. |
| instanceLifeCycle | string | on-demand | The lifecycle of the instance used to run the notebook. Options are on-demand and spot. |
| envVariables | map | | Extra environment variables or secrets you want to mount inside the notebook container. |
| diskSize | int | 10 | The amount of storage (in GB) reserved for your notebook to store your code and other local files. This cannot be changed after creating the notebook. |
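
For reference, here is a sketch of a config that overrides a few of these defaults. The values are illustrative, and the placement of diskSize under pythonSpec alongside the other resource settings is an assumption:

notebooks:
  - name: default
    mode: web-ui
    maxIdleTime: 120
    pythonSpec:
      pythonVersion: "3.10"
      instanceType: mx.large
      instanceLifeCycle: on-demand
      diskSize: 20  # GB of local storage; cannot be changed after the notebook is created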

Templating

In the notebook configuration, you can also apply templating. This is useful if you want to change certain settings according to the environment you are deploying to.

We support filling in the environment name by using {{ .Env }}. For example:

notebooks:
  - name: default
    pythonSpec:
      awsRole: sample-python-notebooks-{{ .Env }}

The underlying templating engine is the Go templating engine; its documentation can be found here.
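
For instance, assuming an environment named dev, the template above would render as follows (sketch for illustration only):

notebooks:
  - name: default
    pythonSpec:
      awsRole: sample-python-notebooks-dev  # {{ .Env }} replaced by the environment name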

Instances

Conveyor supports the following instance types for all jobs:

| Instance type | CPU | Total Memory (AWS) | Total Memory (Azure) |
| --- | --- | --- | --- |
| mx.nano | 1* | 0.438 Gb | 0.375 Gb |
| mx.micro | 1* | 0.875 Gb | 0.75 Gb |
| mx.small | 1* | 1.75 Gb | 1.5 Gb |
| mx.medium | 1 | 3.5 Gb | 3 Gb |
| mx.large | 2 | 7 Gb | 6 Gb |
| mx.xlarge | 4 | 14 Gb | 12 Gb |
| mx.2xlarge | 8 | 29 Gb | 26 Gb |
| mx.4xlarge | 16 | 59 Gb | 55 Gb |
| cx.nano | 1* | 0.219 Gb | Not supported |
| cx.micro | 1* | 0.438 Gb | Not supported |
| cx.small | 1* | 0.875 Gb | Not supported |
| cx.medium | 1 | 1.75 Gb | Not supported |
| cx.large | 2 | 3.5 Gb | Not supported |
| cx.xlarge | 4 | 7 Gb | Not supported |
| cx.2xlarge | 8 | 14 Gb | Not supported |
| cx.4xlarge | 16 | 29 Gb | Not supported |
| rx.xlarge | 4 | 28 Gb | Not supported |
| rx.2xlarge | 8 | 59 Gb | Not supported |
| rx.4xlarge | 16 | 120 Gb | Not supported |
info

(*) These instance types don't get a guaranteed full CPU, only a slice of one, but they are allowed to burst up to a full CPU if the cluster allows it.

The numbers for AWS and Azure differ because nodes on both clouds run different DaemonSets and have different reservation requirements set by the provider. We aim to minimize the node overhead as much as possible while still obeying the minimum requirements of each cloud provider.

Instance life cycle

For the notebook container, you can set an instance life cycle, which determines whether your job runs on on-demand or spot instances. Spot instances can result in discounts of up to 90% compared to on-demand prices. The downside is that your container can be canceled when AWS reclaims such a spot instance, which is what we call a spot interrupt. Luckily, this rarely happens in practice.

For notebooks, two life cycle options are supported:

  • on-demand: The container will run on on-demand instances. This ensures the notebook does not get deleted by AWS, but it takes longer to start up.
  • spot: The container will run on spot instances. This is the cheapest option, but the container can be killed by a spot instance interruption, in which case all your changes will be lost.
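
As a sketch, selecting the spot life cycle in the notebook config could look like this, assuming instanceLifeCycle sits under pythonSpec as in the earlier example:

notebooks:
  - name: default
    pythonSpec:
      instanceLifeCycle: spot  # or on-demand, which is the default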

Env variables

We support adding environment variables to your notebook container. These can be plain values, but also secrets coming from AWS SSM Parameter Store or Secrets Manager. The latter two make it possible to securely mount and expose secrets inside your notebook container.

Specifying environment variables for your notebook is done as follows:

notebooks:
  - name: default
    pythonSpec:
      envVariables:
        foo:
          value: bar
        testSSM:
          awsSSMParameterStore:
            name: /conveyor-dp-samples
        testSecretManager:
          awsSecretsManager:
            name: conveyor-dp-samples

In order to mount secrets, the AWS role attached to the notebook container should have permissions to access the secrets. Adding a policy to an AWS role to read SSM parameters/secrets with Terraform is done as follows:

data "aws_iam_policy_document" "allow secrets" {
statement {
actions = [
"ssm:GetParametersByPath",
"ssm:GetParameters",
"ssm:GetParameter",
]
resources = [
"arn:aws:ssm:Region:AccountId:parameter/conveyor-dp-samples/*"
]
effect = "Allow"
}

statement {
actions = [
"secretsmanager:DescribeSecret",
"secretsmanager:List*",
"secretsmanager:GetSecretValue"
]
resources = [
"arn:aws:secretsmanager:Region:AccountId:secret:conveyor-dp-samples"
]
effect = "Allow"
}
}
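
Once the role has these permissions, point your notebook at it through the awsRole setting described above; a minimal sketch, reusing the role and parameter names from the examples in this section:

notebooks:
  - name: default
    pythonSpec:
      awsRole: sample-python-notebooks-{{ .Env }}  # role that carries the SSM/Secrets Manager policy
      envVariables:
        testSSM:
          awsSSMParameterStore:
            name: /conveyor-dp-samples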

Stopping your web UI notebook

When using a web UI notebook, it's also possible to stop your notebook and start it again later. All files stored in your workspace /home/jovyan/work are saved across restarts. Your virtual environment is also installed there and will be persisted. To stop a notebook, use the CLI command conveyor notebook stop, or use the stop button in the UI. To start a stopped notebook, use conveyor notebook start, or use the UI.

Custom Dockerfile

You can use a custom Dockerfile for building notebooks instead of the default template. We recommend starting from the default template and making changes to it; the template can be obtained through the following command:

conveyor notebook export -f notebook_Dockerfile

To use this notebook_Dockerfile, you should update the notebooks.yaml file as follows:

notebooks:
  - name: default
    customDockerfile: notebook_Dockerfile

The custom Dockerfile should only contain logic to build and work with your src/notebook files or install additional packages required for running your notebooks. The base image handles everything related to the Jupyter setup and the installed kernels.

The base image starts from an Ubuntu LTS version (currently 20.04).

For common extensions to the default notebook Dockerfile, we created a how-to page.

Base image templating

The FROM statement in your Dockerfile may be templated as follows:

FROM {{ .BaseNotebookImage }}

This will fill in the correct base image according to the specified Python version and the current Conveyor version. Another option is to pin the base image to avoid updates/changes to the Docker image. In that case, you should specify the full image name: FROM 776682305951.dkr.ecr.eu-west-1.amazonaws.com/conveyor/data-plane/notebook-python:3.9-conveyor_version

Constraints

  • Your base image should be a notebook base image provided by Conveyor
  • Make sure that all files you want to use in the UI are under the /home/jovyan/work/conveyor_project directory; these files will also be persisted across restarts
  • At the moment, all our notebook base images use virtual environments and thus do not support conda

Notebook base images

Currently, we only provide Python notebook images. The prefix for all images is:

776682305951.dkr.ecr.eu-west-1.amazonaws.com/conveyor/data-plane/

The list of supported notebook images, together with their respective python and spark version, is as follows:

| name | Python | Spark |
| --- | --- | --- |
| notebook-python:3.7 | 3.7 | 3.2.1-hadoop-3.3.1 |
| notebook-python:3.8 | 3.8 | 3.2.1-hadoop-3.3.1 |
| notebook-python:3.9 | 3.9 | 3.2.1-hadoop-3.3.1 |
| notebook-python:3.10 | 3.10 | 3.2.1-hadoop-3.3.1 |
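
The base image is selected through the pythonVersion setting in your notebook config; for example, the following sketch would resolve to the notebook-python:3.10 image:

notebooks:
  - name: default
    pythonSpec:
      pythonVersion: "3.10"  # picks the matching notebook-python base image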

PySpark

We currently only support Spark in client mode for notebooks, which means that Spark runs within your notebook container on a single node. You can scale vertically by changing the instance type to increase the amount of memory/CPU available to Spark.
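
For example, to give Spark more memory and CPU, you could raise the instance type in the notebook config; a sketch with an illustrative size:

notebooks:
  - name: default
    pythonSpec:
      instanceType: mx.2xlarge  # 8 CPUs, 29 Gb on AWS (see the instance table above)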

More details on the supported versions can be found in the section Notebook base images. As with Conveyor projects, we provide the necessary libraries to interact with AWS and Azure.

To use Spark in your Jupyter notebook, you only need to create a Spark session as follows:

from pyspark.sql import SparkSession

session = (
    SparkSession.builder.appName("pyspark sample")
    .enableHiveSupport()
    .getOrCreate()
)