
Creating your own base images

caution

Base images are currently in preview; we reserve the right to make breaking changes if we need to. However, we will do our best to keep things backwards compatible.

Base images enable standardization of IDEs across projects within your organization. They allow an organization to package a set of tools and configuration that is common to many of its projects. A base image is the starting point for an IDE and is essentially the FROM statement in a Dockerfile.

The specification of a base image is done in the ide.yaml file, which is described in detail on the ide page.

Every Conveyor installation comes with a default base image, managed by the Conveyor team, from which every other base image starts. This how-to guide helps you create your own base images for specific use cases, which you can make available to all teams, similar to the default base image.

We provide three example base images to start with: a dbt base image, a PySpark base image, and a notebook base image.

If you are not familiar with how to write multiline strings in YAML, please refer to the YAML multiline strings section.

dbt base image

The goal of this base image is to provide a standard environment for dbt projects. This allows anyone to get started quickly on a dbt project without having to install all the dependencies themselves.

A good place to start for a dbt ide.yaml is as follows:

vscode:
  extensions:
    - innoverio.vscode-dbt-power-user
    - dorzey.vscode-sqlfluff
    - mtxr.sqltools
    - mtxr.sqltools-driver-pg
    - RandomFractalsInc.duckdb-sql-tools
    - koszti.snowflake-driver-for-sqltools
    - kj.sqltools-driver-redshift
    - regadas.sqltools-trino-driver
buildSteps:
  - name: install dbt with the standard adapters
    cmd: |
      sudo apt-get update
      sudo apt-get install -y python3-pip
      sudo pip3 install dbt-core==1.7.8 dbt-duckdb==1.7.1 dbt-postgres==1.7.8 dbt-redshift==1.7.3 dbt-snowflake==1.7.2 dbt-trino==1.7.1
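
To check that the build step worked, you can open a terminal in an IDE based on this image and ask dbt for its version; dbt debug additionally validates your profile and connection, assuming your project already contains one. This is only an illustrative check, not part of the base image itself:

dbt --version   # lists dbt-core and the installed adapter plugins
dbt debug       # validates profiles.yml and the adapter connection for the current project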

PySpark base image

The goal of this base image is to make sure that the necessary libraries are installed for the PySpark environment to work.

A good starting point for the PySpark ide.yaml is as follows:

vscode:
  extensions:
    - ms-toolsai.jupyter
    - ms-python.python
buildSteps:
  - name: install openjdk
    cmd: |
      echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
      sudo apt-get update && sudo apt-get install -y --no-install-recommends gcc g++ software-properties-common openjdk-11-jre unzip curl
  - name: add spark libraries with conveyor specific patches and add them to the python environment
    cmd: |
      curl -X GET https://static.conveyordata.com/spark/spark-3.5.1-hadoop-3.3.6-v1.zip -o spark.zip && sudo unzip ./spark.zip -d /opt && rm ./spark.zip && sudo chmod -R 777 /opt/spark
      echo 'source /opt/spark/sbin/spark-config.sh' >> ~/.bashrc
  - name: set default spark configuration for aws and azure
    cmd: |
      mkdir -p /opt/spark/conf
      cat <<-EOF > /opt/spark/conf/spark-defaults.conf
      spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
      spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain
      spark.kubernetes.pyspark.pythonVersion 3
      spark.hadoop.hive.metastore.client.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
      spark.hadoop.hive.imetastoreclient.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
      spark.eventLog.enabled false
      spark.hadoop.fs.azure.account.auth.type Custom
      spark.hadoop.fs.azure.account.oauth.provider.type cloud.datafy.azure.auth.MsalTokenProvider
      EOF
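
Once an IDE is started from this image, a short PySpark snippet is a convenient smoke test to confirm that Spark, the Java runtime, and the Python bindings are wired up correctly. This is only an illustrative check in local mode, not part of the base image itself:

from pyspark.sql import SparkSession

# Run locally; the cluster- and cloud-specific settings from spark-defaults.conf
# are not needed for this check.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("base-image-smoke-test")
    .getOrCreate()
)

# A trivial query proving that a Spark job can run end to end.
spark.range(5).show()
spark.stop()
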
note

If you want to automate the installation of Python dependencies, you can do that using the .vscode/tasks.json file. More details can be found in the following how-to guide.
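
As an illustration, a minimal .vscode/tasks.json could install your Python dependencies whenever the workspace is opened. The structure below follows the standard VS Code tasks schema; the requirements.txt location is an assumption about your project layout:

{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "install python dependencies",
      "type": "shell",
      // Assumes the dependencies are listed in requirements.txt at the workspace root.
      "command": "pip3 install -r requirements.txt",
      "runOptions": {
        // Start this task automatically when the folder is opened.
        "runOn": "folderOpen"
      }
    }
  ]
}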

Notebook base image

To use an IDE as a notebook environment, you can start from the following ide.yaml:

vscode:
  extensions:
    - ms-toolsai.jupyter
    - ms-python.python
note

If you want to automate the installation of Python dependencies, you can do that using the .vscode/tasks.json file. More details can be found in the following how-to guide.

YAML multiline strings

In YAML there are two ways to define multiline strings:

  • | which preserves newlines in the content of the string
  • > which removes newlines and thus creates a single line string even if the content spans multiple lines

The YAML library we use prefers > over |, which is why your input may be reformatted to use > instead of |. This is not a problem as long as you are aware of the difference between the two styles.

The following input:

buildSteps:
  - name: some multiline string
    cmd: |
      sudo apt-get update
      sudo apt-get install -y curl
      curl -X GET https://google.com

is equivalent to:

buildSteps:
  - name: some multiline string
    cmd: >
      sudo apt-get update

      sudo apt-get install -y curl

      curl -X GET https://google.com

The empty lines must be added to preserve the meaning of the original input. If you remove them and use >, the cmd content is folded into a single-line command, which in this case results in an error because apt-get update does not expect additional arguments.
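
If you prefer the folded > style and do not want to rely on blank lines, one option is to chain the commands explicitly with &&, so that folding them onto a single line still produces a valid shell command:

buildSteps:
  - name: some multiline string
    cmd: >
      sudo apt-get update &&
      sudo apt-get install -y curl &&
      curl -X GET https://google.com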