Creating your own base images
Base images enable standardization of IDEs across projects within your organization. They allow you to package a set of tools and configuration that many projects have in common.
Base images are the starting point for an IDE; conceptually, they correspond to the `FROM` statement in a Dockerfile.
A base image is specified in the `ide.yaml` file, which is described in detail on the IDE reference page.
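As a concrete starting point, a minimal `ide.yaml` could look as follows. This is a sketch that only uses the fields appearing in the examples below; the extension and packages are illustrative, so consult the IDE reference page for the full specification:

```yaml
vscode:
  extensions:
    # VS Code extensions to preinstall in the IDE
    - ms-python.python
buildSteps:
  # Build steps run shell commands while the image is being built
  - name: install common tooling
    cmd: |
      sudo apt-get update
      sudo apt-get install -y curl unzip
```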
Every Conveyor installation comes with a default base image, from which every other base image starts and which is managed by the Conveyor team.
We provide three base images to start from, described below: dbt, PySpark, and Notebook.
If you want to create your own base image, you can start from one of these examples and modify it to your needs. Best practices for creating your own base image are described in the best practices section below.
If you are not familiar with how to write multiline strings in YAML, please refer to the YAML multiline strings section.
dbt base image
The goal of this base image is to provide a standard environment for dbt projects. This allows anyone to get started quickly on a dbt project, without having to install all the dependencies themselves.
A good place to start for a dbt `ide.yaml` is as follows:
```yaml
vscode:
  extensions:
    - innoverio.vscode-dbt-power-user
    - dorzey.vscode-sqlfluff
    - mtxr.sqltools
    - mtxr.sqltools-driver-pg
    - koszti.snowflake-driver-for-sqltools
buildSteps:
  - name: install dbt with the standard adapters
    cmd: |
      sudo apt-get update
      sudo apt-get install -y python3-pip
      sudo pip3 install dbt-core==1.9.1 dbt-duckdb==1.9.1 dbt-postgres==1.9.0 dbt-redshift==1.9.0 dbt-snowflake==1.9.0 dbt-trino==1.9.0
```
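If you want the build to fail early when the installation is broken, you could append a verification step along these lines (a hedged sketch; `dbt --version` prints the installed core version and adapter plugins, and a non-zero exit code fails the build):

```yaml
  - name: verify the dbt installation
    cmd: |
      # Fails the build step if dbt or its adapters did not install correctly
      dbt --version
```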
PySpark base image
The goal of this base image is to make sure that the necessary libraries are installed in order for the PySpark environment to work.
A good starting point for the PySpark `ide.yaml` is as follows:
```yaml
vscode:
  extensions:
    - ms-toolsai.jupyter
    - ms-python.python
buildSteps:
  - name: install openjdk
    cmd: |
      echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
      sudo apt-get update && sudo apt-get install -y --no-install-recommends gcc g++ software-properties-common openjdk-11-jre unzip curl
      JAVA_PATH=$(sudo update-alternatives --list java | sed -e "s/\/bin\/java//")
      echo "export JAVA_HOME=$JAVA_PATH" >> ~/.bashrc
  - name: add spark libraries with conveyor specific patches and add them to the python environment
    cmd: |
      curl -X GET https://static.conveyordata.com/spark/spark-3.5.1-hadoop-3.3.6-v1.zip -o spark.zip \
        && sudo unzip ./spark.zip -d /opt \
        && rm ./spark.zip && sudo chmod -R 777 /opt/spark
      echo 'source /opt/spark/sbin/spark-config.sh' >> ~/.bashrc
  - name: set default spark configuration for aws and azure
    cmd: |
      mkdir -p /opt/spark/conf
      cat <<-EOF > /opt/spark/conf/spark-defaults.conf
      spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
      spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain
      spark.kubernetes.pyspark.pythonVersion 3
      spark.hadoop.hive.metastore.client.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
      spark.hadoop.hive.imetastoreclient.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
      spark.eventLog.enabled false
      spark.hadoop.fs.azure.account.auth.type Custom
      spark.hadoop.fs.azure.account.oauth.provider.type cloud.datafy.azure.auth.MsalTokenProvider
      EOF
```
If you want to automate the installation of Python dependencies, you can do so using the `.vscode/tasks.json` file; a sketch follows below. More details can be found in the customizing your IDE how-to page.
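As a sketch of what such a task could look like, assuming your dependencies live in a `requirements.txt` (the label and command are illustrative, not a Conveyor-mandated format):

```jsonc
{
  "version": "2.0.0",
  "tasks": [
    {
      // Runs automatically whenever the workspace folder is opened
      "label": "install python dependencies",
      "type": "shell",
      "command": "pip3 install -r requirements.txt",
      "runOptions": { "runOn": "folderOpen" }
    }
  ]
}
```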
Notebook base image
To use an IDE as a notebook environment, you can start from the following `ide.yaml`:
```yaml
vscode:
  extensions:
    - ms-toolsai.jupyter
    - ms-python.python
```
If you want to automate the installation of Python dependencies, you can again do so using the `.vscode/tasks.json` file, as shown in the PySpark section above. More details can be found in the customizing your IDE how-to page.
Best practices for creating custom base images
When creating your own base image, we recommend starting from one of the templates provided above. If you have a specific use case that is not covered by the templates, you can start from scratch. Even then, it can be useful to look at the templates to see which actions can be performed.
How to set environment variables
The IDE specification does not allow you to set environment variables directly.
If you want to define an environment variable, append it to your `~/.bashrc` file as follows:
```bash
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
```
This ensures that the environment variable is set for every new terminal session.
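Within an `ide.yaml`, this is typically done in a build step. A minimal sketch (the second variable is purely illustrative):

```yaml
buildSteps:
  - name: set environment variables
    cmd: |
      # Appending to ~/.bashrc makes the variables available in every new terminal session
      echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
      echo 'export MY_PROJECT_ENV=dev' >> ~/.bashrc
```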
Improve the feedback cycle using local builds
To get fast feedback while iterating on your base image configuration, you can use the `--local-build` option of the Conveyor CLI as follows:

```bash
conveyor ide build-base-image --name dummy --configuration file://custom-base-image.yaml --local-build
```
YAML multiline strings
In YAML there are two ways to define multiline strings:

- `|`, which preserves the newlines in the content of the string
- `>`, which removes the newlines and thus creates a single-line string even if the content spans multiple lines

The library we use prefers `>` over `|`, which is why your input can be reformatted to use `>` instead of `|`.
This is not a problem as long as you are aware of the difference between the two.
Given the following input:
```yaml
buildSteps:
  - name: some multiline string
    cmd: |
      sudo apt-get update
      sudo apt-get install -y curl
      curl -X GET https://google.com
```
This is equivalent to:
```yaml
buildSteps:
  - name: some multiline string
    cmd: >
      sudo apt-get update

      sudo apt-get install -y curl

      curl -X GET https://google.com
```
The empty lines must be added in order to keep the same meaning as the original input, because in a folded (`>`) block an empty line produces an actual newline.
If you leave out the empty lines and use `>`, the `cmd` content will be executed as a single-line command,
which in this case results in an error because `apt-get update` does not expect additional arguments.
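Concretely, without the empty lines the folded block above is handed to the shell as one long command:

```bash
# The three commands collapse into a single line, so `apt-get update`
# receives `sudo`, `apt-get`, `install`, ... as unexpected extra arguments
sudo apt-get update sudo apt-get install -y curl curl -X GET https://google.com
```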
Adding private resources
Sometimes you might need to add private resources (e.g. blob artifacts, container artifacts, ...) to your base image that are not publicly accessible to Conveyor. For this reason, we allow attaching an IAM identity, which is used to access the private resources while building your base image.
Making this work for your use case requires the following steps:
- Add the necessary permissions to your IAM identity.
- Make sure that your base image build can assume the IAM identity, by adding the correct trust relationship between the build service account and the IAM identity.
- Pass the IAM identity for the base image build through the Conveyor UI or CLI.
- Finally, add the correct buildSteps to your base image to download or copy the private resources.
The temporary credentials passed to the base image build are currently only valid for an hour. If your builds take longer than an hour, you will get errors while accessing these private resources.
Adding an IAM identity to your base image builds
For context on how Conveyor uses IAM identities, see the IAM identity documentation.
Conveyor creates a service account called `conveyor-ide-builder` in the `conveyoridebuilds` namespace, which is used by every build.
To make sure that the build can use an IAM identity, you need to set up the trust relationship between this service account and your IAM identity.
AWS
An example for setting up the trust relationship using Terraform is as follows:
resource "aws_iam_role" "default" {
name = "ide-builder-role-${var.env_name}"
assume_role_policy = data.aws_iam_policy_document.default.json
}
data "aws_iam_policy_document" "default" {
statement {
actions = ["sts:AssumeRoleWithWebIdentity"]
effect = "Allow"
condition {
test = "StringLike"
variable = "${replace(var.aws_iam_openid_connect_provider_url, "https://", "")}:sub"
values = [
"system:serviceaccount:conveyoridebuilds:conveyor-ide-builder"
]
}
principals {
identifiers = [var.aws_iam_openid_connect_provider_arn]
type = "Federated"
}
}
}
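Note that the role above only establishes trust; as mentioned in the first step of the list earlier, you still need to grant it permissions on your private resources. A hedged Terraform sketch granting read access to an artifact bucket (the policy name and bucket ARN are placeholders):

```hcl
resource "aws_iam_role_policy" "ide_builder_artifacts" {
  name = "ide-builder-artifacts-read"
  role = aws_iam_role.default.id

  # Allow the base image build to download objects from a private artifact bucket
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = ["arn:aws:s3:::my-private-artifacts/*"]
    }]
  })
}
```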
Azure
An example for setting up the trust relationship using Terraform is as follows:
resource "azuread_application" "ide-builder" {
display_name = "ide-builder"
}
resource "azuread_application_federated_identity_credential" "ide-builder" {
application_id = azuread_application.ide-builder.id
display_name = "kubernetes-federated-identity-ide-builder"
audiences = ["api://AzureADTokenExchange"]
issuer = var.oidc_issuer_url
subject = "system:serviceaccount:conveyoridebuilds:conveyor-ide-builder"
}
resource "azuread_service_principal" "ide-builder" {
client_id = azuread_application.ide-builder.client_id
app_role_assignment_required = false
}
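As with AWS, the federated credential above only covers the trust relationship. A hedged sketch of additionally granting the service principal read access to blob storage (the `var.storage_account_id` scope is a placeholder for your storage account's resource ID):

```hcl
resource "azurerm_role_assignment" "ide-builder-blob-reader" {
  # Let the ide-builder identity read blobs in the artifact storage account
  scope                = var.storage_account_id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = azuread_service_principal.ide-builder.object_id
}
```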
Add a build step that uses private resources
As an example, we will create a build step that copies an object from blob storage to the base image. As the object is private, we need to use the IAM identity that we created in the previous step.
AWS
An example for copying an object from S3 to the base image is as follows:
```yaml
buildSteps:
  - name: copy object from s3
    cmd: |
      aws s3 cp s3://<s3-path-to-object> <local-file-name>
```
The AWS CLI is installed by default in the base image, which is why we don't need to install it in the build step. Furthermore, the AWS CLI automatically uses the IAM identity based on the environment variables exposed by Conveyor, so there is no need to log in explicitly.
Azure
An example for copying an object from blob storage to the base image is as follows:
```yaml
buildSteps:
  - name: install azure CLI in the base image
    cmd: |
      sudo curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
  - name: login using azure workload identity and download blob from azure blob storage
    cmd: >
      az login --federated-token "$(cat $AZURE_FEDERATED_TOKEN_FILE)"
      --service-principal -u $AZURE_CLIENT_ID -t $AZURE_TENANT_ID && az storage
      blob download -c CONTAINER_NAME -n <azure-blob-file-path>
      --account-name STORAGE_ACCOUNT -f <local-file-name>
```
The Azure CLI is not installed by default in the base image, so we need to install it in a build step. The Azure CLI does not automatically log in using workload identity, which is why we do so explicitly before accessing blob storage.