Creating your own base images

Base images are created to enable standardization of IDEs across projects within your organization. They allow an organization to package a set of tools and configuration that is common to many of its projects. A base image is the starting point for an IDE; conceptually, it plays the role of the FROM statement in a Dockerfile.

The specification of a base image is done in the ide.yaml file, which is described in detail on the IDE reference page.
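At a high level, an ide.yaml combines a list of VS Code extensions with a series of build steps that run shell commands while the base image is built. A minimal sketch, using only the fields that appear in the templates below (the extension and the commands are placeholders):

vscode:
  extensions:
    - ms-python.python
buildSteps:
  - name: install common tooling
    cmd: |
      sudo apt-get update
      sudo apt-get install -y curl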

Every Conveyor installation comes with a default base image, from which every other base image starts and which is managed by the Conveyor team.

We provide three base image templates to start with: dbt, PySpark, and Notebook, each described in its own section below.

If you want to create your own base image, you can start from one of these examples and modify it to your needs. Best practices for creating your own base image are described in the best practices section below.

If you are not familiar with how to write multiline strings in YAML, please refer to the YAML multiline strings section.

dbt base image

The goal of this base image is to provide a standard environment for dbt projects. This allows anyone to get started quickly on a dbt project without having to install all the dependencies themselves.

A good place to start for a dbt ide.yaml is as follows:

vscode:
  extensions:
    - innoverio.vscode-dbt-power-user
    - sqlfluff.vscode-sqlfluff
    - mtxr.sqltools
    - mtxr.sqltools-driver-pg
    - koszti.snowflake-driver-for-sqltools
buildSteps:
  - name: install dbt with the standard adapters
    cmd: |
      sudo apt-get update
      sudo apt-get install -y python3-pip
      sudo pip3 install dbt-core==1.7.8 dbt-duckdb==1.7.1 dbt-postgres==1.7.8 dbt-redshift==1.7.3 dbt-snowflake==1.7.2 dbt-trino==1.7.1
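If you want to verify the installation during the image build, one possible addition (an assumption, not part of the template) is an extra build step that prints the installed dbt version:

buildSteps:
  - name: verify the dbt installation
    cmd: |
      dbt --version

dbt --version also lists the installed adapter plugins, which makes it easy to confirm the pinned versions above.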

PySpark base image

The goal of this base image is to make sure that the necessary libraries are installed in order for the PySpark environment to work.

A good starting point for the pyspark ide.yaml is as follows:

vscode:
  extensions:
    - ms-toolsai.jupyter
    - ms-python.python
buildSteps:
  - name: install openjdk
    cmd: |
      echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
      sudo apt-get update && sudo apt-get install -y --no-install-recommends gcc g++ software-properties-common openjdk-11-jre unzip curl
      JAVA_PATH=$(sudo update-alternatives --list java | sed -e "s/\/bin\/java//")
      echo "export JAVA_HOME=$JAVA_PATH" >> ~/.bashrc
  - name: add spark libraries with conveyor specific patches and add them to the python environment
    cmd: |
      curl -X GET https://static.conveyordata.com/spark/spark-3.5.1-hadoop-3.3.6-v1.zip -o spark.zip \
        && sudo unzip ./spark.zip -d /opt \
        && rm ./spark.zip && sudo chmod -R 777 /opt/spark
      echo 'source /opt/spark/sbin/spark-config.sh' >> ~/.bashrc
  - name: set default spark configuration for aws and azure
    cmd: |
      mkdir -p /opt/spark/conf
      cat <<-EOF > /opt/spark/conf/spark-defaults.conf
      spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
      spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain
      spark.kubernetes.pyspark.pythonVersion 3
      spark.hadoop.hive.metastore.client.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
      spark.hadoop.hive.imetastoreclient.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
      spark.eventLog.enabled false
      spark.hadoop.fs.azure.account.auth.type Custom
      spark.hadoop.fs.azure.account.oauth.provider.type cloud.datafy.azure.auth.MsalTokenProvider
      EOF
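To sanity-check the result, you can open a terminal in the resulting IDE and inspect the Spark installation. A purely illustrative sketch (the exact version output depends on the archive you installed):

source ~/.bashrc
echo "$SPARK_HOME"                       # expected: /opt/spark
"$SPARK_HOME"/bin/spark-submit --version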
note

If you want to automate the installation of Python dependencies, you can do so using the .vscode/tasks.json file. More details can be found in the customizing your IDE how-to page.
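As an illustration only (the how-to is the authoritative reference, and the file name requirements.txt is an assumption), a .vscode/tasks.json along these lines installs the dependencies every time the workspace is opened:

{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "install python dependencies",
      "type": "shell",
      // requirements.txt is assumed to exist in the workspace root
      "command": "pip3 install --user -r requirements.txt",
      "runOptions": {
        "runOn": "folderOpen"
      }
    }
  ]
}

Note that VS Code only runs folderOpen tasks automatically if automatic tasks are allowed for the workspace.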

Notebook base image

In order to use an IDE as a notebook environment, you can start from the following ide.yaml:

vscode:
  extensions:
    - ms-toolsai.jupyter
    - ms-python.python
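The Jupyter extension needs a Python kernel to run notebook cells against. If your default base image does not already ship one, a possible addition (the package choice is an assumption, not part of the template) is:

buildSteps:
  - name: install a python kernel for notebooks
    cmd: |
      sudo apt-get update
      sudo apt-get install -y python3-pip
      sudo pip3 install ipykernel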
note

If you want to automate the installation of Python dependencies, you can do so using the .vscode/tasks.json file. More details about this can be found in the customizing your IDE how-to page.

Best practices for creating custom base images

When creating your own base image, it is recommended to start from one of the templates provided above. If you have a specific use case that is not covered by the templates, you can start from scratch. Even then, it can be useful to look at the templates to see which actions can be performed.

How to set environment variables

The IDE specification does not allow you to set environment variables directly. If you want to define an environment variable, you should write it to your ~/.bashrc file as follows:

echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc

This ensures that the environment variable is set for every new terminal session.
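Putting this together with the buildSteps format used above, a build step that sets SPARK_HOME for every shell could look like this:

buildSteps:
  - name: set SPARK_HOME for every shell session
    cmd: |
      echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc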

Improve the feedback cycle using local builds

In order to get fast feedback while iterating over your base image configuration, you can use the local-build option in Conveyor as follows:

conveyor ide build-base-image --name dummy --configuration file://custom-base-image.yaml --local-build

YAML multiline strings

In YAML there are two ways to define multiline strings:

  • |, which preserves newlines in the content of the string
  • >, which folds newlines into spaces and thus creates a single-line string even if the content spans multiple lines (an empty line is folded into a single newline)

The library we use prefers > over |, which is why your input can be reformatted to use > instead of |. This is not a problem as long as you are aware of the difference between the two.

Given the following input:

buildSteps:
  - name: some multiline string
    cmd: |
      sudo apt-get update
      sudo apt-get install -y curl
      curl -X GET https://google.com

This is equivalent to:

buildSteps:
  - name: some multiline string
    cmd: >
      sudo apt-get update

      sudo apt-get install -y curl

      curl -X GET https://google.com

The empty lines are required to keep the same meaning as the original input. If you remove them while using >, the cmd content is folded into a single-line command, which in this case results in an error because apt-get update does not expect additional arguments.
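Conversely, > is perfectly fine when the content really is a single long command that is only wrapped for readability, for example:

buildSteps:
  - name: install dbt adapters
    cmd: >
      sudo pip3 install
      dbt-core==1.7.8
      dbt-postgres==1.7.8
      dbt-snowflake==1.7.2

After folding, this is executed as the single command sudo pip3 install dbt-core==1.7.8 dbt-postgres==1.7.8 dbt-snowflake==1.7.2.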