
Creating your own base images

caution

Base images are currently in preview; we reserve the right to make breaking changes if we need to. However, we will do our best to keep things backwards compatible.

Base images enable standardization of IDEs across projects within your organization. They allow an organization to package a set of tools and configuration that is common to many of its projects. A base image is the starting point for an IDE and is essentially the FROM statement in a Dockerfile.

The specification of a base image is done in the ide.yaml file, which is described in detail on the ide page.

Every Conveyor installation comes with a default base image, managed by the Conveyor team, from which every other base image starts. This how-to guide helps you create your own base images for specific use cases, which you can make available to all teams, similar to the default base image.

We provide three example base images to start with: a dbt base image, a PySpark base image, and a notebook base image.

If you are not familiar with how to write multiline strings in YAML, please refer to the YAML multiline strings section.

dbt base image

The goal of this base image is to provide a standard environment for dbt projects. This allows anyone to get started quickly on a dbt project without having to install all the dependencies themselves.

A good place to start for a dbt ide.yaml is as follows:

vscode:
  extensions:
    - innoverio.vscode-dbt-power-user
    - dorzey.vscode-sqlfluff
    - mtxr.sqltools
    - mtxr.sqltools-driver-pg
    - RandomFractalsInc.duckdb-sql-tools
    - koszti.snowflake-driver-for-sqltools
    - kj.sqltools-driver-redshift
    - regadas.sqltools-trino-driver
buildSteps:
  - name: install dbt with the standard adapters
    cmd: |
      sudo apt-get update
      sudo apt-get install -y python3-pip
      sudo pip3 install dbt-core==1.7.8 dbt-duckdb==1.7.1 dbt-postgres==1.7.8 dbt-redshift==1.7.3 dbt-snowflake==1.7.2 dbt-trino==1.7.1
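
To check that the build step worked, you can open a terminal in an IDE based on this image and ask dbt for its version; dbt debug additionally validates your profile and connection, assuming your project already contains one. This is only an illustrative check, not part of the base image itself:

dbt --version   # lists dbt-core and the installed adapter plugins
dbt debug       # validates profiles.yml and the adapter connection for the current project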

PySpark base image

The goal of this base image is to make sure that the necessary libraries are installed for the PySpark environment to work.

A good starting point for the PySpark ide.yaml is as follows:

vscode:
  extensions:
    - ms-toolsai.jupyter
    - ms-python.python
buildSteps:
  - name: install openjdk
    cmd: |
      echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
      sudo apt-get update && sudo apt-get install -y --no-install-recommends gcc g++ software-properties-common openjdk-11-jre unzip curl
  - name: add spark libraries with conveyor specific patches and add them to the python environment
    cmd: |
      curl -X GET https://static.conveyordata.com/spark/spark-3.5.1-hadoop-3.3.6-v1.zip -o spark.zip && sudo unzip ./spark.zip -d /opt && rm ./spark.zip && sudo chmod -R 777 /opt/spark
      echo 'source /opt/spark/sbin/spark-config.sh' >> ~/.bashrc
  - name: set default spark configuration for aws and azure
    cmd: |
      mkdir -p /opt/spark/conf
      cat <<-EOF > /opt/spark/conf/spark-defaults.conf
      spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
      spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain
      spark.kubernetes.pyspark.pythonVersion 3
      spark.hadoop.hive.metastore.client.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
      spark.hadoop.hive.imetastoreclient.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
      spark.eventLog.enabled false
      spark.hadoop.fs.azure.account.auth.type Custom
      spark.hadoop.fs.azure.account.oauth.provider.type cloud.datafy.azure.auth.MsalTokenProvider
      EOF
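
Once an IDE is started from this image, a short PySpark snippet is a convenient smoke test to confirm that Spark, the Java runtime, and the Python bindings are wired up correctly. This is only an illustrative check in local mode, not part of the base image itself:

from pyspark.sql import SparkSession

# Run locally; the cluster- and cloud-specific settings from spark-defaults.conf
# are not needed for this check.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("base-image-smoke-test")
    .getOrCreate()
)

# A trivial query proving that a Spark job can run end to end.
spark.range(5).show()
spark.stop()
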
note

If you want to automate the installation of Python dependencies, you can do that using the .vscode/tasks.json file. More details can be found in the following how-to guide.
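
As an illustration, a minimal .vscode/tasks.json could install your Python dependencies whenever the workspace is opened. The structure below follows the standard VS Code tasks schema; the requirements.txt location is an assumption about your project layout:

{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "install python dependencies",
      "type": "shell",
      // Assumes the dependencies are listed in requirements.txt at the workspace root.
      "command": "pip3 install -r requirements.txt",
      "runOptions": {
        // Start this task automatically when the folder is opened.
        "runOn": "folderOpen"
      }
    }
  ]
}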

Notebook base image

To use an IDE as a notebook environment, you can start from the following ide.yaml:

vscode:
  extensions:
    - ms-toolsai.jupyter
    - ms-python.python
note

If you want to automate the installation of Python dependencies, you can do that using the .vscode/tasks.json file. More details can be found in the following how-to guide.

YAML multiline strings

In YAML there are two ways to define multiline strings:

  • | which preserves newlines in the content of the string
  • > which removes newlines and thus creates a single line string even if the content spans multiple lines

The YAML library we use prefers > over |, which is why your input may be reformatted to use > instead of |. This is not a problem as long as you are aware of the difference between the two styles.

The following input:

buildSteps:
  - name: some multiline string
    cmd: |
      sudo apt-get update
      sudo apt-get install -y curl
      curl -X GET https://google.com

is equivalent to:

buildSteps:
  - name: some multiline string
    cmd: >
      sudo apt-get update

      sudo apt-get install -y curl

      curl -X GET https://google.com

The empty lines must be added to preserve the meaning of the original input. If you remove them and use >, the cmd content is folded into a single-line command, which in this case results in an error because apt-get update does not expect additional arguments.
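
If you prefer the folded > style and do not want to rely on blank lines, one option is to chain the commands explicitly with &&, so that folding them onto a single line still produces a valid shell command:

buildSteps:
  - name: some multiline string
    cmd: >
      sudo apt-get update &&
      sudo apt-get install -y curl &&
      curl -X GET https://google.com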