Creating your own base images
Base images are currently in preview: we reserve the right to make breaking changes if we need to, but we will do our best to keep things backwards compatible.
Base images enable standardization of IDEs across projects within your organization. They allow organizations to package a set of tools and configuration that is common to many projects. A base image is the starting point for an IDE, essentially forming the FROM statement in a Dockerfile. A base image is specified in an `ide.yaml` file, which is described in detail on the ide page.
Every Conveyor installation comes with a default base image, which is managed by the Conveyor team and from which every other base image starts. This how-to guide helps you create your own base images for specific use cases, which you can then make available to all teams, just like the default base image.
We provide three base images to start with:
If you are not familiar with how to write multiline strings in YAML, please refer to the YAML multiline strings section.
dbt base image
The goal of this base image is to provide a standard environment for dbt projects. This allows anyone to get started quickly on a dbt project, without having to install all the dependencies themselves.
A good starting point for a dbt `ide.yaml` is as follows:
```yaml
vscode:
  extensions:
    - innoverio.vscode-dbt-power-user
    - dorzey.vscode-sqlfluff
    - mtxr.sqltools
    - mtxr.sqltools-driver-pg
    - RandomFractalsInc.duckdb-sql-tools
    - koszti.snowflake-driver-for-sqltools
    - kj.sqltools-driver-redshift
    - regadas.sqltools-trino-driver
buildSteps:
  - name: install dbt with the standard adapters
    cmd: |
      sudo apt-get update
      sudo apt-get install -y python3-pip
      sudo pip3 install dbt-core==1.7.8 dbt-duckdb==1.7.1 dbt-postgres==1.7.8 dbt-redshift==1.7.3 dbt-snowflake==1.7.2 dbt-trino==1.7.1
```
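If you maintain several base images, it can help to sanity-check an `ide.yaml` before publishing it. The sketch below (an illustration, assuming PyYAML, which Conveyor itself does not require) parses a trimmed-down version of the dbt `ide.yaml` above and verifies that every build step has both a `name` and a `cmd`:

```python
import yaml  # PyYAML: pip install pyyaml

# A trimmed-down version of the dbt ide.yaml above.
IDE_YAML = """\
vscode:
  extensions:
    - innoverio.vscode-dbt-power-user
    - dorzey.vscode-sqlfluff
buildSteps:
  - name: install dbt with the standard adapters
    cmd: |
      sudo apt-get update
      sudo pip3 install dbt-core==1.7.8
"""

spec = yaml.safe_load(IDE_YAML)

# Every build step needs both a name and a cmd.
for step in spec["buildSteps"]:
    assert {"name", "cmd"} <= step.keys(), f"incomplete build step: {step}"

# The literal block scalar (|) keeps each shell command on its own line.
cmd_lines = spec["buildSteps"][0]["cmd"].splitlines()
assert len(cmd_lines) == 2
```

This kind of check catches the most common mistake in practice: a `cmd` that was silently folded onto a single line, as explained in the YAML multiline strings section.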
PySpark base image
The goal of this base image is to make sure that the necessary libraries are installed for the PySpark environment to work.
A good starting point for the PySpark `ide.yaml` is as follows:
```yaml
vscode:
  extensions:
    - ms-toolsai.jupyter
    - ms-python.python
buildSteps:
  - name: install openjdk
    cmd: |
      echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
      sudo apt-get update && sudo apt-get install -y --no-install-recommends gcc g++ software-properties-common openjdk-11-jre unzip curl
  - name: add spark libraries with conveyor specific patches and add them to the python environment
    cmd: |
      curl -X GET https://static.conveyordata.com/spark/spark-3.5.1-hadoop-3.3.6-v1.zip -o spark.zip && sudo unzip ./spark.zip -d /opt && rm ./spark.zip && sudo chmod -R 777 /opt/spark
      echo 'source /opt/spark/sbin/spark-config.sh' >> ~/.bashrc
  - name: set default spark configuration for aws and azure
    cmd: |
      mkdir -p /opt/spark/conf
      cat <<-EOF > /opt/spark/conf/spark-defaults.conf
      spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
      spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain
      spark.kubernetes.pyspark.pythonVersion 3
      spark.hadoop.hive.metastore.client.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
      spark.hadoop.hive.imetastoreclient.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
      spark.eventLog.enabled false
      spark.hadoop.fs.azure.account.auth.type Custom
      spark.hadoop.fs.azure.account.oauth.provider.type cloud.datafy.azure.auth.MsalTokenProvider
      EOF
```
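The `spark-defaults.conf` file written by the last build step uses Spark's standard format: one `key value` pair per line, separated by whitespace, with `#` starting a comment. If you want to inspect or debug the configuration a base image ships, a minimal parser (an illustrative sketch, not something Conveyor provides) looks like this:

```python
# spark-defaults.conf format: one "key value" pair per line, whitespace
# separated; lines starting with '#' are comments.
SPARK_DEFAULTS = """\
# defaults written by the base image
spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.kubernetes.pyspark.pythonVersion 3
spark.eventLog.enabled false
"""

def parse_spark_defaults(text: str) -> dict:
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition(" ")
        conf[key] = value.strip()
    return conf

conf = parse_spark_defaults(SPARK_DEFAULTS)
assert conf["spark.eventLog.enabled"] == "false"
```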
If you want to automate the installation of Python dependencies, you can do so using the `.vscode/tasks.json` file. More details can be found in the following how-to guide.
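As an illustration, a hypothetical `.vscode/tasks.json` that installs dependencies whenever the folder is opened might look like the following. The task label and the use of a `requirements.txt` file are assumptions for this sketch; see the linked how-to guide for the exact setup Conveyor expects.

```json
{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "install python dependencies",
      "type": "shell",
      "command": "pip install -r requirements.txt",
      "runOptions": { "runOn": "folderOpen" }
    }
  ]
}
```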
Notebook base image
To use an IDE as a notebook environment, you can start from the following `ide.yaml`:
```yaml
vscode:
  extensions:
    - ms-toolsai.jupyter
    - ms-python.python
```
If you want to automate the installation of Python dependencies, you can do so using the `.vscode/tasks.json` file. More details can be found in the following how-to guide.
YAML multiline strings
In YAML there are two ways to define multiline strings:

- `|`, which preserves the newlines in the content of the string
- `>`, which removes the newlines and thus creates a single-line string even if the content spans multiple lines

The library we use prefers `>` over `|`, which is why your input can be reformatted to use `>` instead of `|`. This is not a problem as long as you are aware of the difference between the two.
Given the following input:

```yaml
buildSteps:
  - name: some multiline string
    cmd: |
      sudo apt-get update
      sudo apt-get install -y curl
      curl -X GET https://google.com
```
This is equivalent to:

```yaml
buildSteps:
  - name: some multiline string
    cmd: >
      sudo apt-get update

      sudo apt-get install -y curl

      curl -X GET https://google.com
```
The empty lines must be present in order to have the same meaning as the original input. If you use `>` without the empty lines, the cmd content will be executed as a single-line command, which in this case results in an error because `apt-get update` does not expect additional arguments.
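The difference between the two styles can be verified directly, for instance with PyYAML (an illustrative check, using a shortened version of the commands above):

```python
import yaml  # PyYAML: pip install pyyaml

# Literal style (|): newlines are preserved.
literal = yaml.safe_load(
    "cmd: |\n"
    "  sudo apt-get update\n"
    "  sudo apt-get install -y curl\n"
)["cmd"]

# Folded style (>) without empty lines: everything folds onto one line.
folded = yaml.safe_load(
    "cmd: >\n"
    "  sudo apt-get update\n"
    "  sudo apt-get install -y curl\n"
)["cmd"]

# Folded style (>) with empty lines between commands: newlines come back.
folded_with_blanks = yaml.safe_load(
    "cmd: >\n"
    "  sudo apt-get update\n"
    "\n"
    "  sudo apt-get install -y curl\n"
)["cmd"]

assert literal == "sudo apt-get update\nsudo apt-get install -y curl\n"
assert folded == "sudo apt-get update sudo apt-get install -y curl\n"
assert folded_with_blanks == literal
```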