
Using multiple containers

Description

You might run into this pattern when you have some custom logic for ingesting your data. For example, you use a Python script to fetch files from an FTP server into S3, but your cleaning logic is written in Spark.

In this case, suppose you have two code bases in the same project:

  • A Python script to fetch data from the FTP into S3
  • A Spark/PySpark job to clean the data

Adding your script to the existing Docker container can make things more complex: it makes testing more difficult and inflates your Docker image with extra dependencies. If working with a single Docker container becomes cumbersome, you can create two separate Conveyor projects for your use case as follows:

project
├── spark
│   ├── .conveyor/project.yaml
│   └── Dockerfile
└── python
    ├── .conveyor/project.yaml
    └── Dockerfile

This way we can build two Docker containers and give the projects different release cycles (or the same one, if you always build and release them together). Finally, we still need a single DAG that links both Conveyor projects, which we can do by constructing an Airflow DAG with a templated image name. The result looks as follows:

from conveyor.operators import ConveyorContainerOperatorV2, ConveyorSparkSubmitOperatorV2

ConveyorContainerOperatorV2(
    dag=dag,
    task_id="ingest-python",
    image="{{ macros.conveyor.image('python') }}",
)

ConveyorSparkSubmitOperatorV2(
    dag=dag,
    task_id="clean-spark",
    num_executors="1",
    conn_id="spark-k8s",
    env_vars={"AWS_REGION": "eu-west-1"},
    conf={
        "spark.kubernetes.container.image": "{{ macros.conveyor.image('spark') }}",
    },
)

The DAG is simplified for this example; for the full list of operator options, check the Airflow operators page. It shows how you can use different Docker images in the same DAG. As a consequence, you should always deploy both projects to an environment when you want to test the Airflow DAG.
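To make the simplification explicit, a minimal but complete version of such a DAG file could look like the sketch below. The dag_id, schedule, start_date, catchup flag, and the ingest >> clean ordering are illustrative assumptions added for this sketch; the operator arguments are taken from the example above as-is.

from datetime import datetime

from airflow import DAG
from conveyor.operators import ConveyorContainerOperatorV2, ConveyorSparkSubmitOperatorV2

# Placeholder DAG settings: adjust the id, schedule, and start date to your project.
dag = DAG(
    dag_id="ftp-ingest-and-clean",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)

# Runs the image built from the python project.
ingest = ConveyorContainerOperatorV2(
    dag=dag,
    task_id="ingest-python",
    image="{{ macros.conveyor.image('python') }}",
)

# Runs the Spark job using the image built from the spark project.
clean = ConveyorSparkSubmitOperatorV2(
    dag=dag,
    task_id="clean-spark",
    num_executors="1",
    conn_id="spark-k8s",
    env_vars={"AWS_REGION": "eu-west-1"},
    conf={
        "spark.kubernetes.container.image": "{{ macros.conveyor.image('spark') }}",
    },
)

# Assumed ordering: the ingestion has to finish before the cleaning job starts.
ingest >> clean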

You can put the DAG in either of the two projects; here we add it to the Spark project:

project
├── spark
│   ├── .conveyor/project.yaml
│   ├── Dockerfile
│   └── dags
│       └── dag.py
└── python
    ├── .conveyor/project.yaml
    └── Dockerfile

This means that the DAG's release cycle is coupled to the Spark project, since that project now schedules both the ingestion and the cleaning task. If you need to change the DAG because the way the Python project is called changes, you need to deploy both the Python project and the Spark project.