Best practices

Here we provide some patterns to help you get the most out of your packages, both by streamlining your development and by making your packages easier for end users to consume.

Include a sample project

While developing a package, it is often useful to have a sample project next to it to test out its functionality. This sample project can also serve as a form of documentation for your users once your package is deployed.

Since projects and packages use separate configuration files (project.yaml and package.yaml, respectively), it's possible to have a project and a package co-exist in the same directory. This setup could look as follows.

├── .conveyor
│   ├── package.yaml
│   └── project.yaml
├── dags
│   └── sample.py
├── pkgs
│   └── my_package.py
├── src
│   └── ...
├── Dockerfile
└── ...

Package functionality defined in the /pkgs folder can be imported immediately by Airflow code stored in the /dags folder. This allows for very rapid iteration during package development, as it enables the following flow, illustrated by the sketch after the list.

  1. Work on the functions exposed by your package (living in the /pkgs folder).
  2. Publish your modified functionality through conveyor package trial.
  3. Run the imported task (defined in /dags) through conveyor project run without changing directory.
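
As an illustration, a DAG in the /dags folder could load the package under development as in the minimal sketch below. The package name my_package, the function my_function, and the version string are illustrative assumptions; the packages.load call is the same mechanism shown further down this page, and the surrounding DAG boilerplate is plain Airflow.

dags/sample.py
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from conveyor import packages

# Load the package functionality that lives in the /pkgs folder
my_package = packages.load("my_package", version="1.0.0")


def run_sample():
    # Exercise the functionality exposed by the package
    my_package.my_function()


with DAG(dag_id="sample", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    PythonOperator(task_id="run_sample", python_callable=run_sample)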

Excluding your Dockerfile

There is one optimisation you can apply to this pattern when your package also exposes a container image. By default, both the package.yaml and project.yaml look at the root level for a Dockerfile. This means that both your package and project will build the same Dockerfile in such a setup.

To avoid this, you can modify the project.yaml to look for a Dockerfile in a nonexistent directory. This will prevent Conveyor from creating a Docker image for your builds, greatly reducing your build time. (Note that your DAGs will still be included, which is usually what you want.)

The modification applied to your project.yaml could look like the following example.

project.yaml
docker:
  path: ./noload

Exposing your modules

As your packages grow and gain more functionality, you will typically start organising your code into more structured modules. Your /pkgs folder could, for example, look like the following.

├── pkgs
│   ├── datalake.py
│   ├── operators.py
│   └── storage_utils.py
└── ...

In order to import these three modules into your DAG, you would write import statements like:

from conveyor import packages

operators = packages.load("my_package.operators", version="1.0.0")
datalake = packages.load("my_package.datalake", version="1.0.0")
storage_utils = packages.load("my_package.storage_utils", version="1.0.0")

# Example usage
task = operators.MyOperator()
result = datalake.my_function()
bucket = storage_utils.get_bucket("dev")

However, Python also allows you to expose modules at the root level of your package by declaring them in an __init__.py file. To export these three modules, you can use the following __init__.py.

__init__.py
from . import datalake, operators, storage_utils

__all__ = ["datalake", "operators", "storage_utils"]
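
Since the relative imports in this __init__.py refer to the three modules, the file lives at the root of the /pkgs folder, right next to them:

├── pkgs
│   ├── __init__.py
│   ├── datalake.py
│   ├── operators.py
│   └── storage_utils.py
└── ...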

This allows you to shorten the import statements of the example above to:

from conveyor import packages

my_package = packages.load("my_package", version="1.0.0")

# Example usage
task = my_package.operators.MyOperator()
result = my_package.datalake.my_function()
bucket = my_package.storage_utils.get_bucket("dev")

Both ways of importing are roughly equivalent and can be freely mixed. By including the __init__.py file, however, you offer your developers the same ergonomics they expect from modern Python libraries.