Best practices
Here we provide some patterns to help you get the most out of your packages, both by streamlining your development and by making your packages easier for end users to consume.
Include a sample project
While developing a package, it is often useful to have a sample project next to it to test out the functionality of your package. This sample project can also serve as a form of documentation for your users once your package is deployed.
Since projects and packages use different configuration files (project.yaml and package.yaml respectively), a project and a package can co-exist in the same directory. This setup could look as follows.
├── .conveyor
│ ├── package.yaml
│ └── project.yaml
├── dags
│ └── sample.py
├── pkgs
│ └── my_package.py
├── src
│ └── ...
├── Dockerfile
└── ...
Package functionality defined in the /pkgs folder can be immediately imported by Airflow code stored in the /dags folder. This allows for very rapid iteration during package development, as it enables the following flow.
- Work on the functions exposed by your package (living in the /pkgs folder).
- Publish your modified functionality through conveyor package trial.
- Run the task that imports it (defined in /dags) through conveyor project run without changing directory, as sketched below.
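Putting this flow together, a minimal dags/sample.py could look like the sketch below. Only conveyor.packages.load is taken from the examples later on this page; the package name, the version string, and the say_hello helper it calls are assumptions made for illustration.
# dags/sample.py - minimal sketch; the package name, version string,
# and say_hello helper are illustrative, not part of the Conveyor API.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from conveyor import packages

# Load the functionality published with `conveyor package trial`.
my_package = packages.load("my_package", version="1.0.0")

with DAG(
    dag_id="sample",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    # Run a function exposed by pkgs/my_package.py (hypothetical helper).
    say_hello = PythonOperator(
        task_id="say_hello",
        python_callable=my_package.say_hello,
    )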
Excluding your Dockerfile
There is one optimisation you can apply to this pattern when your package also exposes a container image.
By default, both the package.yaml and the project.yaml look for a Dockerfile at the root level, which means that in this setup your package and your project will build the same Dockerfile. To avoid this, you can modify the project.yaml to look for a Dockerfile in a nonexistent directory. This prevents Conveyor from creating a Docker image for your project builds, greatly reducing your build time.
(Note that your DAGs will still be included, which is usually what you want.)
The modification applied to your project.yaml
could look like the following example.
docker:
  path: ./noload
Exposing your modules
As your packages grow and contain more functionality, you typically start organising your code into more structured modules. Your /pkgs folder could, for example, look like the following.
├── pkgs
│ ├── datalake.py
│ ├── operators.py
│ └── storage_utils.py
└── ...
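As a point of reference, one of these modules, say storage_utils.py, could contain helpers along the following lines. The get_bucket function and the environment-to-bucket mapping are purely illustrative assumptions.
# pkgs/storage_utils.py - illustrative sketch; get_bucket and the bucket
# names below are assumptions, not part of any Conveyor API.
_BUCKETS = {
    "dev": "my-company-datalake-dev",
    "prd": "my-company-datalake-prd",
}


def get_bucket(environment: str) -> str:
    """Return the datalake bucket name for the given environment."""
    try:
        return _BUCKETS[environment]
    except KeyError:
        raise ValueError(f"Unknown environment: {environment!r}") from None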
In order to import these three modules into your DAG, you would write import statements like:
from conveyor import packages
operators = packages.load("my_package.operators", version="1.0.0")
datalake = packages.load("my_package.datalake", version="1.0.0")
storage_utils = packages.load("my_package.storage_utils", version="1.0.0")
# Example usage
task = operators.MyOperator()
result = datalake.my_function()
bucket = storage_utils.get_bucket("dev")
However, Python also allows you to expose modules at the root level of your package by declaring them in an __init__.py file. To export these three modules, you can use the following __init__.py.
from . import datalake, operators, storage_utils
__all__ = ["datalake", "operators", "storage_utils"]
This allows you to shorten the import statements of the example above to:
from conveyor import packages
my_package = packages.load("my_package", version="1.0.0")
# Example usage
task = my_package.operators.MyOperator()
result = my_package.datalake.my_function()
bucket = my_package.storage_utils.get_bucket("dev")
Both ways of importing are roughly equivalent and can be freely mixed.
By including the __init__.py file, you offer your developers the same ergonomics they expect from modern Python libraries.