Using a YAML file to generate tasks in a DAG

In certain instances it is convenient to create a YAML file that drives task creation in your DAG.

A typical example is an ingest flow, where you use the YAML file to list all the tables to ingest in an easy-to-read manner instead of defining them in the DAG code itself.
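For example, such a file could look like the following (the table names and keys are purely illustrative):

tables:
  - name: customers
  - name: orders
  - name: invoices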

However, to make this work you need to read the YAML file in your DAG code, and for that you need to know its location. In this guide we show a robust way to do this.

First, we assume the following structure for the dags folder in your project:

dags
├── dag.py
└── ingest_tables.yaml

In this example, our dag.py needs to read in ingest_tables.yaml. A naïve way to read the file would be the following:

import yaml

def read_tables():
    # The relative path is resolved against the current working directory.
    with open("ingest_tables.yaml") as f:
        return yaml.safe_load(f)

However, this will fail: the code depends on the working directory of the process that loads your DAGs. Since you do not have any direct control over that directory, it's best not to depend on it.
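To illustrate, the relative path is resolved against whatever os.getcwd() returns for the scheduler or worker process, not against the directory containing your DAG file. The path in the comment below is purely hypothetical:

import os

# Whatever directory the scheduler/worker was started from,
# e.g. "/usr/local/airflow" (hypothetical), not your dags folder.
print(os.getcwd())

# The lookup therefore happens in that working directory and will
# typically raise FileNotFoundError:
# open("ingest_tables.yaml")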

A more robust way to implement this is the following:

import yaml
import os

def read_tables():
    # Build an absolute path to the YAML file based on the location of this DAG file.
    with open(os.path.join(os.path.dirname(os.path.realpath(__file__)), "ingest_tables.yaml")) as f:
        return yaml.safe_load(f)

Now let's unpack what this does:

  • os.path.realpath(__file__) gives you the real path of the current Python file, in this case your DAG file
  • os.path.dirname(...) gives you the directory containing your DAG file, which is a stable location
  • os.path.join(dir, "ingest_tables.yaml") joins the directory and the file name; this makes sure the correct path separator is used for the OS the Python code runs on, since path separators differ between Windows and Linux/macOS (see the worked example below)
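Putting the three calls together, and assuming (hypothetically) that the DAG file lives at /opt/airflow/dags/dag.py, the path is built up as follows:

import os

# Hypothetical location of the DAG file: /opt/airflow/dags/dag.py
os.path.realpath(__file__)                               # "/opt/airflow/dags/dag.py"
os.path.dirname("/opt/airflow/dags/dag.py")              # "/opt/airflow/dags"
os.path.join("/opt/airflow/dags", "ingest_tables.yaml")  # "/opt/airflow/dags/ingest_tables.yaml"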

This version does not depend on any hidden knowledge about the working directory and is robust to changes in how the DAGs are loaded. It is the recommended way to load your file and is tested and supported by the Conveyor team.
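To tie everything together, here is a minimal sketch of how the loaded YAML file could drive task creation in a DAG. It assumes a recent Airflow 2.x environment; the table names, the ingest_table callable and the DAG arguments are purely illustrative, so adapt them to your own project:

from datetime import datetime
import os

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator


def read_tables():
    # Resolve the YAML file relative to this DAG file, not the working directory.
    with open(os.path.join(os.path.dirname(os.path.realpath(__file__)), "ingest_tables.yaml")) as f:
        return yaml.safe_load(f)


def ingest_table(table_name):
    # Hypothetical ingest logic; replace with your own implementation.
    print(f"Ingesting table {table_name}")


with DAG(dag_id="ingest", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    # One ingest task per table listed in ingest_tables.yaml.
    for table in read_tables()["tables"]:
        PythonOperator(
            task_id=f"ingest_{table['name']}",
            python_callable=ingest_table,
            op_kwargs={"table_name": table["name"]},
        )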