Using a YAML file to generate tasks in a DAG

In certain instances it is convenient to create a YAML file that drives task creation in your DAG.

A typical example is an ingest flow, where you use the YAML file to list all the tables to ingest in an easy-to-read manner instead of defining them in the DAG code itself.
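For example, such a file could look like the following (the table names and keys are purely illustrative):

tables:
  - name: customers
  - name: orders
  - name: invoices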

However, to make this work you need to read the YAML file in your DAG code, and for that you need to know its location. In this guide we show a robust way to do this.

First, we assume the following structure for the dags folder in your project:

dags
├── dag.py
└── ingest_tables.yaml

In this example, our dag.py needs to read in ingest_tables.yaml. A naïve way to read the file would be the following:

import yaml

def read_tables():
    # The relative path is resolved against the current working directory.
    with open("ingest_tables.yaml") as f:
        return yaml.safe_load(f)

However, this will fail: the code depends on the working directory of the process that loads your DAGs. Since you do not have any direct control over that directory, it's best not to depend on it.
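To illustrate, the relative path is resolved against whatever os.getcwd() returns for the scheduler or worker process, not against the directory containing your DAG file. The path in the comment below is purely hypothetical:

import os

# Whatever directory the scheduler/worker was started from,
# e.g. "/usr/local/airflow" (hypothetical), not your dags folder.
print(os.getcwd())

# The lookup therefore happens in that working directory and will
# typically raise FileNotFoundError:
# open("ingest_tables.yaml")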

A more robust way to implement this is the following:

import yaml
import os

def read_tables():
    # Build an absolute path to the YAML file based on the location of this DAG file.
    with open(os.path.join(os.path.dirname(os.path.realpath(__file__)), "ingest_tables.yaml")) as f:
        return yaml.safe_load(f)

Now let's unpack what this does:

  • os.path.realpath(__file__) gives you the real path of the current Python file, in this case your DAG file
  • os.path.dirname(...) gives you the directory containing your DAG file, which is a stable location
  • os.path.join(dir, "ingest_tables.yaml") joins the directory and the file name; this makes sure the correct path separator is used for the OS the Python code runs on, since path separators differ between Windows and Linux/macOS (see the worked example below)
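Putting the three calls together, and assuming (hypothetically) that the DAG file lives at /opt/airflow/dags/dag.py, the path is built up as follows:

import os

# Hypothetical location of the DAG file: /opt/airflow/dags/dag.py
os.path.realpath(__file__)                               # "/opt/airflow/dags/dag.py"
os.path.dirname("/opt/airflow/dags/dag.py")              # "/opt/airflow/dags"
os.path.join("/opt/airflow/dags", "ingest_tables.yaml")  # "/opt/airflow/dags/ingest_tables.yaml"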

This version does not depend on any hidden knowledge about the working directory and is robust to changes in how the DAGs are loaded. It is the recommended way to load your file and is tested and supported by the Conveyor team.
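To tie everything together, here is a minimal sketch of how the loaded YAML file could drive task creation in a DAG. It assumes a recent Airflow 2.x environment; the table names, the ingest_table callable and the DAG arguments are purely illustrative, so adapt them to your own project:

from datetime import datetime
import os

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator


def read_tables():
    # Resolve the YAML file relative to this DAG file, not the working directory.
    with open(os.path.join(os.path.dirname(os.path.realpath(__file__)), "ingest_tables.yaml")) as f:
        return yaml.safe_load(f)


def ingest_table(table_name):
    # Hypothetical ingest logic; replace with your own implementation.
    print(f"Ingesting table {table_name}")


with DAG(dag_id="ingest", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    # One ingest task per table listed in ingest_tables.yaml.
    for table in read_tables()["tables"]:
        PythonOperator(
            task_id=f"ingest_{table['name']}",
            python_callable=ingest_table,
            op_kwargs={"table_name": table["name"]},
        )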