7. Run multiple tasks for a given project
At the moment, all the models in your project, as well as the tests, run in a single task. This is fine for small projects with a limited number of models and dependencies. However, as your project grows, you may want to split the models over different tasks to:
- Simplify rerunning failed tasks
- Speed up runs, as splitting can increase parallelism
- Get a better overview in Airflow of which model failed
- Separate test tasks from model runs
For this reason, dbt supports selectors, which allow you to specify which models to run with a given command. This way you can choose which models to run for a given task:
# Only run models in a given directory
dbt run --select models/example
# Only run models that have the nightly tag
dbt run --select tag:nightly
7.1 Create a second model and run it as a separate task
Let's create a second model that is simply a copy of the raw_customers table, using the following code:
{{ config(materialized='external', location='s3://<conveyor_demo_XYZ>/model/customers.parquet') }}
with customers as (
    select
        id as customer_id,
        first_name,
        last_name
    from {{ source('external_source', 'raw_customers') }}
)
select * from customers
Do not forget to update the location property on line 1 with your actual bucket name.
Update the dags/$PROJECT_NAME.py file with a second ConveyorContainerOperatorV2 task, and make sure that each task selects only one model, as follows:
from conveyor.operators import ConveyorContainerOperatorV2
ConveyorContainerOperatorV2(
dag=dag,
task_id="task1",
arguments=["build", "--target", "dev", "--select", "customer_orders"],
...
)
ConveyorContainerOperatorV2(
dag=dag,
task_id="task2",
arguments=["build", "--target", "dev", "--select", "customers"],
...
)
We now have two tasks that each run and test one model. They can run in parallel since they do not depend on each other.
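To make the parallelism explicit, here is a fuller sketch of what the DAG file could look like. This assumes an Airflow 2.4+ style DAG definition; the dag_id, schedule, and start_date are placeholders (keep whatever your generated dags/$PROJECT_NAME.py already contains), and the operator settings elided above stay elided:
from datetime import datetime

from airflow import DAG
from conveyor.operators import ConveyorContainerOperatorV2

# Placeholder DAG settings: keep the values from your generated DAG file.
dag = DAG(
    dag_id="my_dbt_project",
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
)

task1 = ConveyorContainerOperatorV2(
    dag=dag,
    task_id="task1",
    arguments=["build", "--target", "dev", "--select", "customer_orders"],
    # ... keep the remaining operator settings from your existing file
)

task2 = ConveyorContainerOperatorV2(
    dag=dag,
    task_id="task2",
    arguments=["build", "--target", "dev", "--select", "customers"],
    # ... keep the remaining operator settings from your existing file
)

# No dependency (e.g. task1 >> task2) is declared, so Airflow is free to
# schedule both tasks in parallel.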
When using dbt and DuckDB, it is best to put all models that depend on each other (e.g. models that use the ref function) in one task; see the sketch below. Models for distinct use cases that only use data loaded from external sources can safely be separated into different tasks.
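If you want a single task to cover a model together with everything it depends on, dbt's + graph operator expresses that in the selector. A minimal sketch reusing the operator from above (the task_id is hypothetical):
# One task that builds customer_orders plus all models it depends on,
# so the dependent models share a single DuckDB run.
ConveyorContainerOperatorV2(
    dag=dag,
    task_id="customer_orders_with_upstream",  # hypothetical task name
    arguments=["build", "--target", "dev", "--select", "+customer_orders"],
    # ... keep the remaining operator settings as before
)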
7.2 (Alternative) Try out the ConveyorDbtTaskFactory
The ConveyorDbtTaskFactory goes one step further and creates one run task and one test task for every model in your project.
It is not that helpful when using DuckDB, however, since the model runs and tests are separated into different tasks.
This implies that you would need to persist the database state across task runs, which is currently very difficult.
If you want more information on how to use the ConveyorDbtTaskFactory, take a look at the corresponding how-to guide.
7.3 Redeploy your code
Now, rebuild the project and deploy it to your environment:
conveyor build
conveyor deploy --env $ENVIRONMENT_NAME --wait
7.4 Re-run the tasks
The initial deployment of your project ran all models in the same task. We will now instruct Airflow to re-run with the updated code, which runs the two models in separate tasks.
In the Conveyor UI, navigate to your environment and open Airflow.
Navigate to your project DAG and re-trigger it by clicking the > button and selecting Trigger DAG in the Airflow UI.
You should see that both tasks succeed.