6. Run against real data
Everything is now in place to update the code and run the project against real data.
6.1 Remove existing files
Remove all the existing files in the models/example folder so that we can start from a clean slate.
6.2 Configure source data
DuckDB supports querying data that is not stored in the database but is available from an external source. In this tutorial, we will use two CSV files, containing customer and order data, that we exposed through a public S3 bucket. If you are logged in to your AWS account in the browser, you can take a look here.
In order for dbt to know how to locate this source data, you need to add the following sources.yml file in the models/example directory.
```yaml
version: 2

sources:
  - name: external_source
    meta:
      external_location: "s3://datafy-cp-artifacts/conveyor-samples/dbt/sources/{name}.csv"
    tables:
      - name: raw_customers
      - name: raw_orders
```
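The {name} placeholder in external_location is filled in with each table's name, so one template covers both files. Conceptually (this is an illustration of the substitution, not dbt internals):

```python
# The adapter substitutes each source table's name into the
# external_location template to find the backing CSV file.
template = "s3://datafy-cp-artifacts/conveyor-samples/dbt/sources/{name}.csv"

for table in ["raw_customers", "raw_orders"]:
    print(template.format(name=table))
```

This resolves raw_customers to s3://datafy-cp-artifacts/conveyor-samples/dbt/sources/raw_customers.csv, and likewise for raw_orders.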
6.3 Update the models
In this section, we will update the code to:
- Clean the data: rename columns to be more self-explanatory
- Count the orders by user and calculate the first and last order date
- Create an output table which contains the aggregated data together with the customer information
Create a new model called customer_orders.sql. dbt uses the file name as the table or view name in your database.
```sql
{{ config(materialized='external', location='s3://<conveyor_demo_XYZ>/model/customer_orders.parquet') }}

with customers as (

    select
        id as customer_id,
        first_name,
        last_name
    from {{ source('external_source', 'raw_customers') }}

),

orders as (

    select
        id as order_id,
        user_id as customer_id,
        order_date,
        status
    from {{ source('external_source', 'raw_orders') }}

),

customer_orders as (

    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from orders
    group by 1

),

final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders
    from customers
    left join customer_orders using (customer_id)

)

select * from final
```
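To see what the model computes, here is a minimal Python sketch of the same join-and-aggregate logic over toy data (the names and dates are made up for illustration):

```python
# Hypothetical toy rows standing in for the two source tables.
customers = [
    {"customer_id": 1, "first_name": "Ada", "last_name": "Lovelace"},
    {"customer_id": 2, "first_name": "Alan", "last_name": "Turing"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "order_date": "2023-01-05"},
    {"order_id": 11, "customer_id": 1, "order_date": "2023-02-01"},
]

# Aggregate orders per customer: first/last order date and order count,
# mirroring the customer_orders CTE.
agg = {}
for o in orders:
    a = agg.setdefault(o["customer_id"],
                       {"first": o["order_date"], "last": o["order_date"], "n": 0})
    a["first"] = min(a["first"], o["order_date"])
    a["last"] = max(a["last"], o["order_date"])
    a["n"] += 1

# Left join onto customers; coalesce the order count to 0 for
# customers without orders, as in the final CTE.
final = []
for c in customers:
    a = agg.get(c["customer_id"], {"first": None, "last": None, "n": 0})
    final.append({**c,
                  "first_order_date": a["first"],
                  "most_recent_order_date": a["last"],
                  "number_of_orders": a["n"]})

print(final)
```

Customer 1 ends up with two orders and a date range, while customer 2, who has no orders, keeps a count of 0 thanks to the coalesce.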
Do not forget to replace the S3 bucket placeholder <conveyor_demo_XYZ> in the location property on line 1 with the actual bucket name.
6.4 Add dbt tests
In order to validate that the customer_orders model is created correctly, we will define a couple of tests. Create a schema.yml file in the models/example folder with the following content:
```yaml
version: 2

models:
  - name: customer_orders
    description: "Resulting table joining the customer and order data"
    columns:
      - name: customer_id
        description: "The primary key for this table"
        tests:
          - unique
          - not_null
      - name: first_name
        description: "The first name of the customer"
        tests:
          - not_null
      - name: last_name
        description: "The last name of the customer"
        tests:
          - not_null
```
6.5 Build and deploy
Now, re-build the project and deploy it to your environment.

```bash
conveyor build
conveyor deploy --env $ENVIRONMENT_NAME --wait
```
6.6 Re-run the task
The initial deployment of your project ran with generated data. We will instruct Airflow to re-run the task with the updated code, which uses the sample data from S3.
In the Conveyor UI, navigate to your environment and open Airflow.
Navigate to your project's DAG and re-trigger it by clicking the > button and selecting Trigger DAG in the Airflow UI.
You should see that the task succeeds.