6. Run against real data

Everything is now in place to update the code and run the project against real data.

6.1 Remove existing files

Remove all the existing files in the `models/example` folder so that we can start from a clean slate.
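If you are working from a terminal, a one-liner like the following does the job (a minimal sketch; run it from the project root and adjust the path if your layout differs):

```bash
# Remove the generated example models so we start from a clean slate
rm models/example/*
```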

6.2 Configure source data

DuckDB supports querying data that is not stored in the database but is available from an external source. In this tutorial, we will use two CSV files, containing customer and order data, that we exposed through a public S3 bucket. If you are logged in to your AWS account in the browser, you can take a look [here](https://s3.console.aws.amazon.com/s3/buckets/datafy-cp-artifacts?region=eu-west-1&prefix=conveyor-samples/dbt/sources/&showversions=false).
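If you prefer the command line over the console, you can also list the sample files with the AWS CLI (a sketch, assuming the AWS CLI is installed and your credentials allow listing the bucket):

```bash
# List the sample CSV files in the public bucket
aws s3 ls s3://datafy-cp-artifacts/conveyor-samples/dbt/sources/
```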

For dbt to know how to locate this source data, add the following `sources.yml` file to the `models/example` directory:

```yaml
version: 2

sources:
  - name: external_source
    meta:
      external_location: "s3://datafy-cp-artifacts/conveyor-samples/dbt/sources/{name}.csv"
    tables:
      - name: raw_customers
      - name: raw_orders
```
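The `{name}` placeholder in `external_location` is substituted with each table name, so `raw_customers` resolves to `.../raw_customers.csv`. As a quick sanity check, you can ask dbt to list the sources it now knows about (assuming dbt is available locally):

```bash
# Both raw_customers and raw_orders should appear in the output
dbt ls --resource-type source
```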

6.3 Update the models

In this section, we will update the code to:

  1. Clean the data: rename columns to be more self-explanatory
  2. Count the orders by user and calculate the first and last order date
  3. Create an output table which contains the aggregated data together with the customer information

Create a new model called `customer_orders.sql` in the `models/example` folder.

info

The name of the file will be used by dbt as the table or view name in your database.

```sql
{{ config(materialized='external', location='s3://<conveyor_demo_XYZ>/model/customer_orders.parquet') }}

with customers as (
    select
        id as customer_id,
        first_name,
        last_name
    from {{ source('external_source', 'raw_customers') }}
),

orders as (
    select
        id as order_id,
        user_id as customer_id,
        order_date,
        status
    from {{ source('external_source', 'raw_orders') }}
),

customer_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from orders
    group by 1
),

final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders
    from customers
    left join customer_orders using (customer_id)
)

select * from final
```
important

Do not forget to replace the S3 bucket placeholder `<conveyor_demo_XYZ>` in the `location` property on the first line with your actual bucket name.
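Before deploying, you can try the model locally (a sketch, assuming you have dbt with the DuckDB adapter set up on your machine and credentials for both buckets):

```bash
# Build only the new model; dbt resolves the sources and writes the parquet file
dbt run --select customer_orders
```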

6.4 Add dbt tests

To validate that the `customer_orders` model is created correctly, we will define a couple of tests. Create a `schema.yml` file in the `models/example` folder with the following content:

```yaml
version: 2

models:
  - name: customer_orders
    description: "Resulting table joining the customer and order data"
    columns:
      - name: customer_id
        description: "The primary key for this table"
        tests:
          - unique
          - not_null
      - name: first_name
        description: "The first name of the customer"
        tests:
          - not_null
      - name: last_name
        description: "The last name of the customer"
        tests:
          - not_null
```
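These tests can be run locally as well (same local-dbt assumption as before):

```bash
# Execute only the tests attached to the customer_orders model
dbt test --select customer_orders
```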

6.5 Build and deploy

Now, rebuild the project and deploy it to your environment:

```bash
conveyor build
conveyor deploy --env $ENVIRONMENT_NAME --wait
```

6.6 Re-run the task

The initial deployment of your project ran with generated data. We will now instruct Airflow to re-run the task with the updated code, which uses the sample data from S3.

In the Conveyor UI, navigate to your environment and open Airflow. Find your project's DAG and re-trigger it by clicking the play (▶) button and selecting Trigger DAG. You should see the task succeed.
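If your setup also exposes the Airflow CLI, the same re-run can be scripted; this is an assumption, as Conveyor environments are typically operated through the UI, and the DAG id below is a placeholder:

```bash
# Trigger a new run of the project DAG
# (replace <your_project_dag> with the DAG id shown in the Airflow UI)
airflow dags trigger <your_project_dag>
```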