6. Run against real data

Everything is now in place to update the code and run the project against real data.

6.1 Remove existing files

Remove all the existing files in the `models/example` folder so that we can start from a clean slate.
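If you are working from a terminal, a one-liner like the following does the job (a minimal sketch; run it from the project root and adjust the path if your layout differs):

```bash
# Remove the generated example models so we start from a clean slate
rm models/example/*
```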

6.2 Configure source data

DuckDB supports querying data that is not stored in the database but is available from an external source. In this tutorial, we will use two CSV files, containing customer and order data, that we exposed through a public S3 bucket. If you are logged in to your AWS account in the browser, you can take a look [here](https://s3.console.aws.amazon.com/s3/buckets/datafy-cp-artifacts?region=eu-west-1&prefix=conveyor-samples/dbt/sources/&showversions=false).
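If you prefer the command line over the console, you can also list the sample files with the AWS CLI (a sketch, assuming the AWS CLI is installed and your credentials allow listing the bucket):

```bash
# List the sample CSV files in the public bucket
aws s3 ls s3://datafy-cp-artifacts/conveyor-samples/dbt/sources/
```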

For dbt to know how to locate this source data, add the following `sources.yml` file to the `models/example` directory:

```yaml
version: 2

sources:
  - name: external_source
    meta:
      external_location: "s3://datafy-cp-artifacts/conveyor-samples/dbt/sources/{name}.csv"
    tables:
      - name: raw_customers
      - name: raw_orders
```
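The `{name}` placeholder in `external_location` is substituted with each table name, so `raw_customers` resolves to `.../raw_customers.csv`. As a quick sanity check, you can ask dbt to list the sources it now knows about (assuming dbt is available locally):

```bash
# Both raw_customers and raw_orders should appear in the output
dbt ls --resource-type source
```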

6.3 Update the models

In this section, we will update the code to:

  1. Clean the data: rename columns to be more self-explanatory
  2. Count the orders by user and calculate the first and last order date
  3. Create an output table which contains the aggregated data together with the customer information

Create a new model called `customer_orders.sql` in the `models/example` folder.

info

The name of the file will be used by dbt as the table or view name in your database.

```sql
{{ config(materialized='external', location='s3://<conveyor_demo_XYZ>/model/customer_orders.parquet') }}

with customers as (
    select
        id as customer_id,
        first_name,
        last_name
    from {{ source('external_source', 'raw_customers') }}
),

orders as (
    select
        id as order_id,
        user_id as customer_id,
        order_date,
        status
    from {{ source('external_source', 'raw_orders') }}
),

customer_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from orders
    group by 1
),

final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders
    from customers
    left join customer_orders using (customer_id)
)

select * from final
```
important

Do not forget to replace the S3 bucket placeholder `<conveyor_demo_XYZ>` in the `location` property on the first line with your actual bucket name.
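Before deploying, you can try the model locally (a sketch, assuming you have dbt with the DuckDB adapter set up on your machine and credentials for both buckets):

```bash
# Build only the new model; dbt resolves the sources and writes the parquet file
dbt run --select customer_orders
```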

6.4 Add dbt tests

To validate that the `customer_orders` model is created correctly, we will define a couple of tests. Create a `schema.yml` file in the `models/example` folder with the following content:

```yaml
version: 2

models:
  - name: customer_orders
    description: "Resulting table joining the customer and order data"
    columns:
      - name: customer_id
        description: "The primary key for this table"
        tests:
          - unique
          - not_null
      - name: first_name
        description: "The first name of the customer"
        tests:
          - not_null
      - name: last_name
        description: "The last name of the customer"
        tests:
          - not_null
```
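These tests can be run locally as well (same local-dbt assumption as before):

```bash
# Execute only the tests attached to the customer_orders model
dbt test --select customer_orders
```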

6.5 Build and deploy

Now, rebuild the project and deploy it to your environment:

```bash
conveyor build
conveyor deploy --env $ENVIRONMENT_NAME --wait
```

6.6 Re-run the task

The initial deployment of your project ran with generated data. We will now instruct Airflow to re-run the task with the updated code, which uses the sample data from S3.

In the Conveyor UI, navigate to your environment and open Airflow. Find your project's DAG and re-trigger it by clicking the play (▶) button and selecting Trigger DAG. You should see the task succeed.
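If your setup also exposes the Airflow CLI, the same re-run can be scripted; this is an assumption, as Conveyor environments are typically operated through the UI, and the DAG id below is a placeholder:

```bash
# Trigger a new run of the project DAG
# (replace <your_project_dag> with the DAG id shown in the Airflow UI)
airflow dags trigger <your_project_dag>
```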