5. Add and configure project resources

The project we deployed generates data and transforms it, all in the same task. The resulting tables are persisted only in a database file that can be read only by DuckDB. Now we will take it a step further by integrating with an external system, so that our data can be used by other components as well:

  • reading data from a public data source available on S3
  • cleaning the data
  • publishing the data to our own S3 bucket

In this part of the tutorial, we will create external resources and configure access.

5.1 Create external resources

To support this integration, you need to create some resources in your AWS account.

5.1.1 Create an S3 bucket

Create a new S3 bucket named conveyor-demo-xyz, where xyz should be replaced by a random string (S3 bucket names must be globally unique, lowercase, and may not contain underscores).
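If you prefer the command line over the console, the bucket can also be created with the AWS CLI. A minimal sketch, assuming the CLI is installed and configured with credentials; the name suffix and region below are placeholders:

```shell
# Pick a bucket name with a random suffix and validate it against S3 naming
# rules (lowercase letters, digits, dots and hyphens only) before creating it.
bucket="conveyor-demo-$(date +%s)"
echo "$bucket" | grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$' && echo "valid: $bucket"

# Then create the bucket (requires AWS credentials; region is an example):
# aws s3 mb "s3://$bucket" --region eu-west-1
```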

5.2 Configure access to external resources

AWS IAM roles are the permission mechanism for granting access to AWS services such as S3. We will create an AWS IAM role that Conveyor can use to access the created S3 bucket and the AWS Glue database:

Some Conveyor installations allow AWS IAM roles to be created from within a project; we call these project resources. Under the hood, Conveyor uses Terraform, an open-source infrastructure-as-code tool.

caution

Project resources are not supported on Azure. Creating the necessary external resources for a project should then be done outside of a Conveyor project.

Run the following command in the project folder. For this tutorial, select the options listed below for your project (other options can be left at their defaults):

conveyor template apply --template resource/aws/dbt-iam-role-s3

Specify the following variables when prompted; the rest can be left at their defaults.

  • resource_name: <<Insert the name of your project>>
  • bucket_name: <<Insert the name of the S3 bucket>>
  • project_name: <<Insert the name of your project>>

This creates a new resources folder containing resource definitions that create the role and grant it access to the S3 bucket.

We will have to make a small modification to allow the role to read the raw customer and order data, which we expose in a public S3 bucket. Update the policy document by adding a statement granting S3 read access on the datafy-cp-artifacts bucket.

...
data "aws_iam_policy_document" "default" {
  statement {
    actions = [
      "s3:List*",
      "s3:Get*",
    ]
    resources = [
      "arn:aws:s3:::datafy-cp-artifacts*",
    ]
    effect = "Allow"
  }
}
...

Have a look at the generated code and inspect the IAM role that would be created. You will need its name in the next step.

5.3 Configure Airflow to use the correct identity

Update the workflow definition file dags/$PROJECT_NAME.py and change the aws_role argument to the role we defined in the previous step. Here is an example:

...

aws_role="john-{{ macros.conveyor.env() }}"

...
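The {{ macros.conveyor.env() }} part is a Jinja template that Airflow expands at runtime to the Conveyor environment name, so the same DAG assumes a differently named role per environment. A minimal sketch that simulates this substitution by hand (the environment names are examples, not values from the tutorial):

```python
# Simulate how Airflow renders the templated role name at runtime.
# In a real DAG this substitution is done by Airflow's Jinja templating;
# here we perform it manually, purely for illustration.
def resolve_role(template: str, env: str) -> str:
    return template.replace("{{ macros.conveyor.env() }}", env)

aws_role = "john-{{ macros.conveyor.env() }}"

print(resolve_role(aws_role, "dev"))   # john-dev
print(resolve_role(aws_role, "prod"))  # john-prod
```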

5.4 Make sure dbt uses the AWS credentials

The last step is to tell dbt, via its profile, to use the AWS credentials available in the Docker container. This is done by adding the use_credential_provider property under the dev output in the profiles.yml file. The beginning of the file should now look as follows:

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      threads: 1
      extensions:
        - httpfs
        - parquet
      use_credential_provider: aws
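With the httpfs extension loaded and use_credential_provider: aws set, DuckDB can read from and write to S3 using the role's credentials. As an illustration, a dbt model could then reference the public data directly; the key prefix below is a hypothetical placeholder, not the dataset's actual location:

```sql
-- Hypothetical dbt model: read raw customer data straight from S3.
-- Replace <prefix> with the real key prefix of the public dataset.
select *
from read_parquet('s3://datafy-cp-artifacts/<prefix>/customers/*.parquet')
```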