5. Add and configure project resources
The project we deployed generates and transforms data in a single task, and the resulting tables are persisted only in a local database file that can be read by DuckDB alone. Now we will take it a step further by integrating with an external system so that our data can be used by other components as well:
- reading data from a public data source available on S3
- cleaning the data
- publishing the data to our own S3 bucket
In this part of the tutorial, we will create external resources and configure access.
5.1 Create external resources
To support this integration, you need to create some resources in your AWS account.
5.1.1 Create an S3 bucket
Create a new S3 bucket named conveyor-demo-XYZ, where XYZ should be replaced by a random string (S3 bucket names may only contain lowercase letters, digits, dots and hyphens).
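If you prefer creating the bucket from code instead of the AWS console, the sketch below uses boto3. The bucket name and region are placeholders for illustration, not values prescribed by this tutorial.

import boto3

# Placeholders: pick your own random suffix and region.
bucket_name = "conveyor-demo-abc123"
region = "eu-west-1"

s3 = boto3.client("s3", region_name=region)

# Outside us-east-1, S3 requires an explicit location constraint.
s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={"LocationConstraint": region},
)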
5.2 Configure access to external resources
AWS IAM roles are the permission mechanism for granting access to AWS services like S3. We will create an AWS IAM role that Conveyor can use to access the created S3 bucket and the AWS Glue database:
- Conveyor managed
- Externally managed
Some Conveyor installations allow AWS roles to be created from within a project; we call these project resources. To create them, Conveyor uses Terraform, an open-source infrastructure-as-code tool.
Project resources are not supported on Azure. The external resources needed by a project should be created outside of the Conveyor project.
Run the following command from within the project folder:
conveyor template apply --template resource/aws/dbt-iam-role-s3
Specify the following variables when prompted; the rest can be left at their defaults.
- resource_name: <<Insert the name of your project>>
- bucket_name: <<Insert the name of the S3 bucket>>
- project_name: <<Insert the name of your project>>
This creates a new resources folder containing the resource definitions to create a role and grant it access to the S3 bucket.
We have to make a small modification to allow the role to also read the raw customer and order data, which we expose in a public S3 bucket. Update the policy document by adding a statement that grants S3 read access on the datafy-cp-artifacts bucket:
...
data "aws_iam_policy_document" "default" {
statement {
actions = [
"s3:List*", "s3:Get*"
]
resources = [
"arn:aws:s3:::datafy-cp-artifacts*",
]
effect = "Allow"
}
}
...
Have a look at the generated code and note the name of the IAM role that will be created; you will need it in the next step.
Some Conveyor installations require AWS roles to be created outside of the project. In that case the AWS roles are managed centrally through Terraform, CloudFormation templates, etc., and the defined role names are referenced in the role property of the Airflow DAG definition. For this tutorial, use the Conveyor managed approach.
5.3 Configure Airflow to use the correct identity
Update the workflow definition file dags/$PROJECT_NAME.py and change the role variable to the one we defined in the previous step. Here is an example:
- AWS
- Azure
...
aws_role="john-{{ macros.conveyor.env() }}"
...
...
azure_application_client_id="john-{{ macros.conveyor.env() }}"
...
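For context, below is a minimal sketch of how that role setting typically sits in a generated DAG on AWS. The operator class, DAG id, task id and dbt arguments are assumptions for illustration only; check your generated dags/$PROJECT_NAME.py for the actual names and only change the aws_role (or azure_application_client_id) value.

from datetime import datetime

from airflow import DAG
# Assumed import: Conveyor ships its own Airflow operators; verify the exact
# class used in your generated DAG file.
from conveyor.operators import ConveyorContainerOperatorV2

with DAG(
    dag_id="my-project",                 # placeholder DAG id
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    ConveyorContainerOperatorV2(
        task_id="dbt-build",             # placeholder task id
        # Use the IAM role created by the project resources; the macro appends
        # the Conveyor environment name, e.g. my-project-dev.
        aws_role="my-project-{{ macros.conveyor.env() }}",
        cmds=["dbt"],                    # assumed dbt invocation
        arguments=["build"],
    )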
5.4 Make sure dbt uses the AWS credentials
The last step is to tell dbt to use the AWS credentials available in the Docker container. This is done by adding the following line to the profiles.yml file as a property under the dev output. The beginning of the file should now look as follows:
default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      threads: 1
      extensions:
        - httpfs
        - parquet
      use_credential_provider: aws
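With this in place, dbt-duckdb picks up credentials from the standard AWS credential chain, which inside the Conveyor container resolves to the IAM role we configured. You can reproduce roughly the same behaviour outside dbt with DuckDB's own extensions; the sketch below is only an approximation for illustration, and the S3 path is a placeholder rather than the tutorial's actual dataset location.

import duckdb

con = duckdb.connect("/tmp/dbt.duckdb")

# Same extensions as in profiles.yml, plus the aws extension to pull
# credentials from the default AWS credential chain (the container's IAM role).
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("INSTALL aws")
con.execute("LOAD aws")
con.execute("CALL load_aws_credentials()")

# Placeholder path: substitute the prefix of the raw customer/order data.
rows = con.execute(
    "SELECT count(*) FROM read_parquet('s3://datafy-cp-artifacts/<prefix>/*.parquet')"
).fetchone()
print(rows)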