
5. Add and configure project resources

The project we deployed generates data, transforms it and writes the output to the logs. We will take it a step further by integrating with external systems:

  • reading data from a public dataset available on S3
  • transforming the data
  • publishing the data to our own S3 bucket
  • registering it in the AWS Glue Catalog
  • querying the data using Amazon Athena

In this part of the tutorial, we will create external resources and configure access.

5.1 Create external resources

To support the integration, you need to create a few resources in your AWS account.

5.1.1 Create an S3 bucket

Create a new S3 bucket named conveyor-demo-xyz, where xyz should be replaced by a random string. Bucket names must be globally unique and may only contain lowercase letters, numbers, hyphens, and periods, so uppercase characters and underscores are not allowed.
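
If you prefer the command line over the console, the bucket can also be created with the AWS CLI. This is only a sketch: the bucket name and region below are placeholders, so substitute your own values.

aws s3 mb s3://conveyor-demo-xyz --region eu-west-1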

5.1.2 Create a Glue database

Create a new Glue database with the location property pointing to the bucket created above.
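
The database can likewise be created from the console or, as sketched below, with the AWS CLI. The database name conveyor_demo is only an example; point LocationUri at the bucket created in the previous step.

aws glue create-database --database-input '{"Name": "conveyor_demo", "LocationUri": "s3://conveyor-demo-xyz/"}'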

5.2 Configure access to external resources

AWS IAM roles are the permission mechanism for granting access to AWS services such as S3 and Glue. We will create an AWS IAM role that Conveyor can use to access the S3 bucket and Glue database created above.

Some Conveyor installations allow AWS roles to be created from within a project. We call these project resources. Conveyor manages them using Terraform, an open-source infrastructure-as-code tool.

caution

Project resources are not supported on Azure. On Azure, the necessary external resources for a project must be created outside of Conveyor.

Run the following command from within the project folder:

conveyor template apply --template resource/aws/spark-iam-role-glue

When prompted, specify the following variables; everything else can be left at its default value.

  • resource_name: <<Insert the name of your project>>
  • bucket_name: <<Insert the name of the S3 bucket>>
  • database_name: <<Insert the name of the Glue database>>

This creates a new resources folder containing the resource definitions for an IAM role that is granted access to the bucket and the Glue database.

We have to make one small modification so the role can also read the OpenAQ data. Update the policy document by adding a statement that grants S3 access to the openaq-fetches bucket.

...

data "aws_iam_policy_document" "spark_iam_role_glue" {
statement {
actions = [
"s3:*"
]
resources = [
"arn:aws:s3:::openaq-fetches",
"arn:aws:s3:::openaq-fetches/*"
]
effect = "Allow"
}

...

Have a look at the generated code and note the name of the IAM role that will be created; you will need it in the next step.
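
One quick way to inspect this is to search the generated Terraform files for the role definition (assuming the template wrote them into the resources folder mentioned above):

grep -Rn "aws_iam_role" resources/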

5.3 Configure Airflow to use the correct identity

Update the workflow definition file dags/$PROJECT_NAME.py and set the aws_role argument to the role defined in the previous step. The {{ macros.conveyor.env() }} macro resolves to the name of the current environment, so the same DAG can pick up the matching role in every environment. Here is an example:

from conveyor.operators import ConveyorSparkSubmitOperatorV2

task = ConveyorSparkSubmitOperatorV2(
    task_id="my_task",
    aws_role="john-{{ macros.conveyor.env() }}",
    ...
)