5. Add and configure project resources
The project we deployed generates data, transforms it and writes the output to the logs. We will take it a step further by integrating with external systems:
- reading data from a public datasource available on S3
- transforming the data
- publishing the data to our own S3 bucket
- registering it in the AWS Glue Catalog
- querying the data using Amazon Athena
In this part of the tutorial, we will create external resources and configure access.
5.1. Create external resources
To support this integration, you need to create some resources in your AWS account.
5.1.1. Create an S3 bucket
Create a new S3 bucket named conveyor-demo-XYZ, where XYZ is replaced by a random string of lowercase letters and digits. S3 bucket names cannot contain underscores or uppercase letters.
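A minimal Python sketch for generating such a bucket name. The helper names are hypothetical and not part of Conveyor or AWS tooling. The check covers only the main S3 naming rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; starting and ending with a letter or digit), which is why the prefix uses hyphens as separators.

```python
import re
import secrets
import string

# Simplified S3 naming check: 3-63 characters, lowercase letters, digits,
# dots and hyphens, starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the simplified check above."""
    return bool(_BUCKET_NAME_RE.match(name))


def make_bucket_name(prefix: str = "conveyor-demo", suffix_length: int = 8) -> str:
    """Return `<prefix>-<random suffix>` (hypothetical helper for this tutorial)."""
    alphabet = string.ascii_lowercase + string.digits
    suffix = "".join(secrets.choice(alphabet) for _ in range(suffix_length))
    name = f"{prefix}-{suffix}"
    if not is_valid_bucket_name(name):
        raise ValueError(f"not a valid S3 bucket name: {name!r}")
    return name


if __name__ == "__main__":
    print(make_bucket_name())
```

The generated name can then be used when creating the bucket in the AWS console or with the AWS CLI.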
5.1.2. Create a Glue database
Create a new Glue database with the location property pointing to the bucket created above.
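If you prefer scripting this step over using the AWS console, the sketch below shows one possible way to do it with boto3. The function names and the example database and bucket names are assumptions made for illustration. The call requires boto3 to be installed and AWS credentials that are allowed to create Glue databases.

```python
def glue_database_input(database_name: str, bucket_name: str) -> dict:
    """Build the DatabaseInput argument for Glue's CreateDatabase call."""
    return {
        "Name": database_name,
        # The location property points at the root of the bucket.
        "LocationUri": f"s3://{bucket_name}/",
    }


def create_glue_database(database_name: str, bucket_name: str) -> None:
    """Create the Glue database (hypothetical helper; needs boto3 and credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    glue.create_database(
        DatabaseInput=glue_database_input(database_name, bucket_name)
    )


if __name__ == "__main__":
    # Example values only; use your own database and bucket names.
    print(glue_database_input("conveyor_demo", "conveyor-demo-abc12345"))
```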
5.2. Configure access to external resources
An AWS IAM role is a set of permissions that grants access to AWS services such as S3 and Glue. We will create an IAM role that Conveyor can use to access the S3 bucket and the Glue database created above. The role can be managed in one of two ways:
- Conveyor managed
- Externally managed
Conveyor managed: some Conveyor installations allow AWS roles to be created from within a project. We call these project resources. They are defined with Terraform, an open-source infrastructure-as-code tool.
Project resources are not supported on Azure. There, the external resources a project needs must be created outside of the Conveyor project.
Run the following command in the project folder to generate the resource definitions:
conveyor template apply --template resource/aws/spark-iam-role-glue
Specify the following variables when prompted and leave the rest at their default values:
- resource_name: <<Insert the name of your project>>
- bucket_name: <<Insert the name of the S3 bucket>>
- database_name: <<Insert the name of the Glue database>>
This creates a new resources folder containing the resource definitions that create a role and grant it access to the bucket and the Glue database.
We need to make a small modification to allow the role to read the OpenAQ data. Update the policy document by adding a statement that grants S3 access to the openaq-fetches bucket.
...
data "aws_iam_policy_document" "spark_iam_role_glue" {
  statement {
    actions = [
      "s3:*"
    ]
    resources = [
      "arn:aws:s3:::openaq-fetches",
      "arn:aws:s3:::openaq-fetches/*"
    ]
    effect = "Allow"
  }
...
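The statement above allows every S3 action on the OpenAQ bucket. Since the job only reads from that bucket, a narrower set of actions may be sufficient. This is an assumption, not something the template prescribes, and your job may need additional actions. The Python sketch below, with hypothetical helper names, builds the IAM policy JSON for such a read-only statement so you can see what the role would grant.

```python
import json


def s3_statement(bucket: str, actions=("s3:GetObject", "s3:ListBucket")) -> dict:
    """Build one Allow statement covering a bucket and the objects in it."""
    return {
        "Effect": "Allow",
        "Action": list(actions),
        "Resource": [
            f"arn:aws:s3:::{bucket}",    # the bucket itself (for listing)
            f"arn:aws:s3:::{bucket}/*",  # the objects in the bucket (for reading)
        ],
    }


def policy_document(*statements: dict) -> str:
    """Render the statements as an IAM policy document in JSON."""
    return json.dumps(
        {"Version": "2012-10-17", "Statement": list(statements)}, indent=2
    )


if __name__ == "__main__":
    print(policy_document(s3_statement("openaq-fetches")))
```

To try this in the Terraform definition, replace "s3:*" in the actions list with the narrower actions and check that the job can still read the data.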
Have a look at the generated code and find out which IAM role will be created. You will need its name for the next step.
Externally managed: some Conveyor installations require AWS roles to be created outside of the project. In that case, the roles are managed centrally through Terraform, CloudFormation templates, etc., and the resulting role names are referenced in the role property of the Airflow DAG definition. For this tutorial, use the Conveyor managed option.
5.3. Configure Airflow to use the correct identity
Update the workflow definition file dags/$PROJECT_NAME.py and change the role variable to the one we defined in the previous step. Here is an example:
AWS:

from conveyor.operators import ConveyorSparkSubmitOperatorV2

task = ConveyorSparkSubmitOperatorV2(
    task_id="my_task",
    aws_role="john-{{ macros.conveyor.env() }}",
    ...
)

Azure:

from conveyor.operators import ConveyorSparkSubmitOperatorV2

task = ConveyorSparkSubmitOperatorV2(
    task_id="my_task",
    azure_application_client_id="john-{{ macros.conveyor.env() }}",
    ...
)