Publishing lineage to DataHub
Enabling DataHub lineage for all dags
Conveyor provides an integration with DataHub out of the box. To start integrating Airflow with your DataHub cluster, you have to do two things: create an Airflow DataHub connection and enable the DataHub integration on an environment.
Creating the Airflow DataHub connection
You have to create an Airflow DataHub rest connection.
The default name used by the DataHub plugin is datahub_rest_default.
You can configure the connection in the Airflow UI by going to admin, connections and pressing the plus button.
Your configuration should look similar to the screenshot below. Your password should be a DataHub auth token.
Make sure that you configure the DataHub metadata service (also known as gms) as the server endpoint.
If you deploy DataHub using its Helm chart, the gms backend can be reached at: https://<hostname>/api/gms
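The connection fields described above can be sketched in a few lines of Python. This is only an illustration of what goes into the connection form, not part of Conveyor or the DataHub plugin: the helper names, the example hostname, and the placeholder token are hypothetical, and the connection type identifier datahub-rest is an assumption that may differ between plugin versions.

```python
import json

# Default connection id used by the DataHub plugin (taken from the text above).
CONN_ID = "datahub_rest_default"


def gms_endpoint(hostname: str) -> str:
    """Return the gms URL for a DataHub deployment made with the Helm chart."""
    return f"https://{hostname}/api/gms"


def connection_fields(hostname: str, token: str) -> dict:
    """Collect the values you would enter in the Airflow connection form."""
    return {
        "conn_id": CONN_ID,
        # Assumption: the connection type id; check your plugin version.
        "conn_type": "datahub-rest",
        # The server endpoint must point at gms, not at the DataHub frontend.
        "host": gms_endpoint(hostname),
        # The password field holds a DataHub auth token.
        "password": token,
    }


if __name__ == "__main__":
    # Hypothetical hostname and placeholder token, for illustration only.
    fields = connection_fields("datahub.example.com", "<datahub-auth-token>")
    print(json.dumps(fields, indent=2))
```

Comparing the printed values with the connection form is a quick way to catch a server endpoint that is missing the /api/gms suffix.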
Enabling the DataHub integration
Once the connection is set up, you can activate the DataHub integration on an environment. You can configure the integration using the Conveyor web app, the CLI, or Terraform.
Configuration options
When enabling the DataHub integration, three configuration options are available:
- capture ownership: extracts the owner from the dag configuration and captures it as a DataHub corpuser
- capture tags: extracts the tags from the dag configuration and captures them as DataHub tags
- graceful exceptions: if enabled, the exception stack trace is suppressed and only the exception message is visible in the logs.
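To make the three options more concrete, the following Python sketch models their effect on a dag configuration. It is not the plugin's implementation: the function names and the dictionary layout are hypothetical, and it assumes the owner is set through default_args. The urn:li:corpuser: and urn:li:tag: prefixes follow DataHub's URN naming for users and tags.

```python
import traceback


def lineage_metadata(dag_config: dict, capture_ownership: bool = True,
                     capture_tags: bool = True) -> dict:
    """Model which DataHub entities the capture options would produce."""
    metadata = {}
    if capture_ownership:
        # Assumption: the owner comes from the dag's default_args.
        owner = dag_config.get("default_args", {}).get("owner")
        if owner:
            metadata["owners"] = [f"urn:li:corpuser:{owner}"]
    if capture_tags:
        metadata["tags"] = [f"urn:li:tag:{tag}" for tag in dag_config.get("tags", [])]
    return metadata


def describe_failure(exc: BaseException, graceful: bool) -> str:
    """Return what would be logged: the message only, or the full stack trace."""
    if graceful:
        return str(exc)
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))


if __name__ == "__main__":
    # Hypothetical dag configuration, for illustration only.
    dag_config = {"default_args": {"owner": "data-team"}, "tags": ["finance", "daily"]}
    print(lineage_metadata(dag_config))
    print(describe_failure(ValueError("DataHub unreachable"), graceful=True))
```

With both capture options disabled, the sketch returns an empty dictionary, which mirrors the idea that lineage is still published but without owner or tag information.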
UI
On the respective environment, click on the settings tab and you will see the following screen:
Toggle the DataHub integration and change the settings where necessary. Finally, persist your changes by clicking on the save button.
CLI
Once the connection is set, you can enable the integration using the conveyor environment update command:
conveyor environment update --name ENV_NAME --airflow-datahub-integration-enabled=true --deletion-protection=false
After enabling the integration, every task running in that specific Airflow environment should automatically send updates to DataHub.
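If you script environment changes, the command above can be assembled in Python before handing it to a shell or to subprocess. This is a minimal sketch: the helper name is hypothetical, only the flags shown in the command above are used, and passing false to the integration flag to disable it is an assumption.

```python
import shlex
from typing import List


def build_update_command(env_name: str, enabled: bool = True) -> List[str]:
    """Build the argument list for `conveyor environment update`."""
    return [
        "conveyor", "environment", "update",
        "--name", env_name,
        # Assumption: "=false" disables the integration again.
        f"--airflow-datahub-integration-enabled={str(enabled).lower()}",
        # Kept as in the command shown above.
        "--deletion-protection=false",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_update_command("ENV_NAME")))
```

Building the command as a list avoids quoting problems when environment names contain unusual characters; the list can be passed directly to subprocess.run.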
Terraform
A final way to configure DataHub integration is to use the conveyor_environment resource in Terraform.
For more details, have a look at the environment resource.