
Publish Airflow lineage to DataHub

Enabling DataHub lineage for all DAGs

Conveyor supports DataHub integration out of the box. For more information about DataHub, see the DataHub documentation. To start integrating Airflow with your DataHub cluster, you have to do two things: create an Airflow DataHub connection and enable the DataHub integration on your environment.

Creating the Airflow DataHub connection

You have to create an Airflow DataHub REST connection. The default connection name used by the DataHub plugin is datahub_rest_default.

You can configure the connection in the Airflow UI by going to Admin → Connections and pressing the plus button to create a new connection.

info

Make sure that you configure the DataHub metadata service (also known as gms) as the server endpoint. If you deploy DataHub using its Helm chart, the gms backend can be reached at: https://<hostname>/api/gms

Your password should be a DataHub auth token.
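
If you prefer to script this step instead of using the UI, the same connection can be created programmatically. Below is a minimal sketch, assuming an Airflow 2.x deployment with database access; the connection type string ("datahub-rest") is an assumption that may differ per DataHub plugin version, and the host and token are placeholders.

```python
# Minimal sketch: create the DataHub REST connection programmatically.
# Assumes Airflow 2.x; the conn_type string may differ per plugin version.
from airflow.models import Connection
from airflow.settings import Session

conn = Connection(
    conn_id="datahub_rest_default",     # default name expected by the DataHub plugin
    conn_type="datahub-rest",           # assumption: verify against your plugin version
    host="https://<hostname>/api/gms",  # the DataHub metadata service (gms) endpoint
    password="<datahub-auth-token>",    # a DataHub auth token
)

session = Session()
# Only add the connection if it does not exist yet.
if session.query(Connection).filter(Connection.conn_id == conn.conn_id).first() is None:
    session.add(conn)
    session.commit()
session.close()
```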

Enabling the DataHub integration

Once the connection is set up, you can activate the DataHub integration on an environment. You can configure the DataHub integration using the Conveyor web app, the CLI, or Terraform.

Configuration options

When enabling the DataHub integration, three configuration options are available:

  • capture ownership: extracts the owner from the DAG configuration and captures it as a DataHub corpuser
  • capture tags: extracts the tags from the DAG configuration and captures them as DataHub tags (both of the above are illustrated in the example DAG after this list)
  • graceful exceptions: if enabled, the exception stack trace is suppressed and only the exception message is visible in the logs
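
To make the first two options concrete, here is a hypothetical DAG (the dag_id, owner, and tags are made up) showing the metadata the integration would pick up: the owner default argument maps to a DataHub corpuser and the DAG tags map to DataHub tags. The sketch assumes a recent Airflow 2.x.

```python
# Hypothetical DAG illustrating what "capture ownership" and "capture tags" extract.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="sales_ingest",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    default_args={"owner": "data-team"},  # captured as a DataHub corpuser
    tags=["sales", "daily"],              # captured as DataHub tags
) as dag:
    EmptyOperator(task_id="noop")
```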

UI

On the respective environment, click on the Settings tab to find the DataHub integration settings.

Toggle the DataHub integration and change the settings where necessary. Finally, persist your changes by clicking the Save button.

CLI

Once the connection is set up, you can enable the integration using the conveyor environment update command:

conveyor environment update --name ENV_NAME --airflow-datahub-integration-enabled=true --deletion-protection=false

After enabling the integration, every task running in that specific Airflow environment should automatically send updates to DataHub.
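
For tasks whose lineage the plugin cannot derive automatically, the DataHub Airflow plugin also picks up manually declared inlets and outlets. The sketch below assumes the acryl-datahub Airflow plugin (the Dataset import path has moved between plugin versions, as noted in the comment), and all table, task, and command names are made up.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Import path varies across acryl-datahub plugin versions; older releases
# expose this as `from datahub_provider.entities import Dataset`.
from datahub_airflow_plugin.entities import Dataset

with DAG(dag_id="orders_transform", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    BashOperator(
        task_id="transform_orders",
        bash_command="echo transform",  # placeholder command
        inlets=[Dataset("snowflake", "shop.public.raw_orders")],  # upstream table
        outlets=[Dataset("snowflake", "shop.public.orders")],     # downstream table
    )
```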

Terraform

A final way to configure the DataHub integration is to use the conveyor_environment resource in Terraform. For more details, have a look at the environment resource documentation.