
Publishing lineage to DataHub

Enabling DataHub lineage for all DAGs

Conveyor provides an integration with DataHub out of the box. To integrate Airflow with your DataHub cluster, you have to do two things: create the Airflow DataHub connection and enable the DataHub integration on your environment.

info

As of Airflow 3.0, we use the OpenLineage plugin instead of the DataHub plugin to push events to DataHub. We switched because OpenLineage is the standard for metadata and lineage collection, the plugin is more actively maintained, and it supports Airflow 3.

Creating the Airflow DataHub connection

You have to create a Generic connection. The default name used for the connection is datahub_rest_default.

You can configure the connection in the Airflow UI by going to Admin, Connections, and pressing the plus button. Your configuration should look similar to the screenshot below. The password should be a DataHub auth token.

info

Make sure that you configure the DataHub metadata service (also known as gms) as the server endpoint. If you deploy DataHub using its Helm chart, the gms backend can be reached as follows: https://<hostname>/api/gms

Airflow connection for DataHub
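
If you prefer to manage connections outside the UI, Airflow can also read them from AIRFLOW_CONN_<CONN_ID> environment variables. The sketch below builds the corresponding connection URI with Airflow's Connection helper; the hostname and token are placeholders, and how you inject the resulting environment variable depends on your deployment.

from airflow.models.connection import Connection

# Placeholder values: use your DataHub metadata service endpoint and auth token.
conn = Connection(
    conn_id="datahub_rest_default",
    conn_type="generic",
    host="https://<hostname>/api/gms",
    password="<datahub-auth-token>",
)

# Airflow reads connections from AIRFLOW_CONN_<CONN_ID> environment variables,
# so this URI can be exported as AIRFLOW_CONN_DATAHUB_REST_DEFAULT.
print(conn.get_uri())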

Enabling the DataHub integration

Once the connection is set up, you can activate the DataHub integration on an environment. You can configure the DataHub integration using the Conveyor web app, the CLI, or Terraform.

Configuration options

warning

These configuration options are only supported on Airflow 2.

When enabling the DataHub integration, there are three configuration options you can enable:

  • capture ownership: extracts the owner from the DAG configuration and captures it as a DataHub corpuser (see the sketch after this list)
  • capture tags: extracts the tags from the DAG configuration and captures them as DataHub tags
  • graceful exceptions: if enabled, the exception stack trace is suppressed and only the exception message is visible in the logs.
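
For reference, the first two options read standard attributes of your DAG definition. The sketch below shows a hypothetical DAG whose owner and tags would be picked up; the dag_id, owner, and tag values are placeholders.

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="sample_ingest",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    default_args={"owner": "data-platform"},  # captured as a DataHub corpuser with "capture ownership"
    tags=["ingest", "daily"],  # captured as DataHub tags with "capture tags"
) as dag:
    ...  # tasks go here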

UI

On the respective environment, click on the settings tab, and you will see the following screen:

DataHub settings in the Conveyor UI

Toggle the DataHub integration and change the settings where necessary. Finally, persist your changes by clicking on the save button.

CLI

Once the connection is set, you can enable the integration using the conveyor environment update command:

conveyor environment update --name ENV_NAME --airflow-datahub-integration-enabled=true --deletion-protection=false

After enabling the integration, every task running in that specific Airflow environment should automatically send updates to DataHub.

Terraform

A final way to configure DataHub integration is to use the conveyor_environment resource in Terraform. For more details, have a look at the environment resource.

Dataset awareness

By default, Airflow's built-in Assets configured in the inlets and outlets of a task are only visible in DataHub as a property on that task. To show datasets as first-class entities in DataHub, use the OpenLineage Dataset class when defining your inlets and outlets. Here is an example of how to define a dataset using OpenLineage:

from openlineage.client.event_v2 import Dataset

from conveyor.operators import ConveyorContainerOperatorV2

ConveyorContainerOperatorV2(
    ...,  # the remaining operator arguments (task_id, image, ...) are unchanged
    dag=dag,
    inlets=[Dataset(namespace="s3://input-bucket", name="my-data/raw")],
    outlets=[Dataset(namespace="s3://input-bucket", name="my-data/normalized")],
)
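
Here the namespace identifies the data source (for S3, the bucket as s3://<bucket>) and the name is the path within it, following the OpenLineage naming conventions; DataHub uses this namespace/name pair to resolve the dataset entity that the lineage is attached to.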