
Configure Apache Iceberg for Spark / PySpark jobs

Description

Apache Iceberg is an open table format for analytic datasets. It can be used in Spark applications and integrates with Hive and Glue when working on AWS. This guide describes how to configure your Spark application to use Apache Iceberg.

How to do it

The latest Conveyor Spark images support Apache Iceberg. These images contain all the dependencies you need to start using Iceberg.

Iceberg configuration properties

Before you can read or write data in Iceberg tables from your Spark application, you need to configure Iceberg. The following example shows the configuration needed to use Iceberg with Glue in a PySpark application:

catalog_name: str = "glueCatalog"
iceberg_config = {
    f"spark.sql.catalog.{catalog_name}": "org.apache.iceberg.spark.SparkCatalog",
    f"spark.sql.catalog.{catalog_name}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    f"spark.sql.catalog.{catalog_name}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    f"spark.sql.catalog.{catalog_name}.warehouse": "s3://<some-bucket-name>",
    f"spark.sql.catalog.{catalog_name}.http-client.type": "apache",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.defaultCatalog": f"{catalog_name}",
}

Iceberg supports multiple catalogs, which is why most properties are configured per catalog. In this example there is only one catalog, named glueCatalog, which contains multiple databases with their respective tables.

For more details on the Catalog configuration options, please have a look at the Iceberg documentation.
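For example, once this catalog is configured (and a Spark session has been created with it, as shown further below), tables are addressed with catalog-qualified names. The database and table names in this sketch (my_db, my_table) are placeholders for your own Glue database and table:

# Hypothetical database and table; replace them with your own Glue database and table.
spark_session.sql("SELECT * FROM glueCatalog.my_db.my_table").show()

# Because spark.sql.defaultCatalog is set to glueCatalog, the catalog prefix can be omitted.
spark_session.sql("SELECT * FROM my_db.my_table").show()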

important

By setting spark.sql.catalog.{catalog_name}.http-client.type to apache, we ensure that Iceberg uses the HTTP client packaged in the image. If you do not set this property, the Spark application will fail because Iceberg tries to use an HTTP client that does not exist in our Docker image.

Use the configuration properties when creating a Spark Session

The last step is to apply the previously defined configuration properties to your Spark session. One way is to set them programmatically when creating the Spark session:

from pyspark.sql import SparkSession

spark_builder = SparkSession.builder.appName("some app name").enableHiveSupport()
for key, val in iceberg_config.items():
    spark_builder = spark_builder.config(key, val)

spark_session = spark_builder.getOrCreate()
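With the session in place you can read and write Iceberg tables through the configured catalog. The following is a minimal sketch using the DataFrameWriterV2 API; my_db, my_table and my_table_copy are hypothetical names:

# Read an existing Iceberg table (hypothetical name) into a DataFrame.
df = spark_session.table("my_db.my_table")

# Write the result to a new Iceberg table via the DataFrameWriterV2 API.
df.writeTo("my_db.my_table_copy").using("iceberg").createOrReplace()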

Another option is to set these properties directly on the ConveyorSparkSubmitOperatorV2 as follows:

from conveyor.operators import ConveyorSparkSubmitOperatorV2

ConveyorSparkSubmitOperatorV2(
    dag=dag,
    task_id="task_id1",
    num_executors=1,
    driver_instance_type="mx_small",
    executor_instance_type="mx_small",
    aws_role=role,
    conf={
        "spark.sql.catalog.glueCatalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glueCatalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        ...
    },
    application="local:///<path-to-python-file>.py",
)

Scala Spark configuration

If you want to reference the Iceberg artefacts as provided dependencies in your build tool, make sure to align the artefact versions with those packaged in the container image that you are using.