Configure Apache Iceberg for Spark / PySpark jobs
Description
Apache Iceberg is an open table format for analytic datasets. It can be used in Spark applications and integrates with Hive and Glue when working on AWS. This guide describes how to configure your Spark application to use Apache Iceberg.
How to do it
The latest Conveyor Spark images have support for Apache Iceberg. These images contain all the dependencies needed to start using Apache Iceberg.
Iceberg configuration properties
Before you can read/write data to Iceberg tables from your Spark application, you need to configure Iceberg. An example of the necessary configuration to use Iceberg with Glue in a PySpark application is the following:
catalog_name: str = "glueCatalog"
iceberg_config = {
    f"spark.sql.catalog.{catalog_name}": "org.apache.iceberg.spark.SparkCatalog",
    f"spark.sql.catalog.{catalog_name}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    f"spark.sql.catalog.{catalog_name}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    f"spark.sql.catalog.{catalog_name}.warehouse": "s3://<some-bucket-name>",
    f"spark.sql.catalog.{catalog_name}.http-client.type": "apache",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.defaultCatalog": f"{catalog_name}",
}
Iceberg supports multiple catalogs, which is why many properties can be configured per catalog. In this example we have only one catalog, called glueCatalog, which contains multiple databases with their respective tables. For more details on the catalog configuration options, please have a look at the Iceberg documentation.
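To make the catalog layout concrete, here is a minimal, hypothetical example of how a table would be addressed once a Spark session has been created with this configuration (see the next section). The names my_database and my_table are placeholders, not tables that exist in this guide:

# Fully qualified identifier: <catalog>.<database>.<table>
spark_session.sql("SELECT * FROM glueCatalog.my_database.my_table").show()

# Because spark.sql.defaultCatalog is set to glueCatalog, the catalog prefix can be omitted:
spark_session.sql("SELECT * FROM my_database.my_table").show()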
By setting spark.sql.catalog.{catalog_name}.http-client.type to the value apache, we ensure that Iceberg uses the packaged HTTP client. If you do not set this property, the Spark application will fail because Iceberg will try to use an HTTP client that does not exist in our Docker image.
Use the configuration properties when creating a Spark Session
The last step is to specify the previously defined configuration properties for your Spark Session. One option is to set these properties programmatically when creating your Spark Session:
from pyspark.sql import SparkSession

spark_builder = SparkSession.builder.appName("some app name").enableHiveSupport()
for key, val in iceberg_config.items():
    spark_builder.config(key, val)
spark_session = spark_builder.getOrCreate()
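As a quick sanity check, you could write to and read back an Iceberg table with this session. The sketch below assumes a Glue database named my_database already exists; the table name is equally hypothetical:

from pyspark.sql import Row

df = spark_session.createDataFrame([Row(id=1, name="alice"), Row(id=2, name="bob")])

# DataFrameWriterV2 API: create (or replace) an Iceberg table in the configured catalog.
df.writeTo("glueCatalog.my_database.my_table").using("iceberg").createOrReplace()

# Read the table back through the catalog.
spark_session.table("glueCatalog.my_database.my_table").show()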
Another option is to set these properties directly in the ConveyorSparkSubmitOperatorV2 as follows:
from conveyor.operators import ConveyorSparkSubmitOperatorV2

ConveyorSparkSubmitOperatorV2(
    dag=dag,
    task_id="task_id1",
    num_executors=1,
    driver_instance_type="mx_small",
    executor_instance_type="mx_small",
    aws_role=role,
    conf={
        "spark.sql.catalog.glueCatalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glueCatalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        ...
    },
    application="local:///<path-to-python-file>.py",
)
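With this approach the Iceberg properties are injected at submit time, so the application itself can build a plain Spark session. A minimal sketch of such an application, again with a hypothetical table name, could look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("some app name").enableHiveSupport().getOrCreate()

# The Iceberg catalog settings were already supplied through the operator's conf.
spark.sql("SELECT * FROM my_database.my_table").show()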
Scala Spark configuration
If you want to reference the Iceberg artefacts as provided dependencies in your build tool, make sure to align the artefact versions with those packaged in the container image that you are using.