
Using AWS Glue with Spark 4.x

Spark 4.x is a major release with several breaking changes. One of those changes is that the Hive Metastore (HMS) implementation for AWS Glue no longer works. AWS has not communicated about this, and it does not appear that they plan to re-implement the Glue HMS client for Spark 4.x. As a consequence, reading and writing Glue tables with Spark 4.x has become more complicated.

Many organizations use AWS Glue as their metastore, allowing them to query the data using Athena and other AWS services. Additionally, Glue makes it easy to group tables whose data is spread across many different S3 buckets.

Scenarios

There are three main scenarios that we see when working with Glue and Spark 4.x:

  1. Using Iceberg tables with Glue: This is the recommended way to use Glue with Spark 4.x. Iceberg has a built-in Glue catalog implementation that works out of the box with Spark 4.x. This allows you to read and write Iceberg tables using Glue as the metastore.
  2. Using Delta tables with Glue: Delta Lake does not have a built-in Glue catalog implementation. However, you can still create Delta tables in Glue by first writing the data to S3, generating the symlink_format_manifest files, and then running a Glue crawler that creates the Delta table in Glue based on those manifest files. For reading Delta tables, you need to know the S3 path of the table, which you can get from the Glue table metadata. This approach did not change between Spark 3.5.x and 4.x, but it is more cumbersome than using Iceberg.
  3. Using plain Parquet tables with Glue: Here the situation is similar to Delta tables. You can write the Parquet files to S3 and then use a Glue crawler to create the table in Glue. For reading the Parquet tables, you need to know the S3 path of the table, which you can get from the Glue table metadata. This is very different from Spark 3.5.x, where you could simply read the table using its name.

Recommendations for using AWS Glue with Spark 4.x

If you want to continue using AWS Glue as your metastore with Spark 4.x, we recommend using Iceberg tables for all new development. Iceberg has a built-in AWS Glue catalog implementation that works out of the box in Spark 4.x.

For existing AWS Glue tables that are not in Iceberg format, we recommend continuing to read and write them with Spark 3.5.x. Migrating these tables would require considerable effort for little benefit, so it is best to leave them as they are.
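
For reference, on Spark 3.5.x Glue can still be used as the Hive metastore through the AWS Glue Data Catalog client. The sketch below illustrates this for context; it assumes the Glue Data Catalog Hive client factory jar is available on the classpath (as it is on EMR), and the database and table names are placeholders.

from pyspark.sql import SparkSession

# Spark 3.5.x only: route Hive metastore calls to the AWS Glue Data Catalog.
# Assumes the Glue Data Catalog client jar is on the classpath (e.g. on EMR).
spark = (
    SparkSession.builder.appName("Glue HMS on Spark 3.5.x")
    .config(
        "spark.hadoop.hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)

# Tables registered in Glue can then be read directly by name.
df = spark.table("my_database.my_existing_table")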

Examples

Using Iceberg tables with Glue

The following example shows how to read and write Iceberg tables using Glue as the metastore.

from pyspark.sql import SparkSession

bucket: str = "<some-bucket-name>"  # S3 bucket used as the Iceberg warehouse
catalog_name: str = "default"
iceberg_config = {
    f"spark.sql.catalog.{catalog_name}": "org.apache.iceberg.spark.SparkCatalog",
    f"spark.sql.catalog.{catalog_name}.warehouse": f"s3://{bucket}",
    f"spark.sql.catalog.{catalog_name}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    f"spark.sql.catalog.{catalog_name}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.defaultCatalog": catalog_name,
}

spark_builder = SparkSession.builder.appName("Iceberg with Glue").enableHiveSupport()
for key, val in iceberg_config.items():
    spark_builder = spark_builder.config(key, val)

spark = spark_builder.getOrCreate()

# Read an Iceberg table
df = spark.table("default.my_database.my_table")

# Write to an Iceberg table
df.writeTo("default.my_database.my_new_table").createOrReplace()

Using Delta tables with Glue

The following basic example shows how to read and write a Delta table using Glue as the metastore.

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Standard Delta Lake session configs; the delta-spark package is assumed to be on the classpath.
spark_builder = (
    SparkSession.builder.appName("Delta with Glue")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .enableHiveSupport()
)

spark = spark_builder.getOrCreate()

# Read a Delta table from S3
input_delta_table = DeltaTable.forPath(spark, "s3://<some-bucket-name>/path/to/delta/table")
df = input_delta_table.toDF()

# Write to a Delta table and regenerate the symlink manifest files used by Glue
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(
    "s3://<some-bucket-name>/path/to/delta/table"
)
output_delta_table = DeltaTable.forPath(spark, "s3://<some-bucket-name>/path/to/delta/table")
output_delta_table.generate("symlink_format_manifest")
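
To make the Delta table queryable through Glue and Athena, a crawler can then be pointed at the table location, as described in the scenarios above. The following is a rough boto3 sketch: the crawler name, IAM role, database, and S3 path are placeholders, and the exact crawler configuration depends on your setup.

import boto3

glue = boto3.client("glue", region_name="eu-west-1")

# Hypothetical crawler; name, role and path are placeholders.
glue.create_crawler(
    Name="my-delta-crawler",
    Role="arn:aws:iam::<account-id>:role/<glue-crawler-role>",
    DatabaseName="my_database",
    Targets={
        "DeltaTargets": [
            {
                "DeltaTables": ["s3://<some-bucket-name>/path/to/delta/table/"],
                "WriteManifest": True,
            }
        ]
    },
)
glue.start_crawler(Name="my-delta-crawler")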

Using Parquet tables with Glue

The following basic example shows how to read and write Parquet tables using Glue as the metastore.

from pyspark.sql import SparkSession
import boto3

spark_builder = SparkSession.builder.appName("Parquet with Glue").enableHiveSupport()

spark = spark_builder.getOrCreate()

def get_glue_table_location(database_name: str, table_name: str, region_name: str = "eu-west-1") -> str:
    glue = boto3.client("glue", region_name=region_name)
    response = glue.get_table(DatabaseName=database_name, Name=table_name)
    location = response["Table"]["StorageDescriptor"].get("Location")
    return location

# Read a Parquet table from S3
s3_path = get_glue_table_location("my_database", "my_parquet_table")
df = spark.read.parquet(s3_path)

# Write to a Parquet table
df.write.mode("overwrite").parquet("s3://<some-bucket-name>/path/")