
Spark

Data Minded provides you with base images that work on both AWS and Azure. These images allow you to use AWS Glue as the Hive metastore when running your workloads on Conveyor.

Latest images

The latest-released images for the major Spark versions are:

  • public.ecr.aws/dataminded/spark-k8s-glue:v3.5.1-hadoop-3.3.6-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.5.1-2.13-hadoop-3.3.6-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.4.1-hadoop-3.3.6-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.4.1-2.13-hadoop-3.3.6-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.3.3-hadoop-3.3.5-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.3.3-2.13-hadoop-3.3.5-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.4-hadoop-3.3.5-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.4-2.13-hadoop-3.3.5-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.1.3-hadoop-3.3.5-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.0.3-hadoop-3.3.4-v8
info

We deprecated the Azure-specific images and integrated the Azure libraries into our standard images starting from: public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v4

Image details

The following images have been tested and confirmed to be working:

Spark 3.5.x

| Name | Spark | Scala | Hadoop | OpenJDK | AWS SDK | AWS SDK v2¹ | Python | MSAL | Delta | Iceberg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.5.1-hadoop-3.3.6-v2² | 3.5.1 | 2.12 | 3.3.6 | 17.0.9 | 1.12.367 | 2.20.162 | 3.10.12 | 1.12.0 | 3.1.0 | 1.4.3 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.5.1-2.13-hadoop-3.3.6-v2² | 3.5.1 | 2.13 | 3.3.6 | 17.0.9 | 1.12.367 | 2.20.162 | 3.10.12 | 1.12.0 | 3.1.0 | 1.4.3 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.5.0-hadoop-3.3.6-v2 | 3.5.0 | 2.12 | 3.3.6 | 17.0.9 | 1.12.367 | 2.20.162 | 3.10.12 | 1.12.0 | 3.0.0 | 1.4.2 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.5.0-2.13-hadoop-3.3.6-v2 | 3.5.0 | 2.13 | 3.3.6 | 17.0.9 | 1.12.367 | 2.20.162 | 3.10.12 | 1.12.0 | 3.0.0 | 1.4.2 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.5.0-hadoop-3.3.6-v1 | 3.5.0 | 2.12 | 3.3.6 | 17.0.8 | 1.12.367 | / | 3.10.12 | 1.11.2 | No support³ | No support⁴ |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.5.0-2.13-hadoop-3.3.6-v1 | 3.5.0 | 2.13 | 3.3.6 | 17.0.8 | 1.12.367 | / | 3.10.12 | 1.11.2 | No support³ | No support⁴ |

Starting from this version, we include Dataflint in our Spark images. Dataflint extends the Spark UI to make it easier to understand. For more information on Dataflint, take a look at their project page.

Spark 3.4.x

| Name | Spark | Scala | Hadoop | OpenJDK | AWS SDK | AWS SDK v2¹ | Python | MSAL | Delta | Iceberg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.4.1-hadoop-3.3.6-v1 | 3.4.1 | 2.12 | 3.3.6 | 17.0.7 | 1.12.367 | 2.18.41 | 3.10.6 | 1.11.2 | 2.4.0 | 1.3.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.4.1-2.13-hadoop-3.3.6-v1 | 3.4.1 | 2.13 | 3.3.6 | 17.0.7 | 1.12.367 | 2.18.41 | 3.10.6 | 1.11.2 | 2.4.0 | 1.3.0 |

Spark 3.3.x

| Name | Spark | Scala | Hadoop | OpenJDK | AWS SDK | AWS SDK v2¹ | Python | MSAL | Delta | Iceberg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.3-hadoop-3.3.5-v1 | 3.3.3 | 2.12 | 3.3.5 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.3-2.13-hadoop-3.3.5-v1 | 3.3.3 | 2.13 | 3.3.5 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-hadoop-3.3.5-v2 | 3.3.2 | 2.12 | 3.3.5 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-2.13-hadoop-3.3.5-v2 | 3.3.2 | 2.13 | 3.3.5 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-hadoop-3.3.5-v1 | 3.3.2 | 2.12 | 3.3.5 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-2.13-hadoop-3.3.5-v1 | 3.3.2 | 2.13 | 3.3.5 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-hadoop-3.3.4-v2 | 3.3.2 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-2.13-hadoop-3.3.4-v2 | 3.3.2 | 2.13 | 3.3.4 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-hadoop-3.3.4-v1 | 3.3.2 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-2.13-hadoop-3.3.4-v1 | 3.3.2 | 2.13 | 3.3.4 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.1-hadoop-3.3.4-v2 | 3.3.1 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.1.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.1-2.13-hadoop-3.3.4-v2 | 3.3.1 | 2.13 | 3.3.4 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.1.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.1-hadoop-3.3.4-v1 | 3.3.1 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.1.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.1-2.13-hadoop-3.3.4-v1 | 3.3.1 | 2.13 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.1.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-hadoop-3.3.4-v3 | 3.3.0 | 2.12 | 3.3.4 | 11.0.15 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.1.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-2.13-hadoop-3.3.4-v3 | 3.3.0 | 2.13 | 3.3.4 | 11.0.15 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.1.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-hadoop-3.3.4-v2 | 3.3.0 | 2.12 | 3.3.4 | 11.0.15 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.1.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-2.13-hadoop-3.3.4-v2 | 3.3.0 | 2.13 | 3.3.4 | 11.0.15 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.1.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-hadoop-3.3.4-v1 | 3.3.0 | 2.12 | 3.3.4 | 11.0.15 | 1.12.262 | / | 3.9.2 | 1.11.2 | No support³ | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-2.13-hadoop-3.3.4-v1 | 3.3.0 | 2.13 | 3.3.4 | 11.0.15 | 1.12.262 | / | 3.9.2 | 1.11.2 | No support³ | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-hadoop-3.3.1-v1 | 3.3.0 | 2.12 | 3.3.1 | 11.0.14 | 1.11.901 | / | 3.9.2 | 1.11.2 | No support³ | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-2.13-hadoop-3.3.1-v1 | 3.3.0 | 2.13 | 3.3.1 | 11.0.14 | 1.11.901 | / | 3.9.2 | 1.11.2 | No support³ | / |

Spark 3.2.x

| Name | Spark | Scala | Hadoop | OpenJDK | AWS SDK | AWS SDK v2¹ | Python | MSAL | Delta | Iceberg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.4-hadoop-3.3.5-v1 | 3.2.4 | 2.12 | 3.3.5 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.4-2.13-hadoop-3.3.5-v1 | 3.2.4 | 2.13 | 3.3.5 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.3-hadoop-3.3.4-v1 | 3.2.3 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.3-2.13-hadoop-3.3.4-v1 | 3.2.3 | 2.13 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.2-hadoop-3.3.4-v2 | 3.2.2 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.2-2.13-hadoop-3.3.4-v2 | 3.2.2 | 2.13 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.4-v8 | 3.2.1 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-2.13-hadoop-3.3.4-v8 | 3.2.1 | 2.13 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v7 | 3.2.1 | 2.12 | 3.3.1 | 11.0.15 | 1.11.901 | / | 3.9.2 | 1.11.2 | 1.2.1 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-2.13-hadoop-3.3.1-v7 | 3.2.1 | 2.13 | 3.3.1 | 11.0.15 | 1.11.901 | / | 3.9.2 | 1.11.2 | 1.2.1 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v3 | 3.2.1 | 2.12 | 3.3.1 | 11.0.14 | 1.11.901 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-2.13-hadoop-3.3.1-v3 | 3.2.1 | 2.13 | 3.3.1 | 11.0.14 | 1.11.901 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.0-hadoop-3.3.1 | 3.2.0 | 2.12 | 3.3.1 | 11.0.12 | 1.11.901 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.0-2.13-hadoop-3.3.1-v3 | 3.2.0 | 2.13 | 3.3.1 | 11.0.12 | 1.11.901 | / | 3.9.2 | / | / | / |

Spark 3.1.x

| Name | Spark | Scala | Hadoop | OpenJDK | AWS SDK | AWS SDK v2¹ | Python | MSAL | Delta | Iceberg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.3-hadoop-3.3.5-v1 | 3.1.3 | 2.12 | 3.3.5 | 11.0.16 | 1.12.262 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.3-hadoop-3.3.4-v4 | 3.1.3 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.3-hadoop-3.3.1-v2 | 3.1.3 | 2.12 | 3.3.1 | 11.0.12 | 1.11.901 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.2-hadoop-3.3.1-v3 | 3.1.2 | 2.12 | 3.3.1 | 11.0.12 | 1.11.901 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.2-hadoop-3.3.1-python-3.8-v2 | 3.1.2 | 2.12 | 3.3.1 | 11.0.12 | 1.11.901 | / | 3.8.7 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.2-hadoop-3.3.1-v2 | 3.1.2 | 2.12 | 3.3.1 | 11.0.12 | 1.11.901 | / | 3.7.3 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.2-hadoop-3.3.0-v2 | 3.1.2 | 2.12 | 3.3.0 | 11.0.11 | 1.11.563 | / | 3.7.3 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.1-hadoop-3.3.0 | 3.1.1 | 2.12 | 3.3.0 | 11.0.11 | 1.11.563 | / | 3.7.3 | / | / | / |

Spark 3.0.x

| Name | Spark | Scala | Hadoop | OpenJDK | AWS SDK | AWS SDK v2¹ | Python | MSAL | Delta | Iceberg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.0.3-hadoop-3.3.4-v8 | 3.0.3 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.0.3-hadoop-3.3.1-v6 | 3.0.3 | 2.12 | 3.3.1 | 11.0.15 | 1.11.901 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.0.2-hadoop-3.3.0-v2 | 3.0.2 | 2.12 | 3.3.0 | 11.0.11 | 1.11.563 | / | 3.7.3 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.0.1-hadoop-3.3.0 | 3.0.1 | 2.12 | 3.3.0 | 8u265 | 1.11.563 | / | 3.7.3 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:3.0.0-hadoop-3.2.1 | 3.0.0 | 2.12 | 3.2.1 | 8u252 | 1.11.375 | / | 3.7.3 | / | / | / |

If you do need a package that matches one of the package patterns listed in footnote 1 but that is not part of the Data Minded base image, you can include it in your jar, provided that you use the same version as the one present in the base image you are using.

The jars provided to Spark can be found at /opt/spark/jars inside the base image. You can check the list of included jars using docker run <image name> ls /opt/spark/jars.
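Since the Spark driver runs on the base image, the same check can also be done from inside a job. A minimal sketch, assuming the job runs on one of the images above (so /opt/spark/jars exists):

```python
import os

# The base image ships its Spark jars in /opt/spark/jars;
# listing the directory shows exactly which versions are available.
for jar in sorted(os.listdir("/opt/spark/jars")):
    print(jar)
```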

Background information

This section contains some more background on the Spark 3 images. In Spark 3, the base images were changed to OpenJDK Docker images based on Debian; the previous images were based on Alpine. You can refer to the SPARK-28938 ticket for more information.

The change from Alpine to Debian makes the base image bigger, but it allows packages like pandas to be installed from prebuilt Python wheels, which significantly speeds up the installation process.

The new images also no longer run as root by default. Because of this change, a command such as pip install will fail with a permission error. To execute commands that need root permissions, use the following pattern:

```dockerfile
# Switch to the root user to run privileged commands.
USER 0
RUN pip install pandas

# Switch back to the non-root user used by the official Spark images.
ARG spark_uid=185
USER ${spark_uid}
```

User 0 is the root user; UID 185 is the user used in the official Spark images. Running containers as a non-root user is considered a security best practice.

Upgrading Scala Spark jobs from 3.3.x to 3.4.x

For PySpark jobs, no changes are needed. For Scala Spark jobs, you might see the following error when upgrading from 3.3.x to 3.4.x:

```text
Exception in thread "main" java.nio.file.NoSuchFileException: /opt/spark/work-dir/YOURJARNAME.jar
```

This is the result of a change in Spark: files are now copied to the work-dir after it has first been cleaned up. You can fix this by changing your Dockerfile from:

```dockerfile
FROM public.ecr.aws/dataminded/spark-k8s-glue:v3.4.1-hadoop-3.3.6-v1

COPY build/libs/spark-*-all.jar /opt/spark/work-dir/YOURJARNAME.jar
```

to:

```dockerfile
FROM public.ecr.aws/dataminded/spark-k8s-glue:v3.4.1-hadoop-3.3.6-v1

COPY build/libs/spark-*-all.jar /opt/spark/user-files/YOURJARNAME.jar
```

In your DAGs, you should change the application argument from:

```python
from conveyor.operators import ConveyorSparkSubmitOperatorV2

ConveyorSparkSubmitOperatorV2(
    ...,
    application="local:///opt/spark/work-dir/YOURJARNAME.jar",
)
```

to:

```python
from conveyor.operators import ConveyorSparkSubmitOperatorV2

ConveyorSparkSubmitOperatorV2(
    ...,
    application="local:///opt/spark/user-files/YOURJARNAME.jar",
)
```

Hadoop 3.3.1 and the Java SDK 1.11.901

Hadoop 3.3.1 upgrades the AWS Java SDK dependency to 1.11.901, enabling support for the new authentication method used by the Conveyor V2 Airflow operators. For information on how to use this authentication method, see our documentation Operators: Using an AWS Role.
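As a minimal sketch of what such a job definition can look like (the task name, role name, and jar path are placeholders, and the exact parameter set is described in Operators: Using an AWS Role):

```python
from conveyor.operators import ConveyorSparkSubmitOperatorV2

# Illustrative only: `aws_role` is assumed to be the parameter selecting the
# AWS role; check "Operators: Using an AWS Role" for the exact interface.
ConveyorSparkSubmitOperatorV2(
    task_id="sample_spark_job",
    aws_role="my-job-execution-role",
    application="local:///opt/spark/work-dir/app.jar",
)
```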

Hadoop and Java SDK

The Hadoop AWS library is released and tested against a specific Java SDK version, and that tested version is the one used when creating the Data Minded images. Adding additional, newer Java SDK versions might result in hard-to-debug failures. Always check that you use the same version as listed in the tables above.

From datamindedbe/spark-k8s-glue:3.0.1-hadoop-3.3.0 onwards, the AWS SDK bundle is included so that you don't need to install extra SDK versions yourself.

Packaged AWS dependencies

Starting from the image with tag v3.3.1-hadoop-3.3.4-v2, we package several AWS SDK 2 jars next to the full bundle of AWS SDK 1. We introduced this to provide support for Apache Iceberg; it includes the following AWS dependencies:

  • Glue
  • S3
  • STS
  • KMS
  • DynamoDB
  • Lake Formation
info

AWS SDK 1 and AWS SDK 2 use distinct package names (com.amazonaws vs software.amazon.awssdk), so their classes never conflict. As a result, both dependencies can safely be included in the same Docker image.

There is some impact for Spark applications that also depend on AWS services:

  • If your application uses a service of AWS SDK V1, nothing changes: you do not need to add the respective jar to your Docker image. Alternatively, if you use Gradle/Maven, you can set the AWS SDK dependency to provided. This way the AWS version included in the Spark image is used.

  • If your application depends on a service of AWS SDK V2 that is not packaged in the Docker image (i.e. not in the list above), you need to add it as an explicit dependency.

    It is best to use the same version as the AWS SDK V2 packaged in the respective Spark image. This eliminates potential classpath issues due to version conflicts.

  • If your application depends on a service of AWS SDK V2 that is already packaged in the Docker image, you do not need to add it. In this case, the same approach can be used as described for AWS SDK V1.

  • If your application needs a different version of a jar from AWS SDK V1 or V2 than the packaged one, it is best to remove the packaged version and add all jars of your version. To inspect the AWS jars packaged in the Docker image, you can run ls -l /opt/spark/jars | grep <aws-version> from within the Docker container; a runtime version check from PySpark is sketched below.
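If you want to confirm at runtime which AWS SDK V1 version the driver JVM actually loaded, a minimal sketch from PySpark (using the py4j gateway; VersionInfoUtils is the version utility class shipped with AWS SDK V1) looks like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# AWS SDK V1 reports its own version via VersionInfoUtils; reading it through
# the py4j gateway shows which SDK jar is on the driver classpath.
jvm = spark.sparkContext._jvm
print(jvm.com.amazonaws.util.VersionInfoUtils.getVersion())
```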

Security patches and bugfixes

Log4j 1.x vulnerability

Spark packages log4j 1.x, which contains a vulnerability (CVE-2021-4104) that abuses the JMSAppender to execute malicious code. To mitigate this, we created Spark images that do not package the JMSAppender class. The first versions of the different Spark 3 images that contain this fix are:

  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.0-2.13-hadoop-3.3.1-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.0-hadoop-3.3.1-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.1.2-hadoop-3.3.1-v4
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.0.3-hadoop-3.3.1-v3

Setuptools installing Python packages

Version 60.0.0 of setuptools introduced a breaking change that caused all packages on Debian to be installed in the site-packages directory instead of dist-packages. We fixed this starting from these Spark images:

  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.0-2.13-hadoop-3.3.1-v3
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.0-hadoop-3.3.1-v3
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.1.2-hadoop-3.3.1-v5
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.0.3-hadoop-3.3.1-v4

Downgrade Apache HttpClient

Spark 3.2.0 and 3.2.1 use version 4.5.13 of the Apache HttpClient, which has an issue when validating the hostname of SSL certificates.

We patched these images to contain the latest working version of the HttpClient, namely 4.5.10. This fix is included in the following Spark images and all later versions:

  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.0-hadoop-3.3.1-v4
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v2

CVE 2022-42889

Apache commons-text has a vulnerability, described in CVE-2022-42889. To make sure it cannot be exploited, we removed the StringSubstitutor class from the respective jars. The first image versions containing this fix are:

  • public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-hadoop-3.3.4-v3
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-2.13-hadoop-3.3.4-v3
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.2-hadoop-3.3.4-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.2-2.13-hadoop-3.3.4-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.1.3-hadoop-3.3.1-v3
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.0.3-hadoop-3.3.1-v7

Reading Glue datasets that are partitioned by date

From Spark 3.1 onwards, Spark supports pushing partition filters on date columns down to the Hive metastore. Unfortunately, this introduced a bug with the Glue metastore. If you see an error similar to the following, you are running into this issue:

```text
org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported expression '2023 - 05 - 03'
(Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: <uuid>; Proxy: null)
```
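For illustration, a read like the following can trigger the error on affected versions, because the filter on the date partition column is pushed down to the Glue metastore (the table and column names here are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical Glue table partitioned by a `dt` column of type date;
# the equality filter below is pushed down to the metastore.
events = spark.table("my_db.events").where(F.col("dt") == "2023-05-03")
events.show()
```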

Starting from these Spark image versions, a fix is included to properly support Glue:

  • public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-hadoop-3.3.5-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-2.13-hadoop-3.3.5-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.4-hadoop-3.3.5-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.4-2.13-hadoop-3.3.5-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.1.3-hadoop-3.3.5-v1

Azure MSAL TokenProvider

We fixed an issue in the MSAL TokenProvider in public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v6, where it would not correctly refresh tokens for long-running jobs.

Azure Image details (deprecated)

info

Starting from image public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v4, all Azure libraries are included in the standard images, so there is no need to use these Azure-specific images anymore.

The following images have been tested and confirmed to be working with Azure Blob Storage:

| Name | Spark | Scala | Hadoop | MSAL | Python |
| --- | --- | --- | --- | --- | --- |
| public.ecr.aws/dataminded/spark-k8s-azure:3.2.0-hadoop-3.3.1-v1 | 3.2.0 | 2.12 | 3.3.1 | 1.11.2 | 3.9.2 |
| public.ecr.aws/dataminded/spark-k8s-azure:3.2.1-hadoop-3.3.1-v1 | 3.2.1 | 2.12 | 3.3.1 | 1.11.2 | 3.9.2 |
| public.ecr.aws/dataminded/spark-k8s-azure:3.2.1-2.13-hadoop-3.3.1-v1 | 3.2.1 | 2.13 | 3.3.1 | 1.11.2 | 3.9.2 |

Footnotes

  1. Starting from the image with tag v3.3.1-hadoop-3.3.4-v2, we package several AWS SDK 2 jars next to the full bundle of AWS SDK 1 (more info in the developer guide).

     This is needed to support Apache Iceberg; more details can be found in the Packaged AWS dependencies section.

     To prevent issues resulting from version conflicts, ensure that the following packages are not packaged with your code; a working combination of versions is provided in Data Minded's base images.

     • org.apache.spark:*
     • com.amazonaws:aws-*
     • org.apache.hadoop:*

     If you are using Maven, this means you have to declare these dependencies with <scope>provided</scope>.

  2. Starting from this image version, Dataflint is included in the Spark images; see the note below the Spark 3.5.x table.
  3. Delta did not yet support this Spark version when this image was released. Current compatibility can be found at: https://docs.delta.io/latest/releases.html

  4. Iceberg did not yet support this Spark version when this image was released. Current compatibility can be found at: https://iceberg.apache.org/releases/