
Spark

Data Minded provides you with base images that work on both AWS and Azure. These images allow you to use AWS Glue as the Hive metastore when running your workloads on Conveyor.

Latest images

The latest-released images for the major Spark versions are:

  • public.ecr.aws/dataminded/spark-k8s-glue:v3.5.1-hadoop-3.3.6-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.5.1-2.13-hadoop-3.3.6-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.4.1-hadoop-3.3.6-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.4.1-2.13-hadoop-3.3.6-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.3.3-hadoop-3.3.5-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.3.3-2.13-hadoop-3.3.5-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.4-hadoop-3.3.5-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.4-2.13-hadoop-3.3.5-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.1.3-hadoop-3.3.5-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.0.3-hadoop-3.3.4-v8
info

We deprecated the Azure-specific images and integrated the Azure libraries into our standard images starting from: public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v4

Image details

The following images have been tested and confirmed to be working:

Spark 3.5.x

| Name | Spark | Scala | Hadoop | OpenJDK | AWS SDK | AWS SDK v2¹ | Python | MSAL | Delta | Iceberg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.5.1-hadoop-3.3.6-v2² | 3.5.1 | 2.12 | 3.3.6 | 17.0.9 | 1.12.367 | 2.20.162 | 3.10.12 | 1.12.0 | 3.1.0 | 1.4.3 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.5.1-2.13-hadoop-3.3.6-v2² | 3.5.1 | 2.13 | 3.3.6 | 17.0.9 | 1.12.367 | 2.20.162 | 3.10.12 | 1.12.0 | 3.1.0 | 1.4.3 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.5.0-hadoop-3.3.6-v2 | 3.5.0 | 2.12 | 3.3.6 | 17.0.9 | 1.12.367 | 2.20.162 | 3.10.12 | 1.12.0 | 3.0.0 | 1.4.2 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.5.0-2.13-hadoop-3.3.6-v2 | 3.5.0 | 2.13 | 3.3.6 | 17.0.9 | 1.12.367 | 2.20.162 | 3.10.12 | 1.12.0 | 3.0.0 | 1.4.2 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.5.0-hadoop-3.3.6-v1 | 3.5.0 | 2.12 | 3.3.6 | 17.0.8 | 1.12.367 | / | 3.10.12 | 1.11.2 | No support³ | No support⁴ |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.5.0-2.13-hadoop-3.3.6-v1 | 3.5.0 | 2.13 | 3.3.6 | 17.0.8 | 1.12.367 | / | 3.10.12 | 1.11.2 | No support³ | No support⁴ |

Starting from this version, we include Dataflint in our Spark images. Dataflint extends the Spark UI to make it easier to understand. For more information on Dataflint, take a look at their project page.

Spark 3.4.x

| Name | Spark | Scala | Hadoop | OpenJDK | AWS SDK | AWS SDK v2¹ | Python | MSAL | Delta | Iceberg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.4.1-hadoop-3.3.6-v1 | 3.4.1 | 2.12 | 3.3.6 | 17.0.7 | 1.12.367 | 2.18.41 | 3.10.6 | 1.11.2 | 2.4.0 | 1.3.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.4.1-2.13-hadoop-3.3.6-v1 | 3.4.1 | 2.13 | 3.3.6 | 17.0.7 | 1.12.367 | 2.18.41 | 3.10.6 | 1.11.2 | 2.4.0 | 1.3.0 |

Spark 3.3.x

| Name | Spark | Scala | Hadoop | OpenJDK | AWS SDK | AWS SDK v2¹ | Python | MSAL | Delta | Iceberg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.3-hadoop-3.3.5-v1 | 3.3.3 | 2.12 | 3.3.5 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.3-2.13-hadoop-3.3.5-v1 | 3.3.3 | 2.13 | 3.3.5 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-hadoop-3.3.5-v2 | 3.3.2 | 2.12 | 3.3.5 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-2.13-hadoop-3.3.5-v2 | 3.3.2 | 2.13 | 3.3.5 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-hadoop-3.3.5-v1 | 3.3.2 | 2.12 | 3.3.5 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-2.13-hadoop-3.3.5-v1 | 3.3.2 | 2.13 | 3.3.5 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-hadoop-3.3.4-v2 | 3.3.2 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-2.13-hadoop-3.3.4-v2 | 3.3.2 | 2.13 | 3.3.4 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-hadoop-3.3.4-v1 | 3.3.2 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-2.13-hadoop-3.3.4-v1 | 3.3.2 | 2.13 | 3.3.4 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.2.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.1-hadoop-3.3.4-v2 | 3.3.1 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.1.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.1-2.13-hadoop-3.3.4-v2 | 3.3.1 | 2.13 | 3.3.4 | 11.0.16 | 1.12.262 | 2.18.41 | 3.9.2 | 1.11.2 | 2.1.0 | 1.0.0 |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.1-hadoop-3.3.4-v1 | 3.3.1 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.1.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.1-2.13-hadoop-3.3.4-v1 | 3.3.1 | 2.13 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.1.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-hadoop-3.3.4-v3 | 3.3.0 | 2.12 | 3.3.4 | 11.0.15 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.1.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-2.13-hadoop-3.3.4-v3 | 3.3.0 | 2.13 | 3.3.4 | 11.0.15 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.1.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-hadoop-3.3.4-v2 | 3.3.0 | 2.12 | 3.3.4 | 11.0.15 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.1.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-2.13-hadoop-3.3.4-v2 | 3.3.0 | 2.13 | 3.3.4 | 11.0.15 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.1.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-hadoop-3.3.4-v1 | 3.3.0 | 2.12 | 3.3.4 | 11.0.15 | 1.12.262 | / | 3.9.2 | 1.11.2 | No support³ | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-2.13-hadoop-3.3.4-v1 | 3.3.0 | 2.13 | 3.3.4 | 11.0.15 | 1.12.262 | / | 3.9.2 | 1.11.2 | No support³ | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-hadoop-3.3.1-v1 | 3.3.0 | 2.12 | 3.3.1 | 11.0.14 | 1.11.901 | / | 3.9.2 | 1.11.2 | No support³ | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-2.13-hadoop-3.3.1-v1 | 3.3.0 | 2.13 | 3.3.1 | 11.0.14 | 1.11.901 | / | 3.9.2 | 1.11.2 | No support³ | / |

Spark 3.2.x

| Name | Spark | Scala | Hadoop | OpenJDK | AWS SDK | AWS SDK v2¹ | Python | MSAL | Delta | Iceberg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.4-hadoop-3.3.5-v1 | 3.2.4 | 2.12 | 3.3.5 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.4-2.13-hadoop-3.3.5-v1 | 3.2.4 | 2.13 | 3.3.5 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.3-hadoop-3.3.4-v1 | 3.2.3 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.3-2.13-hadoop-3.3.4-v1 | 3.2.3 | 2.13 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.2-hadoop-3.3.4-v2 | 3.2.2 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.2-2.13-hadoop-3.3.4-v2 | 3.2.2 | 2.13 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.4-v8 | 3.2.1 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-2.13-hadoop-3.3.4-v8 | 3.2.1 | 2.13 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | 1.11.2 | 2.0.0 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v7 | 3.2.1 | 2.12 | 3.3.1 | 11.0.15 | 1.11.901 | / | 3.9.2 | 1.11.2 | 1.2.1 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-2.13-hadoop-3.3.1-v7 | 3.2.1 | 2.13 | 3.3.1 | 11.0.15 | 1.11.901 | / | 3.9.2 | 1.11.2 | 1.2.1 | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v3 | 3.2.1 | 2.12 | 3.3.1 | 11.0.14 | 1.11.901 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-2.13-hadoop-3.3.1-v3 | 3.2.1 | 2.13 | 3.3.1 | 11.0.14 | 1.11.901 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.0-hadoop-3.3.1 | 3.2.0 | 2.12 | 3.3.1 | 11.0.12 | 1.11.901 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.2.0-2.13-hadoop-3.3.1-v3 | 3.2.0 | 2.13 | 3.3.1 | 11.0.12 | 1.11.901 | / | 3.9.2 | / | / | / |

Spark 3.1.x

| Name | Spark | Scala | Hadoop | OpenJDK | AWS SDK | AWS SDK v2¹ | Python | MSAL | Delta | Iceberg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.3-hadoop-3.3.5-v1 | 3.1.3 | 2.12 | 3.3.5 | 11.0.16 | 1.12.262 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.3-hadoop-3.3.4-v4 | 3.1.3 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.3-hadoop-3.3.1-v2 | 3.1.3 | 2.12 | 3.3.1 | 11.0.12 | 1.11.901 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.2-hadoop-3.3.1-v3 | 3.1.2 | 2.12 | 3.3.1 | 11.0.12 | 1.11.901 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.2-hadoop-3.3.1-python-3.8-v2 | 3.1.2 | 2.12 | 3.3.1 | 11.0.12 | 1.11.901 | / | 3.8.7 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.2-hadoop-3.3.1-v2 | 3.1.2 | 2.12 | 3.3.1 | 11.0.12 | 1.11.901 | / | 3.7.3 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.2-hadoop-3.3.0-v2 | 3.1.2 | 2.12 | 3.3.0 | 11.0.11 | 1.11.563 | / | 3.7.3 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.1.1-hadoop-3.3.0 | 3.1.1 | 2.12 | 3.3.0 | 11.0.11 | 1.11.563 | / | 3.7.3 | / | / | / |

Spark 3.0.x

| Name | Spark | Scala | Hadoop | OpenJDK | AWS SDK | AWS SDK v2¹ | Python | MSAL | Delta | Iceberg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.0.3-hadoop-3.3.4-v8 | 3.0.3 | 2.12 | 3.3.4 | 11.0.16 | 1.12.262 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.0.3-hadoop-3.3.1-v6 | 3.0.3 | 2.12 | 3.3.1 | 11.0.15 | 1.11.901 | / | 3.9.2 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.0.2-hadoop-3.3.0-v2 | 3.0.2 | 2.12 | 3.3.0 | 11.0.11 | 1.11.563 | / | 3.7.3 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:v3.0.1-hadoop-3.3.0 | 3.0.1 | 2.12 | 3.3.0 | 8u265 | 1.11.563 | / | 3.7.3 | / | / | / |
| public.ecr.aws/dataminded/spark-k8s-glue:3.0.0-hadoop-3.2.1 | 3.0.0 | 2.12 | 3.2.1 | 8u252 | 1.11.375 | / | 3.7.3 | / | / | / |

If you do need a package that matches one of the package patterns listed in footnote 1 but that is not part of the Data Minded base image, you can include it in your jar, provided that you use the same version as the one present in the base image you are using.

The jars provided to Spark can be found at /opt/spark/jars inside the base image. You can check the list of included jars using docker run <image name> ls /opt/spark/jars.
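Since the Spark driver runs on the base image, the same check can also be done from inside a job. A minimal sketch, assuming the job runs on one of the images above (so /opt/spark/jars exists):

```python
import os

# The base image ships its Spark jars in /opt/spark/jars;
# listing the directory shows exactly which versions are available.
for jar in sorted(os.listdir("/opt/spark/jars")):
    print(jar)
```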

Background information

This section contains some more background on the Spark 3 images. In Spark 3, the base images were changed to OpenJDK Docker images based on Debian; the previous images were based on Alpine. You can refer to the SPARK-28938 ticket for more information.

The change from Alpine to Debian makes the base image bigger, but it allows packages like pandas to be installed from prebuilt Python wheels, which significantly speeds up the installation process.

The new images also no longer run as root by default. Because of this change, a command such as pip install will fail with a permission error. To execute commands that need root permissions, use the following pattern:

```dockerfile
# Switch to the root user to run privileged commands.
USER 0
RUN pip install pandas

# Switch back to the non-root user used by the official Spark images.
ARG spark_uid=185
USER ${spark_uid}
```

User 0 is the root user; UID 185 is the user used in the official Spark images. Running containers as a non-root user is considered a security best practice.

Upgrading Scala Spark jobs from 3.3.x to 3.4.x

For PySpark jobs, no changes are needed. For Scala Spark jobs, you might see the following error when upgrading from 3.3.x to 3.4.x:

```text
Exception in thread "main" java.nio.file.NoSuchFileException: /opt/spark/work-dir/YOURJARNAME.jar
```

This is the result of a change in Spark: files are now copied to the work-dir after it has first been cleaned up. You can fix this by changing your Dockerfile from:

```dockerfile
FROM public.ecr.aws/dataminded/spark-k8s-glue:v3.4.1-hadoop-3.3.6-v1

COPY build/libs/spark-*-all.jar /opt/spark/work-dir/YOURJARNAME.jar
```

to:

```dockerfile
FROM public.ecr.aws/dataminded/spark-k8s-glue:v3.4.1-hadoop-3.3.6-v1

COPY build/libs/spark-*-all.jar /opt/spark/user-files/YOURJARNAME.jar
```

In your DAGs, you should change the application argument from:

```python
from conveyor.operators import ConveyorSparkSubmitOperatorV2

ConveyorSparkSubmitOperatorV2(
    ...,
    application="local:///opt/spark/work-dir/YOURJARNAME.jar",
)
```

to:

```python
from conveyor.operators import ConveyorSparkSubmitOperatorV2

ConveyorSparkSubmitOperatorV2(
    ...,
    application="local:///opt/spark/user-files/YOURJARNAME.jar",
)
```

Hadoop 3.3.1 and the Java SDK 1.11.901

Hadoop 3.3.1 upgrades the AWS Java SDK dependency to 1.11.901, enabling support for the new authentication method used by the Conveyor V2 Airflow operators. For information on how to use this authentication method, see our documentation Operators: Using an AWS Role.
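As a minimal sketch of what such a job definition can look like (the task name, role name, and jar path are placeholders, and the exact parameter set is described in Operators: Using an AWS Role):

```python
from conveyor.operators import ConveyorSparkSubmitOperatorV2

# Illustrative only: `aws_role` is assumed to be the parameter selecting the
# AWS role; check "Operators: Using an AWS Role" for the exact interface.
ConveyorSparkSubmitOperatorV2(
    task_id="sample_spark_job",
    aws_role="my-job-execution-role",
    application="local:///opt/spark/work-dir/app.jar",
)
```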

Hadoop and Java SDK

The Hadoop AWS library is released and tested against a specific Java SDK version, and that tested version is the one used when creating the Data Minded images. Adding additional, newer Java SDK versions might result in hard-to-debug failures. Always check that you use the same version as listed in the tables above.

From datamindedbe/spark-k8s-glue:3.0.1-hadoop-3.3.0 onwards, the AWS SDK bundle is included so that you don't need to install extra SDK versions yourself.

Packaged AWS dependencies

Starting from the image with tag v3.3.1-hadoop-3.3.4-v2, we package several AWS SDK 2 jars next to the full bundle of AWS SDK 1. We introduced this to provide support for Apache Iceberg; it includes the following AWS dependencies:

  • Glue
  • S3
  • STS
  • KMS
  • DynamoDB
  • Lake Formation
info

AWS SDK 1 and AWS SDK 2 use distinct package names (com.amazonaws vs software.amazon.awssdk), so their classes never conflict. As a result, both dependencies can safely be included in the same Docker image.

There is some impact for Spark applications that also depend on AWS services:

  • If your application uses a service of AWS SDK V1, nothing changes: you do not need to add the respective jar to your Docker image. Alternatively, if you use Gradle/Maven, you can set the AWS SDK dependency to provided. This way the AWS version included in the Spark image is used.

  • If your application depends on a service of AWS SDK V2 that is not packaged in the Docker image (i.e. not in the list above), you need to add it as an explicit dependency.

    It is best to use the same version as the AWS SDK V2 packaged in the respective Spark image. This eliminates potential classpath issues due to version conflicts.

  • If your application depends on a service of AWS SDK V2 that is already packaged in the Docker image, you do not need to add it. In this case, the same approach can be used as described for AWS SDK V1.

  • If your application needs a different version of a jar from AWS SDK V1 or V2 than the packaged one, it is best to remove the packaged version and add all jars of your version. To inspect the AWS jars packaged in the Docker image, you can run ls -l /opt/spark/jars | grep <aws-version> from within the Docker container; a runtime version check from PySpark is sketched below.
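If you want to confirm at runtime which AWS SDK V1 version the driver JVM actually loaded, a minimal sketch from PySpark (using the py4j gateway; VersionInfoUtils is the version utility class shipped with AWS SDK V1) looks like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# AWS SDK V1 reports its own version via VersionInfoUtils; reading it through
# the py4j gateway shows which SDK jar is on the driver classpath.
jvm = spark.sparkContext._jvm
print(jvm.com.amazonaws.util.VersionInfoUtils.getVersion())
```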

Security patches and bugfixes

Log4j 1.x vulnerability

Spark packages log4j 1.x, which contains a vulnerability (CVE-2021-4104) that abuses the JMSAppender to execute malicious code. To mitigate this, we created Spark images that do not package the JMSAppender class. The first versions of the different Spark 3 images that contain this fix are:

  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.0-2.13-hadoop-3.3.1-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.0-hadoop-3.3.1-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.1.2-hadoop-3.3.1-v4
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.0.3-hadoop-3.3.1-v3

Setuptools installing Python packages

Version 60.0.0 of setuptools introduced a breaking change that caused all packages on Debian to be installed in the site-packages directory instead of dist-packages. We fixed this starting from these Spark images:

  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.0-2.13-hadoop-3.3.1-v3
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.0-hadoop-3.3.1-v3
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.1.2-hadoop-3.3.1-v5
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.0.3-hadoop-3.3.1-v4

Downgrade Apache HttpClient

Spark 3.2.0 and 3.2.1 use version 4.5.13 of the Apache HttpClient, which has an issue when validating the hostname of SSL certificates.

We patched these images to contain the latest working version of the HttpClient, namely 4.5.10. This fix is included in the following Spark images and all later versions:

  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.0-hadoop-3.3.1-v4
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v2

CVE 2022-42889

Apache commons-text has a vulnerability, described in CVE-2022-42889. To make sure it cannot be exploited, we removed the StringSubstitutor class from the respective jars. The first image versions containing this fix are:

  • public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-hadoop-3.3.4-v3
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.3.0-2.13-hadoop-3.3.4-v3
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.2-hadoop-3.3.4-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.2-2.13-hadoop-3.3.4-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.1.3-hadoop-3.3.1-v3
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.0.3-hadoop-3.3.1-v7

Reading Glue datasets that are partitioned by date

From Spark 3.1 onwards, Spark supports pushing partition filters on date columns down to the Hive metastore. Unfortunately, this introduced a bug with the Glue metastore. If you see an error similar to the following, you are running into this issue:

```text
org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported expression '2023 - 05 - 03'
(Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: <uuid>; Proxy: null)
```
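For illustration, a read like the following can trigger the error on affected versions, because the filter on the date partition column is pushed down to the Glue metastore (the table and column names here are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical Glue table partitioned by a `dt` column of type date;
# the equality filter below is pushed down to the metastore.
events = spark.table("my_db.events").where(F.col("dt") == "2023-05-03")
events.show()
```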

Starting from these Spark image versions, a fix is included to properly support Glue:

  • public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-hadoop-3.3.5-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.3.2-2.13-hadoop-3.3.5-v2
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.4-hadoop-3.3.5-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.2.4-2.13-hadoop-3.3.5-v1
  • public.ecr.aws/dataminded/spark-k8s-glue:v3.1.3-hadoop-3.3.5-v1

Azure MSAL TokenProvider

We fixed an issue in the MSAL TokenProvider in public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v6, where it would not correctly refresh tokens for long-running jobs.

Azure Image details (deprecated)

info

Starting from image public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v4, all Azure libraries are included in the standard images, so there is no need to use these Azure-specific images anymore.

The following images have been tested and confirmed to be working with Azure Blob Storage:

| Name | Spark | Scala | Hadoop | MSAL | Python |
| --- | --- | --- | --- | --- | --- |
| public.ecr.aws/dataminded/spark-k8s-azure:3.2.0-hadoop-3.3.1-v1 | 3.2.0 | 2.12 | 3.3.1 | 1.11.2 | 3.9.2 |
| public.ecr.aws/dataminded/spark-k8s-azure:3.2.1-hadoop-3.3.1-v1 | 3.2.1 | 2.12 | 3.3.1 | 1.11.2 | 3.9.2 |
| public.ecr.aws/dataminded/spark-k8s-azure:3.2.1-2.13-hadoop-3.3.1-v1 | 3.2.1 | 2.13 | 3.3.1 | 1.11.2 | 3.9.2 |

Footnotes

  1. Starting from the image with tag v3.3.1-hadoop-3.3.4-v2, we package several AWS SDK 2 jars next to the full bundle of AWS SDK 1 (more info in the developer guide).

     This is needed to support Apache Iceberg; more details can be found in the Packaged AWS dependencies section.

     To prevent issues resulting from version conflicts, ensure that the following packages are not packaged with your code; a working combination of versions is provided in Data Minded's base images.

     • org.apache.spark:*
     • com.amazonaws:aws-*
     • org.apache.hadoop:*

     If you are using Maven, this means you have to declare these dependencies with <scope>provided</scope>.

  2. Starting from this image version, Dataflint is included in the Spark images; see the note below the Spark 3.5.x table.
  3. Delta did not yet support this Spark version when this image was released. Current compatibility can be found at: https://docs.delta.io/latest/releases.html

  4. Iceberg did not yet support this Spark version when this image was released. Current compatibility can be found at: https://iceberg.apache.org/releases/