Add custom jar for Spark / PySpark jobs

Description

Sometimes you need a custom connector for Spark (e.g., the Snowflake connector) that we do not package by default in our Spark Docker image.

How to do it

Before you can use a custom connector in Spark/PySpark code, you need to make sure the jar file is on the classpath of your Spark job.

You can accomplish this by copying the jar file to the /opt/spark/jars folder in our base image.

Download the jar

You can download the jar file manually, but it is better to use a dependency management tool (e.g., Gradle or Maven). If you are using Maven, you can have the maven-dependency-plugin copy the spark-snowflake connector by adding the following to your pom.xml:

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-dependency-plugin</artifactId>
      <version>3.3.0</version>
      <executions>
        <execution>
          <id>copy</id>
          <phase>package</phase>
          <goals>
            <goal>copy</goal>
          </goals>
          <configuration>
            <artifactItems>
              <artifactItem>
                <groupId>net.snowflake</groupId>
                <artifactId>spark-snowflake_2.13</artifactId>
                <version>2.10.0-spark_3.2</version>
                <type>jar</type>
                <overWrite>false</overWrite>
                <outputDirectory>jars</outputDirectory>
              </artifactItem>
            </artifactItems>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

To download the connector, run mvn package. This copies the connector jar into your ./jars directory.
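Before building the image, it can help to confirm that the jar actually landed in ./jars. The sketch below is one way to do that; require_jar is a hypothetical helper (not part of Maven or Spark), and the file name pattern assumes the plugin's default artifactId-version.jar naming.

```shell
# require_jar is a hypothetical helper: it succeeds if at least one file
# matching the given glob exists in the given directory.
require_jar() {
  dir="$1"      # directory to inspect, e.g. ./jars
  pattern="$2"  # file name glob, e.g. 'spark-snowflake_*.jar'
  for f in "$dir"/$pattern; do
    # When the glob matches nothing, $f is the literal pattern and -f fails.
    if [ -f "$f" ]; then
      echo "found: $f"
      return 0
    fi
  done
  echo "no jar matching $pattern in $dir" >&2
  return 1
}

# Usage, after running `mvn package`:
#   require_jar ./jars 'spark-snowflake_*.jar'
```

Running the check right after mvn package makes a failed or misconfigured download visible before the Docker build, where a COPY with an empty glob would fail with a less obvious message.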

Add the jar to your Docker image

All that is left is to copy the jar file into your Docker image with the following Dockerfile:

FROM public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v3

COPY jars/*.jar /opt/spark/jars/

Combine the previous steps into one

If you do not want to install Maven or run two separate steps, you can use a Docker multi-stage build to download the connector and copy it into your Spark image. You then only have to run:

docker build . -t myimage

with a Dockerfile similar to the following:

FROM maven:3.5-jdk-8-alpine AS builder

WORKDIR /opt/mvn/work-dir
COPY pom.xml /opt/mvn/work-dir/
RUN mvn package

FROM public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v3
COPY --from=builder /opt/mvn/work-dir/jars/*.jar /opt/spark/jars/
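
Once the image is built, you may want to confirm that the connector is in /opt/spark/jars. The following is a minimal sketch: has_jar is a hypothetical helper, myimage is the tag from the build command above, and overriding the entrypoint with ls assumes the base image ships an ls binary.

```shell
# has_jar is a hypothetical helper: it reads a file listing on stdin and
# succeeds if any line starts with the given prefix and ends in .jar.
has_jar() {
  prefix="$1"
  grep -q "^${prefix}.*\.jar$"
}

# Usage (assumes Docker is installed and the image is tagged myimage):
#   docker run --rm --entrypoint ls myimage /opt/spark/jars \
#     | has_jar spark-snowflake_ && echo "connector is on the classpath"
```

Listing the directory inside the built image checks the final result of both stages, so the same check works for the two-step approach and the multi-stage build.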