Add custom jar for Spark / PySpark jobs
Description
Sometimes you need to work with custom connectors (e.g., the Snowflake connector) for Spark that we do not package by default into our Spark Docker image.
How to do it
Before you can use a custom connector in Spark/PySpark code, you need to make sure the jar file is on the classpath of your Spark job.
You can accomplish this by copying the jar file to the /opt/spark/jars folder in our base image.
Download the jar
You can download the jar file manually, but it is better to use a dependency management tool (e.g., Gradle or Maven). If you are using Maven, you can configure the maven-dependency-plugin to copy the spark-snowflake connector as follows:
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-dependency-plugin</artifactId>
      <version>3.3.0</version>
      <executions>
        <execution>
          <id>copy</id>
          <phase>package</phase>
          <goals>
            <goal>copy</goal>
          </goals>
          <configuration>
            <artifactItems>
              <artifactItem>
                <groupId>net.snowflake</groupId>
                <artifactId>spark-snowflake_2.13</artifactId>
                <version>2.10.0-spark_3.2</version>
                <type>jar</type>
                <overWrite>false</overWrite>
                <outputDirectory>jars</outputDirectory>
              </artifactItem>
            </artifactItems>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
To download the connector, run mvn package, which will place the jar in your ./jars directory.
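Before building the image, it can be handy to verify that the jar actually landed in ./jars. The helper below is purely illustrative (it is not part of the build, and the function name is our own invention):

```python
from pathlib import Path


def find_connector_jars(jars_dir: str = "jars") -> list[str]:
    # Return the spark-snowflake jar filenames found under jars_dir,
    # sorted alphabetically; an empty list means mvn package has not
    # produced the connector yet.
    return sorted(p.name for p in Path(jars_dir).glob("spark-snowflake*.jar"))
```

Running find_connector_jars() after mvn package should list the connector jar; an empty result means the copy goal did not run.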
Add the jar to your Docker image
All that is left to do is copy the jar file into your Docker image with the following snippet:
FROM public.ecr.aws/datamindedbe/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v3
COPY jars/*.jar /opt/spark/jars/
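With the jar baked into the image, a PySpark job can load the connector by its format name (net.snowflake.spark.snowflake). The sketch below shows what such a read could look like; all connection options are placeholder values, not a working configuration, and read_snowflake_table is a hypothetical helper:

```python
# Placeholder Snowflake connection options -- replace every value
# with your own account details before running for real.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",  # hypothetical account URL
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "MY_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "MY_WH",
}


def read_snowflake_table(spark, table: str):
    # Build a DataFrame by reading a Snowflake table through the
    # spark-snowflake connector, which must be on the classpath
    # (e.g., copied into /opt/spark/jars as shown above).
    return (
        spark.read.format("net.snowflake.spark.snowflake")
        .options(**sf_options)
        .option("dbtable", table)
        .load()
    )
```

In a job you would pass an active SparkSession, e.g. df = read_snowflake_table(spark, "MY_TABLE").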
Combine the previous steps into one
If you do not want to install Maven or use two distinct steps, you can use a Docker multi-stage build to download the connector and copy it into your Spark image. The Dockerfile will look similar to the one shown below, and you only have to run:
docker build . -t myimage
FROM maven:3.5-jdk-8-alpine AS builder
WORKDIR /opt/mvn/work-dir
COPY pom.xml /opt/mvn/work-dir
RUN mvn package
FROM public.ecr.aws/dataminded/spark-k8s-glue:v3.2.1-hadoop-3.3.1-v3
COPY --from=builder /opt/mvn/work-dir/jars/*.jar /opt/spark/jars/