Usage with Poetry

Introduction

Poetry is a tool for dependency management and packaging in Python. It offers a lockfile to ensure repeatable installs (similar to pip-tools and Pipenv), and can build your project for distribution (similar to setuptools).

One of the core features of Poetry is project environment isolation. This means that it always works in isolation from your global Python installation, instead automatically creating and managing virtual environments for your projects.

Spark container images

The container images for Spark that are provided by Conveyor ship with a Spark installation that is linked into the system Python of the container. Since Poetry will not work with the (global) system Python, this makes it difficult to combine Poetry with the Spark container images.

Because of this mismatch in approaches, it is best not to depend on Poetry when using the Spark container images as a base. However, this does not mean that Poetry cannot be used to manage your Python project and dependencies.

A possible approach is to use Poetry to prepare artefacts that are then copied into the Spark container and installed there. This approach does not require Poetry to be present in the final container image and avoids the issue with virtual environments.

Example

This example leverages a multi-stage build to keep the size of the final container to a minimum.

Dockerfile
# Base image that is shared between the builder and final image
FROM public.ecr.aws/dataminded/spark-k8s-glue:v3.5.2-hadoop-3.3.6-v1 AS spark

ENV PYSPARK_PYTHON=python3
USER 0

# Builder image that will install poetry and prepare the artefacts
FROM spark AS builder
RUN python3 -m pip install poetry

COPY . .

# Transform the dependencies recorded by Poetry into a requirements.txt file
RUN poetry export --without-hashes --format=requirements.txt > /requirements.txt

# Package your project into a Python wheel
RUN poetry build -o /dist


# Build your final image that will be used to run your project
FROM spark AS final

WORKDIR /opt/spark/work-dir

# Copy the requirements.txt from the builder image and install
COPY --from=builder /requirements.txt .
RUN python3 -m pip install -r requirements.txt

# Copy the Python wheel from the builder image and install
COPY --from=builder /dist ./dist
RUN python3 -m pip install dist/*.whl

# Copy your main Python scripts to the container to make it easier to start your application (optional)
COPY ./src/app ./app

# Spark images run under this user
ARG SPARK_UID=185
USER ${SPARK_UID}

With this Dockerfile, you can run the Spark application located in your project under src/app/main.py as follows:

dags/my_dag.py
from conveyor.operators import ConveyorSparkSubmitOperatorV2

ConveyorSparkSubmitOperatorV2(
    task_id="my_task",
    application="local:///opt/spark/work-dir/app/main.py",
    ...
)

FAQ

Should I use Poetry instead of other tools?

Poetry is an opinionated framework whose ideas fit local development well, but require some extra effort to integrate with containers. This is the main reason we provide this specific guide for Poetry and not for alternatives like pip-tools or Pipenv. Other dependency managers work in a less "managed" fashion, which makes their integration with containers more straightforward and a dedicated guide unnecessary. If you are using another tool and have difficulties getting your project properly installed into the Spark base images, please reach out to your support channel.
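As an illustration, the following is a minimal sketch of how a project that keeps a pre-generated requirements.txt in its root (for instance produced by pip-compile; the file name and workflow are assumptions, not part of the example above) could install its dependencies directly into the Spark base image without needing a builder stage:

# Minimal sketch, assuming a pip-compiled requirements.txt is committed in the project root
FROM public.ecr.aws/dataminded/spark-k8s-glue:v3.5.2-hadoop-3.3.6-v1

ENV PYSPARK_PYTHON=python3
USER 0

WORKDIR /opt/spark/work-dir

# Install the pinned dependencies straight from the committed requirements.txt
COPY requirements.txt .
RUN python3 -m pip install -r requirements.txt

# Copy your main Python scripts, mirroring the example above (optional)
COPY ./src/app ./app

# Switch back to the user that Spark images run under
ARG SPARK_UID=185
USER ${SPARK_UID}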

Can I install pyspark via Poetry and avoid this problem?

The Spark container images contain a working Spark installation, together with a selection of integrations (AWS Glue Catalog, Delta Lake, Apache Iceberg). Installing another copy of Spark on top of the existing one will greatly increase the size of your container and prevent you from using the integrations provided by the Conveyor team. This is not an approach we recommend.
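If you do want pyspark available for local development, one option is to declare it in a separate Poetry dependency group (for example a group called dev; the group name is an assumption about your project, not something the images require) and exclude that group when exporting in the builder stage, so it never ends up in the image. A minimal sketch of the adjusted export step from the example above:

# Sketch: exclude a local-only "dev" group (which would contain pyspark) from the export
RUN poetry export --without dev --without-hashes --format=requirements.txt > /requirements.txt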

Can I change the entrypoint of my container to Poetry?

Changing the entrypoint of a Spark container is likely to break the Spark setup in distributed applications. In these cases, launching an application means starting a container that runs spark-submit. This process then autonomously starts a driver process and workers, but these secondary containers expect their entrypoint to be Spark, not Poetry (or something else). Except in highly specific cases, we recommend running the Spark containers with their default entrypoint.