Spark / PySpark issues

Spark job throws OOMKilled errors

OOMKilled errors mean that Kubernetes killed your Spark pod because it consumed too much memory; Kubernetes does this to protect the other applications running on the same machines. If you see OOMKilled errors rather than Java virtual machine (JVM) OutOfMemoryError exceptions, memory is being consumed outside the JVM. This typically happens in PySpark jobs that run UDFs, but it can also happen in regular Spark jobs running on large nodes.

To fix this, you can tweak the spark.kubernetes.memoryOverheadFactor parameter. This setting controls the fraction of pod memory allocated to non-JVM memory, which covers off-heap allocations, non-JVM tasks, and various system processes. It defaults to 0.10 for JVM-based jobs and 0.40 for non-JVM jobs; the value is higher for non-JVM jobs because they need more memory outside the JVM heap. Increasing the default value should eliminate this error.
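
As an illustration (a minimal sketch; the other operator arguments are omitted and the factor value is only an example, not a recommendation), the setting can be passed through the conf dictionary of the Conveyor Spark operator:

from conveyor.operators import ConveyorSparkSubmitOperatorV2

ConveyorSparkSubmitOperatorV2(
    ...,
    conf={
        # Reserve a larger fraction of pod memory for non-JVM usage (e.g. PySpark UDFs).
        "spark.kubernetes.memoryOverheadFactor": "0.4",
    },
)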

Spark job succeeds but the logs contain error messages com.amazonaws.services.s3.model.MultiObjectDeleteException

These error messages pop up in the Spark logs when using fine-grained access permissions for S3 access. They are typically harmless, but can be annoying when you are trying to debug real errors. The newer Spark images provided by the Conveyor team have a log4j file built in that you can use to reduce the noise in the Spark logs. This log4j configuration file is used by default. If you want to reduce the noise of the logs even further, you can either update the existing file or create a new one and add the following configuration to the conf dictionary you pass to ConveyorSparkSubmitOperator or ConveyorSparkSubmitOperatorV2:

"spark.driver.extraJavaOptions": "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties"

The first images that contain this log4j.properties file are listed below. All future images released by Conveyor will contain this properties file.

public.ecr.aws/dataminded/spark-k8s:2.4.3-2.11-hadoop-2.9.2-v2
public.ecr.aws/dataminded/spark-k8s:2.4.3-2.12-hadoop-2.9.2-v2
public.ecr.aws/dataminded/spark-k8s-glue:2.4.3-2.11-hadoop-2.9.2-v2
public.ecr.aws/dataminded/spark-k8s-glue:2.4.3-2.12-hadoop-2.9.2-v2

The Spark application shows as failed in Airflow, but the Conveyor UI says it is still running

In EKS 1.15, Amazon changed how etcd works. This change breaks how spark-submit handles long-running Spark jobs. We have made a pull request to Spark to fix this issue, and the fix is included in the Spark 3.x Docker images provided by Conveyor.

Glue org.apache.hadoop.hive.metastore.api.InvalidObjectException

If you ever get a message like the following:

org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported expression '2022 - 02 - 10' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: ; Proxy: null)
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 485, in show
print(self._jdf.showString(n, 20, vertical))
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported expression '2022 - 02 - 10' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: ; Proxy: null)

This can happen when you are using a Spark version that has an issue with the Glue connection when reading datasets partitioned by date. We are looking into a fix with AWS. In the meantime, set the following Spark settings:

from conveyor.operators import ConveyorSparkSubmitOperatorV2

ConveyorSparkSubmitOperatorV2(
    ...,
    conf={
        "spark.sql.hive.metastorePartitionPruning": "false",
        "spark.sql.hive.convertMetastoreParquet": "false",
    },
)

This disables partition pruning in the Hive metastore, but Spark still prunes partitions after it receives the partition list from the metastore. Spark therefore has to process more partition metadata, but the incurred overhead should be limited.

Spark 3.3.0 job fails to launch executors because podNamePrefix is invalid

If you get the following error when your driver attempts to launch an executor:

ERROR ExecutorPodsSnapshotsStoreImpl: Going to stop due to IllegalArgumentException
java.lang.IllegalArgumentException: 'reader-v2-9477e9c1b35e400384183e084-d35b9283c983aacf' in spark.kubernetes.executor.podNamePrefix is invalid. must conform https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names and the value length <= 47
at org.apache.spark.internal.config.TypedConfigBuilder.$anonfun$checkValue$1(ConfigBuilder.scala:108)
at org.apache.spark.internal.config.TypedConfigBuilder.$anonfun$transform$1(ConfigBuilder.scala:101)
at scala.Option.map(Option.scala:230)
...

In Spark 3.3.0, additional validation is performed on the executor podNamePrefix, with a maximum length of 47 characters. This maximum is incorrect, and the limit will be increased to 237 characters in Spark 3.3.1.

In the meantime, you can work around the issue by setting the Spark application name explicitly in your code and making sure it has fewer than 30 characters (the executor podNamePrefix is ${spark-app-name}-${unique-id}). For PySpark you can do this as follows:

from pyspark.sql import SparkSession

# Keep the application name short so the executor pod name prefix stays under the limit.
spark = SparkSession.builder.appName("pyspark sample").getOrCreate()

Spark job does not have a Spark eventlog file

If your Spark application creates multiple Spark contexts, it will generate multiple event log files. In this case, we do not upload any of the event log files to S3, which is why the command to create a Spark history server will fail. If you are in this situation, go over your code and make sure you use only one Spark context for your entire application.
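
As a minimal sketch (the function names are illustrative), create the SparkSession once and reuse it throughout the application instead of creating a new context per step:

from pyspark.sql import SparkSession

def step_one(spark: SparkSession) -> None:
    spark.range(10).show()

def step_two(spark: SparkSession) -> None:
    print(spark.range(20).count())

if __name__ == "__main__":
    # One SparkSession, and therefore one SparkContext, for the whole application.
    spark = SparkSession.builder.appName("single-context-app").getOrCreate()
    step_one(spark)
    step_two(spark)
    spark.stop()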

Spark job using Iceberg fails due to NoClassDefFoundError UrlConnectionHttpClient

If you get an error like:

File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1055, in table
File "/opt/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
File "/opt/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o102.table.
: java.lang.NoClassDefFoundError: software/amazon/awssdk/http/urlconnection/UrlConnectionHttpClient
at org.apache.iceberg.aws.AwsProperties.applyHttpClientConfigurations(AwsProperties.java:1139)
at software.amazon.awssdk.utils.builder.SdkBuilder.applyMutation(SdkBuilder.java:61)
...

When using Apache Iceberg in your Spark application, you must specify the correct http-client.type property. Since our Spark images only package the Apache HTTP client, this property should be set as follows:

"spark.sql.catalog.<catalog_name>.http-client.type": "apache"
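
For example (a sketch in which my_catalog is a placeholder and the remaining Iceberg catalog settings are omitted), the property can be set through the operator conf:

from conveyor.operators import ConveyorSparkSubmitOperatorV2

ConveyorSparkSubmitOperatorV2(
    ...,
    conf={
        # Use the Apache HTTP client that is packaged in the Conveyor Spark image.
        "spark.sql.catalog.my_catalog.http-client.type": "apache",
    },
)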

For more information on how to configure your Spark application to use Apache Iceberg, please refer to our guide.

Create Spark job fails due to: spark.kubernetes.file.upload.path not specified

The root cause of this exception is that the application parameter of the SparkSubmitOperator is specified as an absolute path instead of as a URI. If you are using PySpark, the application parameter should not be:

/opt/spark/work-dir/src/jobs/app_spark.py

but rather it should be:

local:///opt/spark/work-dir/src/jobs/app_spark.py

so that Spark knows the Spark application is packaged in the Docker image.
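
As a minimal sketch (the task_id is a placeholder), the operator call would then look like:

from conveyor.operators import ConveyorSparkSubmitOperatorV2

ConveyorSparkSubmitOperatorV2(
    task_id="spark-job",
    # The local:// scheme tells Spark the application file is packaged in the Docker image.
    application="local:///opt/spark/work-dir/src/jobs/app_spark.py",
)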

note

As of Conveyor 1.15.3, we automatically convert the absolute path to a URI if you forget to do so, so this error should no longer occur.