Improve Spark efficiency
Recently, we released a new feature that analyzes the efficiency of your Spark jobs. This guide helps you improve the efficiency of an existing, inefficient Spark application. We use efficiency to measure wasted resources, and thus unnecessary cost, for a given output: an efficient job uses the available resources as fully as possible. This is different from optimizing the performance of your job, which is about making your job faster without considering the extra cost.
Approach
- Make sure you are using a recent version of Spark. We recommend Spark 3.2.0 or later, because the releases between 3.0.0 and 3.2.0 brought many improvements to Adaptive Query Execution (AQE), and AQE is enabled by default from 3.2.0 onwards (see the first sketch after this list). All Conveyor-supported Spark versions are described on our Spark image page.
- Use the metrics page in Conveyor to check whether the requested resources match the resources your job actually uses. You can find the metrics of your Spark job on the details page of your application. These metrics indicate whether you are over- or under-provisioning your job. If so, adjust the instance type of your driver or executors to better match what your job needs (see the second sketch after this list).
- Take a look at the Spark UI to analyze your job in detail; see the next section for more information.
- Make changes to your code and/or the Spark configuration to improve the efficiency of your job.
- Run your updated job and recalculate the efficiency to see the impact of your changes.
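
If you are on a Spark version older than 3.2.0 and cannot upgrade yet, you can turn on AQE explicitly. The following is a minimal PySpark sketch, not Conveyor-specific code; the application name is a placeholder, and all three configuration keys are standard Spark settings:

```python
from pyspark.sql import SparkSession

# Minimal sketch: explicitly enabling Adaptive Query Execution (AQE).
# AQE is on by default since Spark 3.2.0; on older 3.x versions these
# settings must be set explicitly.
spark = (
    SparkSession.builder
    .appName("my-spark-job")  # placeholder name
    # Re-optimize query plans at runtime using observed statistics.
    .config("spark.sql.adaptive.enabled", "true")
    # Merge small shuffle partitions to reduce per-task overhead.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split oversized partitions to mitigate skewed joins.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```

The same keys can also be passed as `--conf` flags to `spark-submit` instead of being set in code.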
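In Conveyor, driver and executor instance types are set in your project configuration. At the Spark level, the equivalent knobs are the executor resource settings. The sketch below is illustrative only: the values are hypothetical, and you should derive them from the requested-versus-used gap shown on the metrics page.

```python
from pyspark.sql import SparkSession

# Hypothetical sketch: right-sizing executors after comparing requested
# and used resources on the metrics page. All values below are examples.
spark = (
    SparkSession.builder
    .appName("right-sized-job")  # placeholder name
    .config("spark.executor.memory", "4g")    # e.g. lowered from 8g if the job peaks around 3g
    .config("spark.executor.cores", "2")
    .config("spark.executor.instances", "4")  # fixed count; omit when using dynamic allocation
    .getOrCreate()
)
```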
You can trigger the efficiency calculation of your job by clicking the Calculate button on the application details page. By default, we calculate the efficiency of your job once per day.
How to use the Spark UI for your job
The Spark UI provides a lot of information about your job. A good place to start is the Dataflint tab, which provides an easier-to-understand overview.
Start by taking a look at the Alerts section, which identifies several common issues for Spark jobs.
If there is an alert, read the description and follow the advice to improve the efficiency of your job.
To read more about Dataflint, see the Dataflint documentation.
If Dataflint does not provide any useful information, look through the Spark history UI for potential issues.