Using Polars in IDEs
We notice more and more people are using Polars as a replacement for Spark in their Python projects. Polars is fast and has a similar API compared to Spark, which makes it an obvious choice for many. However, you might run into memory issues when attempting to process large datasets using Polars within an IDE.
When working in IDEs there are 2 main causes for memory issues:
- the Python kernel is poor at managing memory
- Polars (in non streaming mode) is very eager to take up all the available memory
Both issues are described in more detail below.
Python kernel and memory consumption
When you are using Polars in an IDE, you typically use a Python kernel to execute your notebook code. A known limitation of the Python kernel is that it is poor at managing memory. As a consequence if you are doing a lot of queries with Polars, even on small/medium datasets, it might happen that the memory consumption of the Python kernel goes up until the process gets killed due to OOM.
It is recommended to restart the kernel from time to time to free up memory, as described here.
Polars and memory consumption
If you use Polars in it's currently default (non-streaming mode), it will by default very eagerly take up all the available memory of your machine/container. When Polars is hitting the memory limit of your IDE, it will get slow and will make your IDE unresponsive. If Polars goes even further, the VS Code process might get killed as we limit the amount of memory that vscode and its subprocesses can use. We do this to ensure that the Python process gets killed instead of the full IDE, as this would result in you losing your data.
To identify whether you are running into this issue, you can check the memory consumption of your IDE using multiple tools, like: htop
, free
, etc.
free -m -s 1
This shows you the free memory in megabytes with second resolution.
How to resolve this issue
Ideally, the Conveyor team would like to limit the amount of memory that Polars can use. Unfortunately, Polars does not support this yet. To follow this up, an issue was opened on Github.
If you are running into memory issues when using Polars, there are a couple of ways to work around this problem:
- Limit the size of the datasets that you are using. If you are working with large datasets, try to limit the amount of data that needs to be pulled in memory by Polars. This means only reading 1000 rows instead of the full dataset or limiting the columns to the ones you actually need.
- Use the streaming mode of Polars.
Polars has a streaming mode, which is more memory efficient as it will flush data to disk if it reaches the memory limit instead of going OOM.
The difference is that you work with a LazyFrame instead of a Dataframe in polars.
To accomplish this, you should start by reading your data using:
polars.scan()
instead ofpolars.read()
If none of these options is valid for your use case, you can also use the Conveyor SDK, to launch the Polars job in a separate container, where it can use all the memory available to the container. More information about the Conveyor SDK can be found at here.
If you have any questions or issues, feel free to reach out to the Conveyor team.