Policy: not recovering from memory allocation failures

In libweston, should every `malloc()` and similar call be checked and some recovery path exist if it fails, or should we just `abort()` if any memory allocation fails?

This is a question me and @daniels have been debating perhaps every year. I have been thinking it's better to try to handle such failures, while Daniel sees the basically untestable failure paths as a burden of developing and carrying code that is likely broken.

My opinion has started to shift. Here's a summary of the two sides, mostly to convince myself that recovering from allocation failures is futile in the context of libweston.

Disadvantages of trying to recover from memory allocation failures

Trying to recover from every allocation failure everywhere creates a huge number of code paths that cannot realistically ever be tested. Hence we cannot be sure they work, or what the effect will actually be.
These failure paths clutter the code and add a lot of divergent code paths.
Some compositor steady-state operations routinely do allocations, like allocating internal state structures for a simple screen update, or even log messages. If these allocations fail, then even in the best case the compositor appears to freeze with no explanation in the logs.
Wayland client request processing often allocates. If that fails, Weston could disconnect the client and continue. This probably does not fix the out-of-memory situation, so Weston ends up disconnecting client after client until almost none is left. In that case, what is the point of keeping Weston limping along? The client state is lost anyway. Also the client chosen to be disconnected is arbitrary and likely not the client that caused Weston to allocate lots of memory.
Memory overcommit in the kernel, which is usually on, makes allocation failures never happen. The process just segfaults when a paging request suddenly cannot be serviced. So why bother.

Advantages of recovering from memory allocation failures

Weston can keep on running as much as it can even if bad things happen in the system. When memory becomes available again, Weston automatically resumes normal operation.
malloc() can fail even if the system as a whole uses overcommit: Weston might be running in a container or cgroup with deliberately limited resources.
Being able to operate "reliably" in memory constrained environments might be a relatively unique selling point for Weston among compositors or compositor libraries.
Weston allocates memory on behalf of Wayland clients. A broken or malicious client can make Weston allocate practically unlimited amounts of memory. If memory allocation can be traced back to a client, disconnecting the client on allocation failure might help, with luck. (Of course, this doesn't actually work without tracking allocations per client and targeting the client causing too much allocations, preferably already before any allocation actually fails.)
The recovery code paths may not be reliable now, but if someone wanted to really make libweston "safe" against allocation failures, having the existing recovery paths would lessen the amount of work needed.

To me, the advantages are starting to feel pretty weak, more like excuses for sticking to the habit.

Giving up on recovering from allocation failures

We can start using the x*alloc() helpers everywhere. These functions either return memory, or abort the process. This means that we will be able to garbage-collect a huge number of failure paths that simply cannot happen anymore. Many functions no longer need failure return codes. This will also improve our code branch test coverage percentage.

It becomes much easier to reason about libweston behavior. Failures happen for more interesting reasons than simple out-of-memory.

There will likely still be cases where it is justified to recover from memory allocation failure, especially when allocating special memory like device memory (graphics, KMS dumb buffers). Also if a single protocol request could allocate a massive chunk of memory, that's probably best checked still - we know which client to blame.

If you don't want libweston to give up on pretending it might recover from allocation failures, please comment here and explain why.

Otherwise, I propose to record this design decision in libweston documentation, which means changes to how merge requests are reviewed, and gives the green light for the gradual migration to abort-on-failure memory allocation routines.

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information

Admin message