
nine: Reduce virtual memory usage of textures

Axel Davy requested to merge axeldavy/mesa:nine_memfd into master

One of the main issues remaining to be fixed for Gallium Nine is out-of-memory errors in some 32-bit games. https://github.com/iXit/wine-nine-standalone/issues/24 https://github.com/iXit/wine-nine-standalone/issues/60 https://github.com/iXit/Mesa-3D/issues/309 https://github.com/iXit/Mesa-3D/issues/284

Indeed, the CPU address space available to 32-bit apps on Windows is tight: 4GB with the LARGEADDRESSAWARE flag set, 2GB otherwise. This space is used for allocations, but also for other things like GPU buffer mappings. The amount used is called virtual memory.

When the virtual memory used gets too close to the limit, the app can crash with an out of memory message.

Texture data is one of the biggest users of virtual memory. Most games seem to use MANAGED textures (RAM backing + GPU copy) for most of their textures. Alternatively, they use DEFAULT textures (GPU only) and fill them via intermediate SYSTEMMEM textures (RAM only). Out-of-memory issues seem most common when apps use MANAGED textures.

D3D9 textures can consume virtual memory in three ways:

  1. MANAGED textures, which keep a permanent RAM copy
  2. SYSTEMMEM textures, which are entirely RAM-backed
  3. DEFAULT textures (or the GPU copy of MANAGED textures) being mapped.

Nine cannot do much about 3): it is up to the Gallium driver to really unmap textures when Nine asks for it, at least on 32-bit. radeonsi does that.

It's not clear whether Windows does anything special for 1) and 2). However, there is clear indication that some effort has been made on 3) to really unmap when it makes sense.

My understanding is that other implementations (wined3d, DXVK) reduce the usage of 1) by deleting the RAM copy once the GPU version is uploaded (DXVK's behaviour is controlled by the evictManagedOnUnlock config parameter which is enabled on a per-game basis).

The obvious issue with that approach arises when the texture is read by the application some time later: in that case, the RAM backing has to be recreated from the GPU buffer.

And apps DO that. For example, I found that in Mass Effect 2 with high-resolution texture mods (one of the crash cases fixed by this patch series), when the character gets close to an object, a high-resolution texture is created and filled to replace the low-resolution one. The high-resolution texture simply has more levels, and the game seems to optimize filling it by retrieving the small-resolution levels from the original low-resolution texture. In other words, during gameplay the game will randomly read MANAGED textures, and it expects that to be fast, since the data is supposed to be in RAM...

Thus, instead of taking the RAM-copy eviction approach, this patchset proposes a different one: storing the data in memfd files and releasing the virtual memory attached to an allocation until it is needed. memfd is a Linux kernel feature that allows allocating a file stored in RAM and visible only to the app. We can map and unmap portions of the file as needed. When a portion is mapped, it takes virtual memory space; when it is not, it doesn't. The file is stored in RAM, so access speed is the same as normal RAM.

Basically, instead of using malloc(), we create a memfd file and map it. When the data no longer seems to be accessed, we unmap the memfd file; if the data is needed again, we map it again. This trick allows allocating more than 4GB in a 32-bit app while reducing virtual memory usage.
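The map/fill/unmap/remap cycle can be sketched as a standalone fragment. This is an illustration, not code from the series; it assumes Linux with glibc >= 2.27 for the memfd_create() wrapper, and the helper name is invented:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fill a memfd-backed region, unmap it (releasing virtual address space
 * while the data stays in RAM), remap it, and check the data survived.
 * Returns 1 on success, 0 on failure. */
static int memfd_roundtrip(size_t size)
{
    int fd = memfd_create("nine-demo", 0); /* RAM-backed, app-private file */
    if (fd < 0 || ftruncate(fd, (off_t)size) != 0)
        return 0;

    unsigned char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 0; }
    memset(p, 0xAB, size);
    munmap(p, size); /* data persists in the file; address space is freed */

    /* Remap later: contents are intact, access is RAM-speed. */
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 0; }
    int ok = (p[0] == 0xAB && p[size - 1] == 0xAB);
    munmap(p, size);
    close(fd);
    return ok;
}
```

Between the two mappings, the region costs no virtual address space in the process, which is the whole point on 32-bit.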

The advantage of this approach over the RAM eviction one is that reloading the data is much faster and doesn't block the GPU.

This approach however adds some overhead compared to doing nothing: when accessing mapped content the first time, pages are allocated by the system. This has a lot of overhead (several times the time to memset the area). Releasing these pages (when unmapping) has overhead too, though significantly less.

This overhead, however, is much smaller than the overhead of downloading the GPU content. In addition, we significantly reduce the overhead spent in Gallium Nine for new allocations by using the fact that new contents of the file read as zero. By not calling memset in Gallium Nine, the page-allocation overhead happens client side, thus outside the d3d mutex. This should give a performance boost for multithreaded applications. As malloc has this overhead too (at least for allocations large enough to use mmap internally), allocating ends up faster than with the standard allocation path.
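The zero-fill property relied on here can be checked directly: pages of a freshly grown memfd file read back as zeros, and the page-allocation cost is paid on first touch. A minimal check (again assuming Linux and glibc >= 2.27; the helper name is hypothetical):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if every byte of a brand-new 'size'-byte memfd mapping reads
 * as zero, 0 if a non-zero byte is found, -1 on setup failure. */
static int fresh_memfd_is_zeroed(size_t size)
{
    int fd = memfd_create("zero-check", 0);
    if (fd < 0 || ftruncate(fd, (off_t)size) != 0)
        return -1;
    unsigned char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    int zeroed = 1;
    for (size_t i = 0; i < size; i++) /* first touch faults pages in */
        if (p[i] != 0) { zeroed = 0; break; }

    munmap(p, size);
    close(fd);
    return zeroed;
}
```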

By far, page allocation/deallocation is the biggest overhead involved in this code. Huge pages reduce it significantly, but they are too complex for the user to configure (and have memory-management downsides of their own). The memset trick moves most of the overhead outside Nine anyway.

To prevent useless unmappings quickly followed by remappings, we do not unmap an allocation as soon as it is no longer locked for access. Indeed, an allocation is likely to be accessed several times in a row, for example first to fill it, then to upload it. We keep everything mapped until a threshold of mapped memory is reached; then we use hints to prioritize which regions to unmap first. Thus virtual memory usage is only reduced once the threshold is reached.
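The deferral policy could be sketched as follows. This is purely illustrative: the struct, function names, and eviction order (least-recently-used as a stand-in for the hints mentioned above) are invented, and the real series tracks more state:

```c
#include <stddef.h>

#define MAX_ALLOCS 8

struct alloc {
    size_t size;
    int mapped;        /* still occupying virtual address space? */
    unsigned last_use; /* logical timestamp of the last lock/unlock */
};

static struct alloc allocs[MAX_ALLOCS];
static size_t mapped_total;
static unsigned clock_tick;

/* Called on lock/unlock: keep (or bring back) the mapping, record use. */
static void touch(int i)
{
    allocs[i].last_use = ++clock_tick;
    if (!allocs[i].mapped) { /* remap on demand */
        allocs[i].mapped = 1;
        mapped_total += allocs[i].size;
    }
}

/* Unmap the coldest allocations until we are back under the threshold. */
static void enforce_threshold(size_t threshold)
{
    while (mapped_total > threshold) {
        int victim = -1;
        for (int i = 0; i < MAX_ALLOCS; i++)
            if (allocs[i].mapped &&
                (victim < 0 || allocs[i].last_use < allocs[victim].last_use))
                victim = i;
        if (victim < 0)
            break;
        allocs[victim].mapped = 0; /* a real munmap() would go here */
        mapped_total -= allocs[victim].size;
    }
}
```

Note that nothing is unmapped until the threshold is exceeded, matching the behavior described above: virtual memory usage only shrinks under pressure.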

Multiple memfd files are used, each 100MB in size, so memory usage (but not virtual memory usage) grows in increments of 100MB. On anything other than 32-bit x86, we use the standard malloc.

Finally, for simplicity, we do not pack multiple allocations into a single page-aligned region: each allocation gets its own page-aligned region inside a memfd file. Allocations smaller than a page (4KB on x86) go through malloc. As texture sizes are usually multiples of powers of two, allocations above the page size are typically whole multiples of the page size, so no space is wasted in practice.
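The dispatch rule amounts to the following (the helper name is hypothetical and the page size is the 4KB x86 value assumed in the text):

```c
#include <stddef.h>

#define PAGE_SIZE ((size_t)4096) /* x86 page size assumed above */

/* Returns 0 when the allocation should go through malloc (sub-page),
 * otherwise the number of bytes its dedicated memfd region occupies,
 * i.e. the request rounded up to a whole number of pages. */
static size_t memfd_region_bytes(size_t request)
{
    if (request < PAGE_SIZE)
        return 0; /* malloc path */
    return (request + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}
```

For instance, a 512x512 RGBA level is 512 * 512 * 4 = 1048576 bytes, already a multiple of 4096, so its region wastes nothing.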

Of course we have problems if there's not enough memory to map the memfd file. But the problem is the same for the RAM eviction approach.

Naturally, on 64-bit builds we do not use memfd.
