[Regression] JavaFX unbounded VRAM+RAM usage
Hello,
I'm reporting a fairly serious regression that seems to occur only on AMD Radeon cards when running JavaFX applications. A very similar bug happened back in 2016, as discussed here, here and also here. That issue was apparently resolved after upgrading to a certain Mesa version, but it is not clear from the threads which version introduced the regression or which version fixed it.
Background:
JavaFX is a cross-platform UI framework, similar to Java Swing, but with an OpenGL backend. The Linux port uses JNI to call the corresponding GLX, X11 and OpenGL APIs. The general flow is as follows:
- Create and obtain an X11 window handle from Gtk3 (or conditionally Gtk2)
- GL-related initialisation: `glXChooseFBConfig`, passing the first FB config to `glXGetVisualFromFBConfig` and then finally `glXCreateNewContext` (details in the apitrace attachment)
- The JavaFX runtime switches between windows and some dummy drawables using `glXMakeCurrent` for rendering and resource management
Expected behavior:
JavaFX apps run without problems (no memory leaks, acceptable performance).
Actual behavior:
The JavaFX app leaks memory on a per-frame basis, leading to VRAM exhaustion (the leak stops when no drawing is required). A looping loading animation fills the 16GB of VRAM on an AMD Radeon VII in under 20 seconds, depending on window size. Once VRAM is exhausted it starts paging into system RAM at a slightly slower rate, but still consumes 32GB of RAM within 60 seconds, after which the machine hangs.
If the `LIBGL_DRI3_DISABLE` flag is enabled, the app works correctly but leaks 4~10k of memory as indicated in `xrestop`. The X server becomes increasingly sluggish, consuming ~100% CPU while producing 10~20 fps, and the desktop eventually becomes unusable after 15 minutes or so. This is basically the same behaviour as seen here.
The exact same behaviour is also observed on a separate system with an AMD RX 580 4GB.
Tested configurations:
With memory leak:
- Fedora 31:
  - Renderer: AMD Radeon VII (VEGA20, DRM 3.33.0, 5.3.16-300.fc31.x86_64+debug, LLVM 9.0.0)
  - Version: 4.5 (Compatibility Profile) Mesa 19.2.8
- Ubuntu 18.04.3 LTS:
  - Renderer: Radeon RX 580 Series (POLARIS10, DRM 3.27.0, 5.0.0-37-generic, LLVM 9.0.0)
  - Version: 4.5 (Compatibility Profile) Mesa 19.2.1

Results are the same with both the `modesetting` and `amdgpu` drivers.

Without memory leak:
- Ubuntu 18.04.3 LTS:
  - Renderer: Radeon RX 580 Series (POLARIS10, DRM 3.27.0, 5.0.0-37-generic, LLVM 8.0.0)
  - Version: 4.5 (Compatibility Profile) Mesa 19.1.4
  - Version: 4.5 (Compatibility Profile) Mesa 19.0.8
It seems that the memory leak was potentially reintroduced in 19.2.0.
The bug is not reproducible on `i915`, `nouveau`, or NVIDIA's proprietary driver.
How to reproduce:
On a machine with an AMD graphics card (ideally VEGA20- or POLARIS-based, as those are the ones I have confirmed to have the bug), ensure Java and JavaFX are installed:
- Fedora: `sudo dnf install java-1.8.0-openjdk-openjfx java-1.8.0-openjdk`
- Ubuntu: `sudo apt install openjfx openjdk-8-jdk`
Compile and run the attached Leak.java (note that `GALLIUM_HUD` must be set on the `java` process, not on `javac`):
`javac Leak.java && GALLIUM_HUD=fps,requested-VRAM,VRAM-usage java -Dprism.verbose=true Leak`
Observe the increasing VRAM usage in the HUD, or use something like `radeontop`.
Alternatively, run any JavaFX program such as this one and trigger redraws by resizing windows or scrolling content.
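For readers without access to the attachment, a minimal reproducer along the same lines (a hypothetical sketch, not the attached Leak.java: any node that animates continuously should do) could look like:

```java
// Hypothetical minimal reproducer sketch: an indeterminate ProgressIndicator
// loops its animation forever, forcing a redraw on every frame.
import javafx.application.Application;
import javafx.scene.Scene;
import javafx.scene.control.ProgressIndicator;
import javafx.scene.layout.StackPane;
import javafx.stage.Stage;

public class Leak extends Application {
    @Override
    public void start(Stage stage) {
        // Indeterminate mode animates continuously, so on affected drivers
        // VRAM usage should climb for as long as the window is visible.
        ProgressIndicator spinner =
                new ProgressIndicator(ProgressIndicator.INDETERMINATE_PROGRESS);
        stage.setScene(new Scene(new StackPane(spinner), 800, 600));
        stage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}
```

A larger window should make the leak proportionally faster, matching the window-size dependence described above.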
I've taken a few stabs at solving this:
- Based on the existing discussion here, which is also referenced in the JDK mailing lists above, it seems that `glXMakeCurrent` is somehow involved. I commented out that call in JavaFX and recompiled the native libraries. This resolved the leak, but the app then simply crashes with more than one window open.
- I've tried to mirror the exact GLX+GL calls in a sample C++ project (main.cpp) in an attempt to fully reproduce the issue in a self-contained manner, based on the understanding that `glXMakeCurrent` is where the leak occurs. The sample app unfortunately did not cause any memory leak.
Attachments:
I've attached 2 different apitraces:
- dri3.trace: running the app with no flags (`LIBGL_DRI3_DISABLE=0`)
- dri3_disabled.trace: running the app with `LIBGL_DRI3_DISABLE=1`
I'd be grateful if someone could take a quick look at this. I've already spent days on it but can't pinpoint why the leak occurs.