Crosvm crashes after EGL context switch failure
A week ago we tried to merge !889 (merged), which enables persistent mapping support for llvmpipe, and found that it makes all guest virgl CI tests fail because crosvm crashes on the host with the "Couldn't find current GLX or EGL context" error coming from libepoxy. In fact, we at Collabora have hit this libepoxy crash in the past, but it was difficult to reproduce.
After taking a closer look, I found that EGL context switching fails in virglrenderer and we don't handle this error at all (assuming it will never happen). As a result, no EGL context is current after the failed ctx-switch, which trips an assertion in libepoxy. I found an easy way to reproduce this problem locally: start Xorg in the guest and then kill it immediately. The backtrace of the crash:
#0 0x00007ffff7cb1c4c in __pthread_kill_implementation () from /lib64/libc.so.6
#1 0x00007ffff7c619c6 in raise () from /lib64/libc.so.6
#2 0x00007ffff7c4b7f4 in abort () from /lib64/libc.so.6
#3 0x00007ffff7c4b71b in __assert_fail_base.cold () from /lib64/libc.so.6
#4 0x00007ffff7c5a576 in __assert_fail () from /lib64/libc.so.6
#5 0x00007ffff795f767 in epoxy_get_proc_address (name=0x7ffff797b8c5 <entrypoint_strings+60005> "glUnmapBuffer") at ../src/dispatch_common.c:878
#6 0x00007ffff7950f2a in epoxy_glUnmapBuffer_resolver () at src/gl_generated_dispatch.c:108352
#7 epoxy_glUnmapBuffer_global_rewrite_ptr (target=34962) at src/gl_generated_dispatch.c:51774
#8 0x00007ffff7b9b3fd in vrend_renderer_resource_unmap (pres=0x7ffebd16dcd0) at ../src/vrend_renderer.c:12209
#9 0x00007ffff7b7281e in virgl_renderer_resource_unmap (res_handle=8) at ../src/virglrenderer.c:1060
#10 0x000055555672e437 in rutabaga_gfx::virgl_renderer::unmap_func (resource_id=8) at rutabaga_gfx/src/virgl_renderer.rs:262
#11 0x00005555568bef1e in base::external_mapping::{impl#3}::drop (self=0x555557260fb0) at base/src/external_mapping.rs:87
#12 0x000055555663ba7b in core::ptr::drop_in_place<base::external_mapping::ExternalMapping> ()
at /builddir/build/BUILD/rustc-1.62.1-src/library/core/src/ptr/mod.rs:486
#13 0x00005555567cb2dd in core::ptr::drop_in_place<alloc::boxed::Box<dyn base::sys::unix::mmap::MappedRegion, alloc::alloc::Global>> ()
at /builddir/build/BUILD/rustc-1.62.1-src/library/core/src/ptr/mod.rs:486
#14 0x00005555558ca775 in core::ptr::drop_in_place<core::result::Result<alloc::boxed::Box<dyn base::sys::unix::mmap::MappedRegion, alloc::alloc::Global>, base::errno::Error>> () at /builddir/build/BUILD/rustc-1.62.1-src/library/core/src/ptr/mod.rs:486
#15 0x000055555584b040 in vm_control::VmMemoryRequest::execute<hypervisor::kvm::KvmVm> (self=..., vm=0x7fffffff7f48, sys_allocator=0x7fffffff7f80,
map_request=..., gralloc=0x7fffffff8940) at /home/dima/project/vm/chrome/platform/crosvm2/vm_control/src/lib.rs:417
#16 0x00005555558f3190 in crosvm::platform::run_control<hypervisor::kvm::KvmVm, hypervisor::kvm::KvmVcpu> (linux=..., sys_allocator=..., cfg=...,
control_server_socket=..., control_tubes=..., balloon_host_tube=..., disk_host_tubes=..., usb_control_tube=..., vm_evt_rdtube=...,
vm_evt_wrtube=..., sigchld_fd=..., map_request=..., gralloc=..., vcpu_ids=..., iommu_host_tube=...) at src/linux/mod.rs:2021
#17 0x00005555558eba48 in crosvm::platform::run_vm<hypervisor::kvm::KvmVcpu, hypervisor::kvm::KvmVm> (cfg=..., components=..., vm=...,
irq_chip=..., ioapic_host_tube=...) at src/linux/mod.rs:1476
#18 0x000055555583e112 in crosvm::platform::run_kvm (cfg=..., components=..., guest_mem=...) at src/linux/mod.rs:1039
#19 0x000055555583e939 in crosvm::platform::run_config (cfg=...) at src/linux/mod.rs:1059
#20 0x00005555557d3724 in crosvm::run_vm<fn(&mut env_logger::fmt::Formatter, &log::Record) -> core::result::Result<(), std::io::error::Error>> (
args=..., log_config=...) at src/main.rs:2520
#21 0x00005555557c3271 in crosvm::crosvm_main () at src/main.rs:3206
#22 0x00005555557c3da9 in crosvm::main () at src/main.rs:3254
The problem lies in crosvm, which wants to unmap a blob resource from the main crosvm thread while the EGL context is held by the virtio_gpu thread; hence EGL context switching fails for the main crosvm thread.
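To illustrate the failure mode, here is a minimal toy model in Rust. It is not crosvm or virglrenderer code: `ToyContext`, `SwitchError`, and `unmap_from_other_thread` are hypothetical names made up for this sketch. It only mimics the EGL rule that a context cannot be made current on one thread while it is still current on another, and it shows the unmap path propagating the switch error instead of assuming success.

```rust
use std::sync::{Arc, Mutex};
use std::thread::{self, ThreadId};

/// Toy stand-in for an EGL context: it may be current on at most one thread.
/// (Hypothetical type, used only for illustration.)
#[derive(Default)]
pub struct ToyContext {
    owner: Mutex<Option<ThreadId>>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SwitchError {
    /// The context is already current on a different thread.
    BusyOnOtherThread,
}

impl ToyContext {
    /// Models a context switch: succeeds only if the context is free
    /// or already current on the calling thread.
    pub fn make_current(&self) -> Result<(), SwitchError> {
        let me = thread::current().id();
        let mut owner = self.owner.lock().unwrap();
        let current = *owner;
        match current {
            Some(id) if id != me => Err(SwitchError::BusyOnOtherThread),
            _ => {
                *owner = Some(me);
                Ok(())
            }
        }
    }

    /// Models a GL call such as glUnmapBuffer, which needs a current context.
    /// The switch error is propagated to the caller instead of being ignored.
    pub fn unmap(&self) -> Result<(), SwitchError> {
        self.make_current()?;
        Ok(())
    }
}

/// Returns what the calling ("main") thread sees when it tries to unmap
/// while a separate "virtio_gpu" thread still holds the context.
pub fn unmap_from_other_thread() -> Result<(), SwitchError> {
    let ctx = Arc::new(ToyContext::default());
    let gpu_ctx = Arc::clone(&ctx);
    // The "virtio_gpu" thread makes the context current and never releases it.
    thread::spawn(move || gpu_ctx.make_current())
        .join()
        .unwrap()
        .unwrap();
    ctx.unmap()
}
```

In this model, calling `unmap_from_other_thread()` yields `Err(SwitchError::BusyOnOtherThread)`, which corresponds to the unhandled ctx-switch failure that precedes the libepoxy assertion in the backtrace above.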
Another potential problem here is that crosvm wants to call virglrenderer from two independent threads and virglrenderer isn't thread-safe.
We confirmed that fixing EGL context switches makes the CI tests pass by adding a hack to virglrenderer that makes it skip the offending unmaps on ctx-switch failure.
before: https://gitlab.freedesktop.org/virgl/virglrenderer/-/pipelines/671893
after: https://gitlab.freedesktop.org/digetx/virglrenderer/-/pipelines/673084
This is really a crosvm bug, but we don't have a public issue tracker for crosvm (do we?), so I'm proposing to discuss it here.
A potential solution could be to stop using virglrenderer outside of the virtio_gpu thread, but it's unclear how to achieve this in crosvm.
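One possible shape for such a solution, sketched in plain Rust with `std::sync::mpsc`, is to have other threads send requests to the single thread that owns the renderer rather than calling into it directly. This is only an assumption about how it could look, not a description of crosvm's architecture: `GpuRequest`, `spawn_gpu_thread`, and `unmap_via_gpu_thread` are hypothetical names, and the actual renderer call is replaced by a placeholder.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle, ThreadId};

/// Requests that other threads may send to the renderer thread (hypothetical).
pub enum GpuRequest {
    /// Unmap a resource and report which thread performed the call.
    Unmap { resource_id: u32, reply: Sender<ThreadId> },
    /// Ask the renderer thread to exit its loop.
    Shutdown,
}

/// Spawns the single thread that is allowed to touch the renderer.
/// The thread returns the list of resource ids it unmapped.
pub fn spawn_gpu_thread() -> (Sender<GpuRequest>, JoinHandle<Vec<u32>>) {
    let (tx, rx) = mpsc::channel::<GpuRequest>();
    let handle = thread::spawn(move || {
        let mut unmapped = Vec::new();
        for req in rx {
            match req {
                GpuRequest::Unmap { resource_id, reply } => {
                    // Placeholder: a real implementation would call the
                    // renderer's unmap here, on the thread holding the context.
                    unmapped.push(resource_id);
                    let _ = reply.send(thread::current().id());
                }
                GpuRequest::Shutdown => break,
            }
        }
        unmapped
    });
    (tx, handle)
}

/// Called from any thread: forwards the unmap to the renderer thread and
/// waits for it to complete. Returns the id of the thread that did the work,
/// or None if the renderer thread is gone.
pub fn unmap_via_gpu_thread(tx: &Sender<GpuRequest>, resource_id: u32) -> Option<ThreadId> {
    let (reply, done) = mpsc::channel();
    tx.send(GpuRequest::Unmap { resource_id, reply }).ok()?;
    done.recv().ok()
}
```

With this pattern the main thread would call `unmap_via_gpu_thread(&tx, 8)` from its mapping drop path and block until the renderer thread has finished, so all renderer calls stay on one thread. The cost is a synchronous round trip per unmap, and whether that fits crosvm's drop semantics is an open question.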
@Fahien @gerddie @ryanneph @zzyiwei @olv @chadversary @gurchetansingh