radv: RADV_DEBUG=hang should provide some way to analyze descriptors and VA ranges
When debugging games in vkd3d-proton, the most common GPU-hang issue we run into is a crash when accessing descriptors. Either the descriptor is bogus, or the memory backing the descriptor has since been freed. RADV_DEBUG=hang with UMR wave dumps is very helpful, but it does not give good information about either of these two cases. Having that information would help track down issues more effectively.
E.g. in a UMR dump, we can see things like:
pgm[1@0x8001cda66000 + 0xf58 ] = 0xf0880708 image_sample_d v[66:68], v[76:83], s[60:67], s[16:19] dmask:0x7 dim:SQ_RSRC_IMG_2D
We can fortunately see the SGPRs holding the image descriptor (s[60:67]), e.g.:
[ 60.. 63] = { 013cf800, caa00080, 801fc01f, 91670fac }
[ 64.. 67] = { 00000000, 00400070, 00000000, 00000000 }
There are two main problems here from a practical debugging POV:
- It is very hard to parse the descriptors by eye. While we could glean the VA from this fairly easily, it would be nice to be able to see disassembled descriptors somehow in the dump.
- It is not possible to determine if the VA is valid. Was it ever valid? Was it freed at some point? These questions come up again and again when attempting to debug these issues. Perhaps dumping information from the BO allocator would be nice as well.
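To illustrate what is being asked for, here is a rough Python sketch of the two steps done by hand today. It assumes the GFX10-style image descriptor layout, where dword 0 holds address bits [39:8] and the low 8 bits of dword 1 hold bits [47:40]; other generations may differ. The `BoRange` record and `classify_va` helper are hypothetical: RADV does not currently dump such an allocation log, it is only what a BO allocator dump could make possible.

```python
from dataclasses import dataclass


def image_descriptor_va(dwords):
    """Return the base VA encoded in the first two dwords of an image descriptor.

    Assumption: GFX10-style layout, base address stored in 256-byte units,
    low 32 bits in dword 0 and the next 8 bits in dword 1[7:0].
    """
    base_256b = ((dwords[1] & 0xFF) << 32) | dwords[0]
    return base_256b << 8


@dataclass
class BoRange:
    """Hypothetical allocator log entry; not something RADV emits today."""
    va: int
    size: int
    freed: bool


def classify_va(va, ranges):
    """Answer "was this VA ever valid, and is it still?" from a BO log."""
    hits = [r for r in ranges if r.va <= va < r.va + r.size]
    if any(not r.freed for r in hits):
        return "live"
    if hits:
        return "freed"
    return "never allocated"


# s[60:67] from the dump above.
desc = [0x013CF800, 0xCAA00080, 0x801FC01F, 0x91670FAC,
        0x00000000, 0x00400070, 0x00000000, 0x00000000]
va = image_descriptor_va(desc)
print(hex(va))
```

With a made-up log such as `[BoRange(0x80013CF00000, 0x100000, freed=True)]`, `classify_va(va, log)` would report `"freed"`, which is exactly the kind of answer that is hard to get from the current hang dump.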
Some other "trivial" problems:
- It would be great to get a full ISA dump of a failed shader, so it's easier to correlate it back to the dumped SPIR-V disassembly. For large shaders, it can be very hard to get the full context.