util: GPU tracepoints
This MR adds a tracepoint mechanism loosely inspired by the kernel's tracepoint mechanism, with a few notable differences:
- GPU timestamps! A driver provided callback is used to emit timestamps to a buffer. At a later point in time (when stalling to wait for the GPU is not required), the timestamps are re-united with the trace payload. This makes the trace mechanism suitable for profiling.
- Instead of a systemwide trace ringbuffer, buffering of un-retired tracepoints is split into two stages. Traces are emitted to a
u_trace
instance, and at a later time flushed to au_trace_context
instance. This avoids the requirement that commandstream containing tracepoints is emitted in the same order as it is generated. - Unlike kernel tracepoints,
u_trace
tracepoints are defined in py from which header and src files are generated. Since we already have a build dependency on python+mako, this gives more flexibility than clunky preprocessor macro magic.
The mechanism for capturing timestamps on the GPU is based on freedreno's fd_log()
mechanism. But unlike fd_log()
the trace payload is binary. The intention is to (waves hands) add a perfetto backend as an alternative to the current printf backend, to ship this data out to an external tool (AGI?) so data can be visualized in a nice timeline view like:
Still a bit WIP. I haven't finished converting all the fd_log()
tracepoints and removing fd_log()
. There are a few rough edges. And notably for tiler GPUs there isn't yet support for traces in cmdstream that is executed per-tile (ie. so we can use this to instrument tile passes, but not individual draws within a tile-pass). I'm not entirely sure how we'd want to visualize that in the first place. For per-draw instrumentation on a tiler, we probably want something that accumulates elapsed time rather than absolute timestamps (and at that point, it could simply re-use elapsed-time queries).
Another open question is how we could make this something that could be re-used by vk drivers. Perhaps with a bit of an internal shim, this could be (waves hands again) built twice, once using gallium APIs and a second time using vk APIs?
Currently, with the freedreno tracepoints that I've added so far, GALLIUM_GPU_TRACE=1
gets you output like:
+----- NS -----+ +-- Δ --+ +----- MSG -----
0000003067732720 +0: flush_batch: 0x17119720: cleared=5, gmem_reason=3, num_draws=2
0000003067732720 +0: framebuffer: 800x600x0@1, nr_cbufs: 1
0000003067732720 +0: surface: 800x600@0, fmt=b8g8r8a8_unorm
0000003067732720 +0: surface: 800x600@0, fmt=z24x8_unorm
0000003067732720 +0: render_gmem: 1x4 bins of 800x160
0000003067732720 +0: start_state_restore
0000003067734904 +2184: end_state_restore
0000003067735164 +260: start_prologue
0000003067755496 +20332: end_prologue
0000003067757732 +2236: start_binning_ib
0000003067848836 +91104: end_binning_ib
0000003067854244 +5408: start_vsc_overflow_test
0000003067862876 +8632: end_vsc_overflow_test
0000003067866204 +3328: start_tile: bin_h=160, yoff=0, bin_w=800, xoff=0
0000003067867660 +1456: start_clear_restore: fast_cleared: 0x5
0000003067875148 +7488: end_clear_restore
0000003067875148 +0: start_draw_ib
0000003067898652 +23504: end_draw_ib
0000003067898860 +208: start_resolve
0000003067940200 +41340: end_resolve
0000003067940200 +0: start_tile: bin_h=160, yoff=160, bin_w=800, xoff=0
0000003067942436 +2236: start_clear_restore: fast_cleared: 0x5
0000003067948884 +6448: end_clear_restore
0000003067948884 +0: start_draw_ib
0000003067977328 +28444: end_draw_ib
0000003067977588 +260: start_resolve
0000003068020332 +42744: end_resolve
0000003068020332 +0: start_tile: bin_h=160, yoff=320, bin_w=800, xoff=0
0000003068022568 +2236: start_clear_restore: fast_cleared: 0x5
0000003068029068 +6500: end_clear_restore
0000003068029068 +0: start_draw_ib
0000003068085436 +56368: end_draw_ib
0000003068085696 +260: start_resolve
0000003068128700 +43004: end_resolve
0000003068128700 +0: start_tile: bin_h=120, yoff=480, bin_w=800, xoff=0
0000003068130832 +2132: start_clear_restore: fast_cleared: 0x5
0000003068138684 +7852: end_clear_restore
0000003068138684 +0: start_draw_ib
0000003068184808 +46124: end_draw_ib
0000003068185068 +260: start_resolve
0000003068217152 +32084: end_resolve
ELAPSED: 484432 ns
END OF FRAME 362
(for a single renderpass)