lima: support performance counters through GL_AMD_performance_monitor

Erico Nunes requested to merge enunes/mesa:lima-hw-perf-counters into main

This adds support for querying performance counters on Mali Utgard GPUs through the GL_AMD_performance_monitor extension. It borrows heavily from the vc4 implementation, in both userspace and the kernel.

The Mali 4xx exposes performance counters for the pp, gp and l2 cache cores. The performance counter event list and descriptions were taken from ARM's open source "gator" software, available on GitHub: https://github.com/ARM-software/gator/blob/master/daemon/events-Mali-4xx.xml

Support for this feature depends on the corresponding patches being available in the lima kernel driver.

I started working on this to have a better way to evaluate the impact of some of the changes in the ppir compiler optimizations series. This v1 is still somewhat of a work in progress, but it already works.

This can be tested with a recent version of apitrace and its --pframes option. With a simple trace like kmscube.trace and a predictable counter:

$ apitrace replay --pframes 'GL_AMD_performance_monitor:lima_gp_vertices_fetched' kmscube.trace
#	lima_gp_vertices_fetched
frame	0
frame	24
frame	24
frame	24
frame	24
frame	24
frame	24
frame	24
frame	24
frame	24
frame	24
...
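
To compare runs (for example before and after a ppir change), the per-frame output can be reduced to one number per counter. The following is a minimal sketch, not part of this MR: it assumes the output has the whitespace-separated layout shown above (a `#` header line followed by `frame` lines), and the helper names `parse_pframes` and `steady_state_mean` are hypothetical. It skips the first frame by default because that frame reports 0 in the sample output.

```python
from statistics import mean


def parse_pframes(text):
    """Parse `apitrace replay --pframes` output into (counter names, rows).

    Assumes the layout shown in this description: a header line starting
    with '#', then one 'frame' line per frame with one value per counter.
    """
    names, rows = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "#":
            names = fields[1:]
        elif fields[0] == "frame" and len(fields) == len(names) + 1:
            rows.append([int(v) for v in fields[1:]])
        # Anything else (e.g. a truncation marker) is ignored.
    return names, rows


def steady_state_mean(names, rows, skip=1):
    """Average each counter over all frames after the first `skip` frames."""
    kept = rows[skip:]
    return {name: mean(r[i] for r in kept) for i, name in enumerate(names)}


if __name__ == "__main__":
    sample = "#\tlima_gp_vertices_fetched\nframe\t0\nframe\t24\nframe\t24\n"
    names, rows = parse_pframes(sample)
    print(steady_state_mean(names, rows))
```

The same functions handle multi-counter output, since the number of columns is taken from the header line.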

Multiple counters can be specified at the same time (note that only 2 counter sources can be enabled per core in the Mali 4xx):

$ apitrace replay --pframes 'GL_AMD_performance_monitor:lima_gp_vertices_fetched,lima_l2_cache_words_written_all_slaves,lima_pp_program_cache_hit_count' kmscube.trace
#	lima_gp_vertices_fetched	lima_l2_cache_words_written_all_slaves	lima_pp_program_cache_hit_count
frame	0	16320	0
frame	24	30748	150580
frame	24	30848	151072
frame	24	30848	155708
frame	24	30904	153604
frame	24	30904	155548
frame	24	30876	154252
frame	24	30876	157252
frame	24	30876	156900
frame	24	30876	151456
frame	24	30536	152640
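
Because of the limit of 2 counter sources per core, it can help to check a counter selection before starting a replay. This is a rough sketch under two assumptions that are not confirmed by this MR: that the core type can be read from the `lima_<core>_...` prefix used by the counter names above, and that the limit applies per core type. The function names are hypothetical.

```python
from collections import defaultdict

MAX_SOURCES_PER_CORE = 2  # limit stated in this description for the Mali 4xx


def group_by_core(counters):
    """Group counter names by core type, assuming `lima_<core>_<event>` names."""
    groups = defaultdict(list)
    for name in counters:
        parts = name.split("_")
        if len(parts) < 3 or parts[0] != "lima":
            raise ValueError(f"unexpected counter name: {name}")
        groups[parts[1]].append(name)
    return dict(groups)


def over_limit(counters, limit=MAX_SOURCES_PER_CORE):
    """Return the core types whose selection exceeds `limit` counters."""
    return {
        core: names
        for core, names in group_by_core(counters).items()
        if len(names) > limit
    }


if __name__ == "__main__":
    selection = [
        "lima_gp_vertices_fetched",
        "lima_l2_cache_words_written_all_slaves",
        "lima_pp_program_cache_hit_count",
    ]
    # An empty dict means no core type exceeds the limit.
    print(over_limit(selection))
```

The selection used in the example above has one counter per core type, so it stays within the limit.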

I would appreciate some feedback on this in its current state, mainly on these topics:

  • Ensuring that a job never counts events that belong to a different job, and that jobs have finished executing when mesa requests the counter values. I have tested this by running multiple counting instances simultaneously, and by running e.g. Xorg with several graphical applications in the background and then starting profiling. That seems to work, but more input is welcome.
  • (mostly in the kernel) Counters on the l2 cache are a little trickier: it is not clear to me when it is best to start and reset counting. For pp and gp, counting can cover the duration of the task run, and that seems to work. It seems that ARM's streamline/gator stack uses the gator kernel module to call directly into the mali kernel module to start and reset the l2 cache counters.

Kernel side is at: enunes/linux@2a5d90ac
