lima: support performance counters through GL_AMD_performance_monitor
This adds support for querying performance counters on the Mali Utgard through the GL_AMD_performance_monitor extension. This heavily borrows from the vc4 implementation, both userspace and kernel.
The Mali 4xx exposes performance counters for the pp, gp and l2 cache cores. The performance counter event list and descriptions were taken from ARM's open source "gator" software, available on GitHub: https://github.com/ARM-software/gator/blob/master/daemon/events-Mali-4xx.xml
Support for this feature depends on availability of corresponding patches in the lima kernel driver.
I started working on this to have a better way to evaluate the impact of some of the changes in the ppir compiler optimizations series. This v1 is still somewhat of a work in progress, but it already works.
This can be tested with a recent version of apitrace and --pframes. With a simple trace like kmscube.trace and a predictable counter:
    $ apitrace replay --pframes 'GL_AMD_performance_monitor:lima_gp_vertices_fetched' kmscube.trace
    # lima_gp_vertices_fetched
    frame 0
    frame 24
    frame 24
    frame 24
    frame 24
    frame 24
    frame 24
    frame 24
    frame 24
    frame 24
    frame 24
    ...
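For comparing runs (e.g. before and after a compiler change), it can help to load this output into a script. The following is only a minimal sketch, assuming the output consists of a "#" header line naming the counters followed by one whitespace-separated "frame" row per frame; parse_pframes is a hypothetical helper and not part of apitrace.

```python
def parse_pframes(text):
    """Parse a '# name ...' header plus 'frame v1 v2 ...' rows.

    Returns a dict mapping each counter name to its list of per-frame
    values. The layout is an assumption based on the sample output.
    """
    names = []
    columns = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            # Header line: everything after '#' is a counter name.
            names = stripped[1:].split()
            columns = {name: [] for name in names}
            continue
        fields = stripped.split()
        # Data rows start with the literal label 'frame'; skip anything
        # else (such as a trailing '...') and rows with missing columns.
        if fields[0] == "frame" and len(fields) - 1 == len(names):
            for name, value in zip(names, fields[1:]):
                columns[name].append(int(value))
    return columns


# First rows of the kmscube output for a single counter.
sample = """\
# lima_gp_vertices_fetched
frame 0
frame 24
frame 24
"""

counters = parse_pframes(sample)
print(counters["lima_gp_vertices_fetched"])
```

The same helper handles several counters per row, since it pairs each header name with the value in the matching column.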
Multiple counters can be specified at the same time (note that only 2 counter sources can be enabled per core in the Mali 4xx):
    $ apitrace replay --pframes 'GL_AMD_performance_monitor:lima_gp_vertices_fetched,lima_l2_cache_words_written_all_slaves,lima_pp_program_cache_hit_count' kmscube.trace
    # lima_gp_vertices_fetched lima_l2_cache_words_written_all_slaves lima_pp_program_cache_hit_count
    frame 0 16320 0
    frame 24 30748 150580
    frame 24 30848 151072
    frame 24 30848 155708
    frame 24 30904 153604
    frame 24 30904 155548
    frame 24 30876 154252
    frame 24 30876 157252
    frame 24 30876 156900
    frame 24 30876 151456
    frame 24 30536 152640
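Because of the limit of 2 counter sources per core, it may be worth checking a counter list before replaying a long trace. This is a rough sketch only: it assumes the core can be inferred from the lima_gp_, lima_pp_ and lima_l2_ name prefixes seen in the examples, and over_budget_cores is a hypothetical helper, not something provided by mesa or apitrace. How the driver itself reacts to an oversized selection is not covered here.

```python
from collections import Counter

# Limit stated in the description for the Mali 4xx.
MAX_SOURCES_PER_CORE = 2

# Assumed naming scheme: the prefix of a counter name identifies its core.
CORE_PREFIXES = {"lima_gp_": "gp", "lima_pp_": "pp", "lima_l2_": "l2"}


def over_budget_cores(counter_names):
    """Return the sorted list of cores with more counters than the limit."""
    per_core = Counter()
    for name in counter_names:
        for prefix, core in CORE_PREFIXES.items():
            if name.startswith(prefix):
                per_core[core] += 1
                break
        else:
            raise ValueError(f"unrecognized counter name: {name}")
    return sorted(c for c, n in per_core.items() if n > MAX_SOURCES_PER_CORE)


# The selection used in the example: one counter per core, so within budget.
requested = [
    "lima_gp_vertices_fetched",
    "lima_l2_cache_words_written_all_slaves",
    "lima_pp_program_cache_hit_count",
]
print(over_budget_cores(requested))
```

An empty result means no core exceeds the limit under these assumptions.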
I would appreciate some feedback on this in its current state, mainly on these topics:
- Ensuring that a job never counts events that belong to other jobs, and that the jobs have finished executing when mesa requests the counter values. I have already tested this by running multiple counting instances simultaneously, and by running e.g. Xorg with several graphical applications in the background at once and then starting to profile. That seems to work, but more input is welcome.
- (mostly in the kernel) Counters on the l2 cache are a little trickier: it is not clear to me when the best time to start and reset counting is. For pp and gp, counting can cover the duration of the task run, and that seems to work. It seems that ARM's streamline/gator stack uses the gator kernel module to call directly into the mali kernel module to start and reset the l2 cache counters.
Kernel side is at: enunes/linux@2a5d90ac