freedreno: perfetto pps perfcounters and gpu renderstage support

Rob Clark requested to merge robclark/mesa:fd/perfetto into main

This adds the freedreno support on top of !9652 (merged). There are two main components of this:

  1. The pps datasource, which collects performance counter "countable" values and calculates derived counters. This runs as a separate process (as root) to globally collect counters. Initially there is just a small set of derived counters exposed (but enough to confirm that we are getting timestamps of GPU and CPU trace events aligned properly). The full set of counters that the blob driver exposes (to give an idea of what is missing) is here. Adding additional derived counters is relatively straightforward once the correct formula is derived, for example:
   auto PERF_CP_ALWAYS_COUNT = countable("PERF_CP_ALWAYS_COUNT");
   auto PERF_CP_BUSY_CYCLES  = countable("PERF_CP_BUSY_CYCLES");
   auto PERF_RB_3D_PIXELS    = countable("PERF_RB_3D_PIXELS");
   auto PERF_SP_FS_STAGE_FULL_ALU_INSTRUCTIONS = countable("PERF_SP_FS_STAGE_FULL_ALU_INSTRUCTIONS");
   auto PERF_SP_FS_STAGE_HALF_ALU_INSTRUCTIONS = countable("PERF_SP_FS_STAGE_HALF_ALU_INSTRUCTIONS");
   auto PERF_TP_L1_CACHELINE_MISSES = countable("PERF_TP_L1_CACHELINE_MISSES");
   auto PERF_SP_BUSY_CYCLES  = countable("PERF_SP_BUSY_CYCLES");

   /*
    * And then setup the derived counters that we are exporting to
    * pps based on the captured countable values
    */

   counter("GPU Frequency", Counter::Units::Hertz, [=]() {
         return PERF_CP_ALWAYS_COUNT / time;
      }
   );

   counter("GPU % Utilization", Counter::Units::Percent, [=]() {
         return 100.0 * (PERF_CP_BUSY_CYCLES / time) / max_freq;
      }
   );

   // This one is a bit of a guess, but seems plausible..
   counter("ALU / Fragment", Counter::Units::None, [=]() {
         return (PERF_SP_FS_STAGE_FULL_ALU_INSTRUCTIONS +
               PERF_SP_FS_STAGE_HALF_ALU_INSTRUCTIONS / 2) / PERF_RB_3D_PIXELS;
      }
   );

   counter("TP L1 Cache Misses", Counter::Units::None, [=]() {
         return PERF_TP_L1_CACHELINE_MISSES / time;
      }
   );

   counter("Shader Core Utilization", Counter::Units::Percent, [=]() {
         return 100.0 * (PERF_SP_BUSY_CYCLES / time) / (max_freq * info.num_sp_cores);
      }
   );
  2. The in-mesa component, which generates submit-event and render-stage traces, along with periodic clock-sync events so that perfetto can synchronize between the GPU and CPU clock domains. The GPU render-stage traces are built upon u_trace, which already provides a mechanism for collecting timestamps on the GPU and associating them with trace data.
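
To illustrate why the periodic clock-sync events matter, here is a minimal sketch of mapping a GPU timestamp onto the CPU timeline. It is not how perfetto or the driver implements it; the `ClockSync` struct and `gpu_to_cpu_ns` helper are hypothetical names, and the sketch assumes both clocks tick in nanoseconds and that each sync event captures the same instant in both domains.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One clock-sync event: the same instant read in both clock domains.
// (Hypothetical struct for illustration; not a Mesa or perfetto type.)
struct ClockSync {
    uint64_t cpu_ns;
    uint64_t gpu_ns;
};

// Map a GPU timestamp onto the CPU timeline using the latest sync point
// taken at or before it (or the first sync point if none is earlier).
// Assumes `syncs` is non-empty and sorted by gpu_ns.
inline uint64_t gpu_to_cpu_ns(const std::vector<ClockSync> &syncs,
                              uint64_t gpu_ns)
{
    assert(!syncs.empty());
    const ClockSync *best = &syncs.front();
    for (const ClockSync &s : syncs) {
        if (s.gpu_ns > gpu_ns)
            break;
        best = &s;
    }
    // Signed offset, so the mapping works whichever clock is ahead.
    const int64_t offset = static_cast<int64_t>(best->cpu_ns) -
                           static_cast<int64_t>(best->gpu_ns);
    return static_cast<uint64_t>(static_cast<int64_t>(gpu_ns) + offset);
}
```

Because the two clocks can drift relative to each other, the offset is re-derived from the nearest sync point rather than computed once, which is the reason the sync events need to be emitted periodically.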

The end result is something like:

[Screenshot: agi-surface-detail]

If you select the surface track, you can see details about the renderpass: surface format, size, number of bins (tiles) and tile size, etc. The part that is (misleadingly[*]) labeled "Vulkan Events" shows where submits happen on the CPU. If you highlight one, it will show you an arrow pointing to where that submit started running on the GPU:

[Screenshot: agi-submit]

[*] I am abusing "vk_queue_submit" events for this. For GL it is a bit harder to tie a submit directly to an API event, but it is still useful to be able to connect submits on the CPU to execution on the GPU.

The GUI side of this is pretty agnostic to exactly what sort of renderstages the GPU driver exposes. You send some trace events upfront that "teach" the UI about what renderstage traces and performance counters the driver will be sending, so it is pretty straightforward to expose hw-specific things.
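
The "describe first, then reference by id" pattern can be sketched roughly as follows. This is only an illustration of the idea, not the perfetto protocol or the driver code: `TraceSink`, `describe_stage` and `label` are hypothetical names, and the stage names in the usage note are made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the UI side: it knows nothing about the
// hardware until the driver sends descriptors.
struct TraceSink {
    std::vector<std::string> stage_names; // index == stage id

    // Upfront "teaching" step: announce a render stage and get its id.
    uint32_t describe_stage(const std::string &name)
    {
        stage_names.push_back(name);
        return static_cast<uint32_t>(stage_names.size() - 1);
    }

    // Later events carry only the id; resolve it for display.
    std::string label(uint32_t stage_id) const
    {
        assert(stage_id < stage_names.size());
        return stage_names[stage_id];
    }
};
```

A driver would call something like `sink.describe_stage("Binning")` once at startup and then tag each render-stage event with the returned id, so adding a new hw-specific stage needs no change on the UI side.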

Edited by Rob Clark
