util: Add u_trace_compare script for perf analysis

The intent is to provide an easy way to measure the impact of an optimization, not only by measuring the whole workload completion time but also by measuring certain chunks of the workload, such as command buffers, renderpasses, or even separate draws.

A moderate perf win in a rare case may not translate into a statistically significant overall result. An optimization may also hurt perf in some cases and help in others, which is also hard to judge from overall perf.

For best results, pin CPU/GPU frequencies and disable GPU suspend. Exclude all unnecessary tracepoints via TU_GPU_TRACEPOINT.

Usage:

    u_trace_gather.py gather_all \
        --loops 1 --launcher "renderdoccmd replay --loops 12" \
        --traces-list /path/to/traces.txt \
        --traces-dir /path/to/dir/with/traces/ \
        --results /path/to/results/ \
        --alias new-shiny-opt

    u_trace_compare.py compare \
        --results /path/to/results/ \
        --loops-merged true \
        --alias-a default \
        --alias-b new-shiny-opt \
        --event-start start_render_pass \
        --event-end end_render_pass \
        --filter "int(params['drawCount']) > 10"

This will print helped and hurt renderpasses and overall perf gain.

To dump all data in CSV format, use "--csv file_name".
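The `--filter` argument in the usage above looks like a Python expression over a `params` dictionary whose values are strings, as in the example output. A minimal sketch of how such an expression could be evaluated is shown here; `matches_filter` is a hypothetical helper written for illustration, and the script itself may evaluate filters differently.

```python
def matches_filter(filter_expr, params):
    """Return True if an event with the given params passes the filter.

    Assumption: the filter is a Python expression that refers to a dict
    named `params` with string values. This helper is illustrative only.
    """
    # Expose only `params` and basic converters to the expression.
    allowed_names = {
        "__builtins__": {},
        "int": int,
        "float": float,
        "params": params,
    }
    return bool(eval(filter_expr, allowed_names))


if __name__ == "__main__":
    expr = "int(params['drawCount']) > 10"
    print(matches_filter(expr, {"drawCount": "150"}))
    print(matches_filter(expr, {"drawCount": "5"}))
```

Because tracepoint parameters arrive as strings, the explicit `int(...)` conversion in the filter matters: comparing `'150' > 10` directly would raise a `TypeError` in Python 3.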

Example text output:

[anno5-low.rdc] start_render_pass ∇ 385.7% ks: 1.0 pval: 0.0000
        measurements_sysmem: (191204, 177892, 182052, 178620, 178828, 179868, 176956, 179296, 177684)
        measurements_gmem: (875264, 871936, 878904, 875732, 875680, 875680, 875628, 873808, 876720)
        {'width': '1920', 'height': '1080', 'attachment_count': '3', 'numberOfBins': '10', 'binWidth': '384', 'binHeight': '544', 'maxSamples': '1', 'clearCPP': '0', 'loadCPP': '12', 'storeCPP': '12', 'tiledRender': '0', 'drawCount': '150', 'totalPerSampleBandwidth': '482'}

[anno5-low.rdc] start_render_pass ∇ 502.4% ks: 1.0 pval: 0.0000
        measurements_sysmem: (31876, 32552, 32656, 31096, 31928, 32292, 32656, 31928, 32188)
        measurements_gmem: (192452, 193284, 193232, 193544, 194740, 192556, 194168, 194012, 193960)
        {'width': '1920', 'height': '1080', 'attachment_count': '2', 'numberOfBins': '7', 'binWidth': '288', 'binHeight': '1088', 'maxSamples': '1', 'clearCPP': '0', 'loadCPP': '8', 'storeCPP': '8', 'tiledRender': '0', 'drawCount': '13', 'totalPerSampleBandwidth': '39'}

[GTAV_normal_1.rdc] start_render_pass ∇ 755.0% ks: 1.0 pval: 0.0000
        measurements_sysmem: (61516, 62764, 63076, 63128, 62296, 63232, 61880, 62140, 61360)
        measurements_gmem: (532896, 529412, 529204, 531336, 533208, 533728, 538824, 532636, 538824)
        {'width': '2560', 'height': '1440', 'attachment_count': '2', 'numberOfBins': '18', 'binWidth': '1344', 'binHeight': '160', 'maxSamples': '1', 'clearCPP': '0', 'loadCPP': '12', 'storeCPP': '12', 'tiledRender': '0', 'drawCount': '19', 'totalPerSampleBandwidth': '100'}

[DyingLight-low.rdc] start_render_pass ∇ 863.3% ks: 1.0 pval: 0.0000
        measurements_sysmem: (28340, 28444, 28808, 28080, 29328, 28132, 28600, 28028, 29016)
        measurements_gmem: (275288, 274924, 275444, 274924, 274456, 274716, 274300, 274248, 275288)
        {'width': '1366', 'height': '768', 'attachment_count': '3', 'numberOfBins': '5', 'binWidth': '1536', 'binHeight': '160', 'maxSamples': '1', 'clearCPP': '0', 'loadCPP': '9', 'storeCPP': '9', 'tiledRender': '0', 'drawCount': '11', 'totalPerSampleBandwidth': '44'}

TOTAL: 39 ∇ 46.83290%
TOTAL HELPED: 5 Δ -6.4% (where 'gmem' is faster than 'sysmem')
TOTAL HURT: 31 ∇ 87.8% (where 'gmem' is slower than 'sysmem')
FILTER: int(params['drawCount']) > 10
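The per-renderpass percentages above are consistent with a relative change between the mean durations of the two aliases, and a `ks` of 1.0 is what a two-sample Kolmogorov-Smirnov statistic gives when the two sets of measurements do not overlap at all. The sketch below reproduces both numbers for the first renderpass in pure Python; the function names are hypothetical, and the script's own computation may differ.

```python
from bisect import bisect_right
from statistics import mean

# Measurements copied from the first renderpass in the example output.
sysmem = (191204, 177892, 182052, 178620, 178828, 179868, 176956, 179296, 177684)
gmem = (875264, 871936, 878904, 875732, 875680, 875680, 875628, 873808, 876720)


def relative_change_percent(a, b):
    """Relative change of mean(b) with respect to mean(a), in percent."""
    return (mean(b) / mean(a) - 1.0) * 100.0


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    sorted_a, sorted_b = sorted(a), sorted(b)

    def ecdf_gap(x):
        cdf_a = bisect_right(sorted_a, x) / len(sorted_a)
        cdf_b = bisect_right(sorted_b, x) / len(sorted_b)
        return abs(cdf_a - cdf_b)

    # The largest gap is always reached at one of the sample points.
    return max(ecdf_gap(x) for x in sorted_a + sorted_b)


if __name__ == "__main__":
    print(f"{relative_change_percent(sysmem, gmem):.1f}%")  # 385.7%
    print(ks_statistic(sysmem, gmem))  # 1.0
```

A KS statistic close to 1.0 with a near-zero p-value means the difference is far outside run-to-run noise, which is why such renderpasses are reported individually rather than being lost in the total.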

This is the third incarnation of this script, and I now believe it is usable. Aside from a few improvements to the script itself, the following was done:

  • !29220 (merged) - makes proper frame separation possible with vk + u_trace, so we can compare looped frames from a single renderdoccmd replay.
  • https://github.com/doitsujin/dxvk/pull/4003 - adds an option for DXVK to emit the same commands for the same frame. Without it, the same frame could yield a different number of renderpasses, and renderpasses with a different number of draw calls.
Edited by Danylo Piliaiev