Skip to content
  • Francisco Jerez's avatar
    intel/fs: Optimize and simplify the copy propagation dataflow logic. · 11674dad
    Francisco Jerez authored
    Previously the dataflow propagation algorithm would calculate the ACP
    live-in and -out sets in a two-pass fixed-point algorithm.  The first
    pass would update the live-out sets of all basic blocks of the program
    based on their live-in sets, while the second pass would update the
    live-in sets based on the live-out sets.  This is incredibly
    inefficient in the typical case where the CFG of the program is
    approximately acyclic, because it can take up to 2*n passes for an ACP
    entry introduced at the top of the program to reach the bottom (where
    n is the number of basic blocks in the program), until which point the
    algorithm won't be able to reach a fixed point.
    
    The same effect can be achieved in a single pass by computing the
    live-in and -out sets in lock-step, because that makes sure that
    processing of any basic block will pick up the updated live-out sets
    of the lexically preceding blocks.  This gives the dataflow
    propagation algorithm effectively O(n) run-time instead of O(n^2) in
    the acyclic case.
    
    The time spent in dataflow propagation is reduced by 30x in the
    GLES31.functional.ssbo.layout.random.all_shared_buffer.5 dEQP
    test-case on my CHV system (the improvement is likely to be of the
    same order of magnitude on other platforms).  This more than reverses
    an apparent run-time regression in this test-case from my previous
    copy-propagation undefined-value handling patch, which was ultimately
    caused by the additional work introduced in that commit to account for
    undefined values being multiplied by a huge quadratic factor.
    
    According to Chad this test was failing on CHV due to a 30s time-out
    imposed by the Android CTS (this was the case regardless of my
    undefined-value handling patch, even though my patch substantially
    exacerbated the issue).  On my CHV system this patch reduces the
    overall run-time of the test by approximately 12x, getting us to
    around 13s, well below the time-out.
    
    v2: Initialize live-out set to the universal set to avoid rather
        pessimistic dataflow estimation in shaders with cycles (Addresses
        performance regression reported by Eero in GpuTest Piano).
        Performance numbers given above still apply.  No shader-db changes
        with respect to master.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104271
    
    
    Reported-by: default avatarChad Versace <chadversary@chromium.org>
    Reviewed-by: default avatarIan Romanick <ian.d.romanick@intel.com>
    11674dad