    i965: Don't tell the hardware about our UAV access. · 5346c116
    Francisco Jerez authored
    The hardware documentation relating to the UAV HW-assisted coherency
    mechanism and UAV access enable bits is scarce and sometimes
    contradictory, and there's quite some guesswork behind this commit, so
    let me summarize the background first: HSW and later hardware have
    infrastructure to support a stricter form of data coherency between
    shader invocations from separate primitives.  The mechanism is
    controlled by the "Accesses UAV" bits on 3DSTATE_VS, _HS, _DS, _GS and
    _PS (or _PS_EXTRA on BDW+), and the "UAV Coherency Required" bit on
    the 3DPRIMITIVE command.
    
    Regardless of whether "UAV Coherency Required" is set, the hardware
    fixed-function units will increment a per-stage semaphore for each
    request received if "Accesses UAV" is set for the same or any lower
    stage.  An implicit DC flush is emitted by the lowermost stage with
    "Accesses UAV" set once it's done processing the request; this also
    happens regardless of the value of "UAV Coherency Required".  The
    completion of the DC flush will cause the same stage and all previous
    ones to decrement the semaphore, marking the UAV accesses for the
    primitive as coherent with L3.
    
    The "UAV Coherency Required" 3DPRIMITIVE bit will cause a pipeline
    stall before any threads are dispatched for the first FF stage with
    "Accesses UAV" set until the semaphore is cleared for the same stage.
    Effectively this guarantees that UAV memory accesses performed by
    previous primitives from any stage will be strictly ordered (and,
    thanks to the implicit DC flush, visible in memory) with UAV accesses
    from the following primitives.
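
    As a reading aid, the behaviour described in the two paragraphs
    above can be written down as a toy model.  This is only a sketch of
    my understanding of the documentation, not driver code: the enum,
    the mask layout and the function names are all made up for
    illustration.

```c
#include <stdbool.h>

/* Fixed-function stages in pipeline order (toy model, not driver code). */
enum ff_stage { FF_VS, FF_HS, FF_DS, FF_GS, FF_PS, FF_STAGE_COUNT };

/* Bit i set means "Accesses UAV" is enabled for stage i (assumed layout). */
typedef unsigned uav_mask;

/* A stage counts each request in its semaphore if "Accesses UAV" is set
 * for that stage or for any stage further down the pipeline. */
static bool
stage_tracks_requests(uav_mask mask, enum ff_stage stage)
{
   return (mask >> stage) != 0;
}

/* The lowermost stage with "Accesses UAV" set emits the implicit DC
 * flush.  Returns -1 if no stage has the bit set. */
static int
dc_flush_stage(uav_mask mask)
{
   for (int s = FF_STAGE_COUNT - 1; s >= 0; s--) {
      if (mask & (1u << s))
         return s;
   }
   return -1;
}

/* With "UAV Coherency Required" set on 3DPRIMITIVE, the stall happens at
 * the first stage with "Accesses UAV" set.  Returns -1 if nothing stalls. */
static int
coherency_stall_stage(uav_mask mask, bool uav_coherency_required)
{
   if (!uav_coherency_required)
      return -1;
   for (int s = 0; s < FF_STAGE_COUNT; s++) {
      if (mask & (1u << s))
         return s;
   }
   return -1;
}
```

    Note that in this model the semaphore book-keeping and the DC flush
    depend only on the "Accesses UAV" mask, while the stall additionally
    requires the 3DPRIMITIVE bit.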
    
    None of this is required by the usual image, atomic counter and SSBO
    GL APIs which have very relaxed cross-primitive coherency and ordering
    requirements, so we never actually set the "UAV Coherency
    Required" bit.  Ordering with respect to shader invocations from
    previous stages on the same primitive where there is a data dependency
    is of course already guaranteed as the spec requires, regardless of
    this mechanism being enabled.  We do set the "Accesses UAV" bits
    though since my commit ac7664e4 (which this patch partially reverts),
    mainly because of comments like the following from the BDW PRM:
    
    > 3DSTATE_GS
    > [...]
    > 12 Accesses UAV
    >    Format: Enable
    >    This field must be set when GS has a UAV access.
    
    There are similar comments in the documentation for the other
    3DSTATE_*S commands.  The "must" part is misleading and unjustified
    AFAIK.  Most of the "Accesses UAV" bits don't seem to have any side
    effects other than the implicit DC flushes and the related
    book-keeping in anticipation of a subsequent primitive with "UAV
    Coherency Required" set, so in most cases they are unnecessary and may
    incur a performance penalty.  There is an exception though.  On Gen8+
    the PS_EXTRA UAV access bit influences the calculation of the PS
    UAV-only and ThreadDispatchEnable signals which on previous
    generations were set explicitly by the driver, so we cannot always
    avoid enabling it on the PS stage.
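
    The resulting policy can be sketched roughly as follows.  This is a
    hypothetical helper written for illustration only: the names are not
    taken from the driver, and the Gen8+ PS condition shown here is a
    conservative assumption (the condition the driver really needs may
    be narrower).

```c
#include <stdbool.h>

/* Illustrative stage list; not the driver's own enum. */
enum shader_stage { STAGE_VS, STAGE_HS, STAGE_DS, STAGE_GS, STAGE_PS };

/* Hypothetical helper: should "Accesses UAV" be programmed for this
 * stage, given the hardware generation and whether the shader bound to
 * the stage performs any UAV access? */
static bool
want_accesses_uav_bit(int gen, enum shader_stage stage,
                      bool shader_has_uav_access)
{
   /* Pre-PS stages: the bit only buys implicit DC flushes and semaphore
    * book-keeping that nothing consumes, so leave it clear. */
   if (stage != STAGE_PS)
      return false;

   /* Gen8+: the PS_EXTRA bit feeds the PS UAV-only and
    * ThreadDispatchEnable calculation, so (conservatively, as an
    * assumption of this sketch) keep it whenever the PS accesses a UAV. */
   if (gen >= 8)
      return shader_has_uav_access;

   /* Earlier generations: those signals are set explicitly by the
    * driver, so the bit is not needed. */
   return false;
}
```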
    
    The primary motivation for this change is that the hardware
    coherency mechanism is in fact buggy and will cause a rather
    non-deterministic hang on Gen8 when VS is the only stage with
    "Accesses UAV" set and the processing of a request terminates
    immediately after the implicit DC flush is sent for a previous
    primitive, with no additional vertices being emitted for the second
    primitive.  In that case the hardware skips sending a second DC flush
    and the VS stalls indefinitely waiting for a response from the DC
    (BDWGFX HSD 1912017).
    This hardware bug can be reproduced on current master with the
    spec@arb_shader_image_load_store@host-mem-barrier@Indirect/RaW piglit
    subtest (if you have the patience to run it a few dozen times).
    
    The proposed workaround is to insert CS STALLs speculatively between
    3DPRIMITIVE commands when "Accesses UAV" is enabled for the VS stage
    only.  Because this would affect one of the hottest paths in the
    driver and likely decrease performance even further due to the
    unnecessary serialization, and because we don't actually need the
    implicit DC flushes, it seems better to just disable them.
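
    For reference, the condition under which the rejected workaround
    would have had to emit a CS STALL between 3DPRIMITIVE commands boils
    down to the predicate below.  The bit layout and names are made up
    for this sketch and don't correspond to anything in the driver.

```c
#include <stdbool.h>

/* One bit per stage with "Accesses UAV" enabled (illustrative layout). */
#define UAV_BIT_VS (1u << 0)
#define UAV_BIT_HS (1u << 1)
#define UAV_BIT_DS (1u << 2)
#define UAV_BIT_GS (1u << 3)
#define UAV_BIT_PS (1u << 4)

/* The workaround would only be needed when VS is the one and only stage
 * with "Accesses UAV" set. */
static bool
would_need_cs_stall_between_prims(unsigned accesses_uav_bits)
{
   return accesses_uav_bits == UAV_BIT_VS;
}
```

    With the bits left unset for the VS stage the predicate is never
    true, which is why disabling them avoids the stall altogether.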
    
    Cc: 11.0 <mesa-stable@lists.freedesktop.org>