[Mesa-dev] [PATCH] i965: Don't tell the hardware about our UAV access.

Francisco Jerez currojerez at riseup.net
Sat Aug 15 02:24:37 PDT 2015


Roland Scheidegger <sroland at vmware.com> writes:

> I guess though you'd need these bits when implementing things like
> ARB_fragment_shader_ordering? (Thus stuff actually looks useful but I
> don't know if anybody wants to implement it in the near term.)
>
I believe that could be implemented using the thread dependency
mechanism which is independent of the UAV coherency stuff.  BTW I wasn't
aware that the extension had become ARB.

> Roland
>
>
>
> Am 14.08.2015 um 18:27 schrieb Francisco Jerez:
>> The hardware documentation relating to the UAV HW-assisted coherency
>> mechanism and UAV access enable bits is scarce and sometimes
>> contradictory, and there's quite some guesswork behind this commit, so
>> let me summarize the background first: HSW and later hardware have
>> infrastructure to support a stricter form of data coherency between
>> shader invocations from separate primitives.  The mechanism is
>> controlled by the "Accesses UAV" bits on 3DSTATE_VS, _HS, _DS, _GS and
>> _PS (or _PS_EXTRA on BDW+), and the "UAV Coherency Required" bit on
>> the 3DPRIMITIVE command.
>> 
>> Regardless of whether "UAV Coherency Required" is set, the hardware
>> fixed-function units will increment a per-stage semaphore for each
>> request received if "Accesses UAV" is set for the same or any lower
>> stage.  An implicit DC flush is emitted by the lowermost stage with
>> "Accesses UAV" set once it's done processing the request, this also
>> happens regardless of the value of "UAV Coherency Required".  The
>> completion of the DC flush will cause the same stage and all previous
>> ones to decrement the semaphore, marking the UAV accesses for the
>> primitive as coherent with L3.
>> 
>> The "UAV Coherency Required" 3DPRIMITIVE bit will cause a pipeline
>> stall before any threads are dispatched for the first FF stage with
>> "Accesses UAV" set until the semaphore is cleared for the same stage.
>> Effectively this guarantees that UAV memory accesses performed by
>> previous primitives from any stage will be strictly ordered (and
>> thanks to the implicit DC flush visible in memory) with UAV accesses
>> from the following primitives.
>> 
>> None of this is required by the usual image, atomic counter and SSBO
>> GL APIs which have very relaxed cross-primitive coherency and ordering
>> requirements, so we don't actually ever set the "UAV Coherency
>> Required" bit -- Ordering with respect to shader invocations from
>> previous stages on the same primitive where there is a data dependency
>> is of course already guaranteed as the spec requires, regardless of
>> this mechanism being enabled.  We do set the "Accesses UAV" bits
>> though since my commit ac7664e493655e290783c23a0412b9c70936da50 (which
>> this patch partially reverts), mainly because of comments like the
>> following from the BDW PRM:
>> 
>>> 3DSTATE_GS
>>> [...]
>>> 12 Accesses UAV
>>>    Format: Enable
>>>    This field must be set when GS has a UAV access.
>> 
>> There are similar comments in the documentation for the other
>> 3DSTATE_*S commands.  The "must" part is misleading and unjustified
>> AFAIK.  Most of the "Accesses UAV" bits don't seem to have any side
>> effects other than the implicit DC flushes and the related
>> book-keeping in anticipation for a subsequent primitive with "UAV
>> Coherency Required" set, so in most cases they are unnecessary and may
>> incur a performance penalty.  There is an exception though.  On Gen8+
>> the PS_EXTRA UAV access bit influences the calculation of the PS
>> UAV-only and ThreadDispatchEnable signals which on previous
>> generations were set explicitly by the driver, so we cannot always
>> avoid enabling it on the PS stage.
>> 
>> The primary motivation for this change is that in fact the hardware
>> coherency mechanism is buggy and will cause a rather non-deterministic
>> hang on Gen8 when VS is the only stage with "Accesses UAV" set and the
>> processing of a request terminates immediately after the implicit DC
>> flush is sent for a previous primitive with no additional vertices
>> being emitted for the second primitive, what will cause the hardware
>> to skip sending a second DC flush and cause the VS to stall
>> indefinitely waiting for a response from the DC (BDWGFX HSD 1912017).
>> This hardware bug can be reproduced on current master with the
>> spec at arb_shader_image_load_store@host-mem-barrier at Indirect/RaW piglit
>> subtest (if you have the patience to run it a few dozen times).
>> 
>> The proposed workaround is to insert CS STALLs speculatively between
>> 3DPRIMITIVE commands when "Accesses UAV" is enabled for the VS stage
>> only.  Because this would affect one of the hottest paths in the
>> driver and likely decrease performance even further due to the
>> unnecessary serialization, and because we don't actually need the
>> implicit DC flushes, it seems better to just disable them.
>> ---
>>  src/mesa/drivers/dri/i965/gen7_gs_state.c |  4 +---
>>  src/mesa/drivers/dri/i965/gen7_vs_state.c |  4 +---
>>  src/mesa/drivers/dri/i965/gen7_wm_state.c | 12 ++++++++----
>>  src/mesa/drivers/dri/i965/gen8_gs_state.c |  4 +---
>>  src/mesa/drivers/dri/i965/gen8_ps_state.c | 32 ++++++++++++++++++++++++++++---
>>  src/mesa/drivers/dri/i965/gen8_vs_state.c |  4 +---
>>  6 files changed, 41 insertions(+), 19 deletions(-)
>> 
>> diff --git a/src/mesa/drivers/dri/i965/gen7_gs_state.c b/src/mesa/drivers/dri/i965/gen7_gs_state.c
>> index 497ecec..8d6d3fe 100644
>> --- a/src/mesa/drivers/dri/i965/gen7_gs_state.c
>> +++ b/src/mesa/drivers/dri/i965/gen7_gs_state.c
>> @@ -59,9 +59,7 @@ upload_gs_state(struct brw_context *brw)
>>        OUT_BATCH(((ALIGN(stage_state->sampler_count, 4)/4) <<
>>                   GEN6_GS_SAMPLER_COUNT_SHIFT) |
>>                  ((brw->gs.prog_data->base.base.binding_table.size_bytes / 4) <<
>> -                 GEN6_GS_BINDING_TABLE_ENTRY_COUNT_SHIFT) |
>> -                (brw->is_haswell && prog_data->base.nr_image_params ?
>> -                 HSW_GS_UAV_ACCESS_ENABLE : 0));
>> +                 GEN6_GS_BINDING_TABLE_ENTRY_COUNT_SHIFT));
>>  
>>        if (brw->gs.prog_data->base.base.total_scratch) {
>>           OUT_RELOC(stage_state->scratch_bo,
>> diff --git a/src/mesa/drivers/dri/i965/gen7_vs_state.c b/src/mesa/drivers/dri/i965/gen7_vs_state.c
>> index b7e4858..a18dc69 100644
>> --- a/src/mesa/drivers/dri/i965/gen7_vs_state.c
>> +++ b/src/mesa/drivers/dri/i965/gen7_vs_state.c
>> @@ -126,9 +126,7 @@ upload_vs_state(struct brw_context *brw)
>>  	     ((ALIGN(stage_state->sampler_count, 4)/4) <<
>>                GEN6_VS_SAMPLER_COUNT_SHIFT) |
>>               ((brw->vs.prog_data->base.base.binding_table.size_bytes / 4) <<
>> -              GEN6_VS_BINDING_TABLE_ENTRY_COUNT_SHIFT) |
>> -             (brw->is_haswell && prog_data->base.nr_image_params ?
>> -              HSW_VS_UAV_ACCESS_ENABLE : 0));
>> +              GEN6_VS_BINDING_TABLE_ENTRY_COUNT_SHIFT));
>>  
>>     if (prog_data->base.total_scratch) {
>>        OUT_RELOC(stage_state->scratch_bo,
>> diff --git a/src/mesa/drivers/dri/i965/gen7_wm_state.c b/src/mesa/drivers/dri/i965/gen7_wm_state.c
>> index fd6dab5..06d5e65 100644
>> --- a/src/mesa/drivers/dri/i965/gen7_wm_state.c
>> +++ b/src/mesa/drivers/dri/i965/gen7_wm_state.c
>> @@ -113,7 +113,14 @@ upload_wm_state(struct brw_context *brw)
>>     else if (prog_data->base.nr_image_params)
>>        dw1 |= GEN7_WM_EARLY_DS_CONTROL_PSEXEC;
>>  
>> -   /* _NEW_BUFFERS | _NEW_COLOR */
>> +   /* The "UAV access enable" bits are unnecessary on HSW because they only
>> +    * seem to have an effect on the HW-assisted coherency mechanism which we
>> +    * don't need, and the rasterization-related UAV_ONLY flag and the
>> +    * DISPATCH_ENABLE bit can be set independently from it.
>> +    * C.f. gen8_upload_ps_extra().
>> +    *
>> +    * BRW_NEW_FRAGMENT_PROGRAM | BRW_NEW_FS_PROG_DATA | _NEW_BUFFERS | _NEW_COLOR
>> +    */
>>     if (brw->is_haswell &&
>>         !(brw_color_buffer_write_enabled(brw) || writes_depth) &&
>>         prog_data->base.nr_image_params)
>> @@ -221,9 +228,6 @@ gen7_upload_ps_state(struct brw_context *brw,
>>        _mesa_get_min_invocations_per_fragment(ctx, fp, false);
>>     assert(min_inv_per_frag >= 1);
>>  
>> -   if (brw->is_haswell && prog_data->base.nr_image_params)
>> -      dw4 |= HSW_PS_UAV_ACCESS_ENABLE;
>> -
>>     if (prog_data->prog_offset_16 || prog_data->no_8) {
>>        dw4 |= GEN7_PS_16_DISPATCH_ENABLE;
>>        if (!prog_data->no_8 && min_inv_per_frag == 1) {
>> diff --git a/src/mesa/drivers/dri/i965/gen8_gs_state.c b/src/mesa/drivers/dri/i965/gen8_gs_state.c
>> index 81bd3b2..26a02d3 100644
>> --- a/src/mesa/drivers/dri/i965/gen8_gs_state.c
>> +++ b/src/mesa/drivers/dri/i965/gen8_gs_state.c
>> @@ -52,9 +52,7 @@ gen8_upload_gs_state(struct brw_context *brw)
>>                  ((ALIGN(stage_state->sampler_count, 4)/4) <<
>>                   GEN6_GS_SAMPLER_COUNT_SHIFT) |
>>                  ((prog_data->base.binding_table.size_bytes / 4) <<
>> -                 GEN6_GS_BINDING_TABLE_ENTRY_COUNT_SHIFT) |
>> -                (prog_data->base.nr_image_params ?
>> -                 HSW_GS_UAV_ACCESS_ENABLE : 0));
>> +                 GEN6_GS_BINDING_TABLE_ENTRY_COUNT_SHIFT));
>>  
>>        if (brw->gs.prog_data->base.base.total_scratch) {
>>           OUT_RELOC64(stage_state->scratch_bo,
>> diff --git a/src/mesa/drivers/dri/i965/gen8_ps_state.c b/src/mesa/drivers/dri/i965/gen8_ps_state.c
>> index ae18f0f..a6c9ab3 100644
>> --- a/src/mesa/drivers/dri/i965/gen8_ps_state.c
>> +++ b/src/mesa/drivers/dri/i965/gen8_ps_state.c
>> @@ -25,6 +25,7 @@
>>  #include "program/program.h"
>>  #include "brw_state.h"
>>  #include "brw_defines.h"
>> +#include "brw_wm.h"
>>  #include "intel_batchbuffer.h"
>>  
>>  void
>> @@ -61,8 +62,33 @@ gen8_upload_ps_extra(struct brw_context *brw,
>>     if (brw->gen >= 9 && prog_data->pulls_bary)
>>        dw1 |= GEN9_PSX_SHADER_PULLS_BARY;
>>  
>> -   if (_mesa_active_fragment_shader_has_atomic_ops(&brw->ctx) ||
>> -       prog_data->base.nr_image_params)
>> +   /* The stricter cross-primitive coherency guarantees that the hardware
>> +    * gives us with the "Accesses UAV" bit set for at least one shader stage
>> +    * and the "UAV coherency required" bit set on the 3DPRIMITIVE command are
>> +    * redundant within the current image, atomic counter and SSBO GL APIs,
>> +    * which all have very loose ordering and coherency requirements and
>> +    * generally rely on the application to insert explicit barriers when a
>> +    * shader invocation is expected to see the memory writes performed by the
>> +    * invocations of some previous primitive.  Regardless of the value of "UAV
>> +    * coherency required", the "Accesses UAV" bits will implicitly cause an in
>> +    * most cases useless DC flush when the lowermost stage with the bit set
>> +    * finishes execution.
>> +    *
>> +    * It would be nice to disable it, but in some cases we can't because on
>> +    * Gen8+ it also has an influence on rasterization via the PS UAV-only
>> +    * signal (which could be set independently from the coherency mechanism in
>> +    * the 3DSTATE_WM command on Gen7), and because in some cases it will
>> +    * determine whether the hardware skips execution of the fragment shader or
>> +    * not via the ThreadDispatchEnable signal.  However if we know that
>> +    * GEN8_PS_BLEND_HAS_WRITEABLE_RT is going to be set and
>> +    * GEN8_PSX_PIXEL_SHADER_NO_RT_WRITE is not set it shouldn't make any
>> +    * difference so we may just disable it here.
>> +    *
>> +    * BRW_NEW_FS_PROG_DATA | BRW_NEW_FRAGMENT_PROGRAM | _NEW_BUFFERS | _NEW_COLOR
>> +    */
>> +   if ((_mesa_active_fragment_shader_has_atomic_ops(&brw->ctx) ||
>> +        prog_data->base.nr_image_params) &&
>> +       !brw_color_buffer_write_enabled(brw))
>>        dw1 |= GEN8_PSX_SHADER_HAS_UAV;
>>  
>>     BEGIN_BATCH(2);
>> @@ -87,7 +113,7 @@ upload_ps_extra(struct brw_context *brw)
>>  
>>  const struct brw_tracked_state gen8_ps_extra = {
>>     .dirty = {
>> -      .mesa  = 0,
>> +      .mesa  = _NEW_BUFFERS | _NEW_COLOR,
>>        .brw   = BRW_NEW_CONTEXT |
>>                 BRW_NEW_FRAGMENT_PROGRAM |
>>                 BRW_NEW_FS_PROG_DATA |
>> diff --git a/src/mesa/drivers/dri/i965/gen8_vs_state.c b/src/mesa/drivers/dri/i965/gen8_vs_state.c
>> index 8b5048b..28f5add 100644
>> --- a/src/mesa/drivers/dri/i965/gen8_vs_state.c
>> +++ b/src/mesa/drivers/dri/i965/gen8_vs_state.c
>> @@ -53,9 +53,7 @@ upload_vs_state(struct brw_context *brw)
>>               ((ALIGN(stage_state->sampler_count, 4) / 4) <<
>>                 GEN6_VS_SAMPLER_COUNT_SHIFT) |
>>               ((prog_data->base.binding_table.size_bytes / 4) <<
>> -               GEN6_VS_BINDING_TABLE_ENTRY_COUNT_SHIFT) |
>> -             (prog_data->base.nr_image_params ?
>> -              HSW_VS_UAV_ACCESS_ENABLE : 0));
>> +               GEN6_VS_BINDING_TABLE_ENTRY_COUNT_SHIFT));
>>  
>>     if (prog_data->base.total_scratch) {
>>        OUT_RELOC64(stage_state->scratch_bo,
>> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150815/f9be81cb/attachment.sig>


More information about the mesa-dev mailing list