[Mesa-dev] [PATCH v2] i965: skip reading clip distances from the URB for the FS if possible

Iago Toral itoral at igalia.com
Fri Sep 29 06:00:20 UTC 2017


On Thu, 2017-09-28 at 21:55 -0700, Kenneth Graunke wrote:
> On Thursday, September 28, 2017 3:33:12 AM PDT Iago Toral Quiroga
> wrote:
> > we can skip these slots when they are not read in the fragment
> > shader
> > and they are positioned right after a VUE header that we are
> > already
> > skipping. We also need to ensure that we are passing at least one
> > other
> > varying, since that is a hardware requirement.
> > 
> > This helps alleviate a problem introduced with 99df02ca26f61 for
> > separate shader objects: without separate shader objects we assign
> > locations sequentially, however, since that commit we have changed
> > the
> > method for SSO so that the VUE slot assigned depends on the number
> > of
> > builtin slots plus the location assigned to the varying. This fixed
> > layout is intended to help SSO programs by avoiding on-the-fly
> > recompiles
> > when swapping out shaders, however, it also means that if a varying
> > uses
> > a large location number close to the maximum allowed by the SF/FS
> > units
> > (31), then the offset introduced by the number of builtin slots can
> > push
> > the location outside the range and trigger an assertion.
> > 
> > This problem is affecting at least the following CTS tests for
> > enhanced layouts:
> > 
> > KHR-GL45.enhanced_layouts.varying_array_components
> > KHR-GL45.enhanced_layouts.varying_array_locations
> > KHR-GL45.enhanced_layouts.varying_components
> > KHR-GL45.enhanced_layouts.varying_locations
> > 
> > which use SSO and the the location layout qualifier to select such
> > location numbers explicitly.
> > 
> > This change helps these tests because for SSO we always have to
> > include
> > VARYING_SLOT_CLIP_DIST{0,1} even if the fragment shader is very
> > unlikely
> > to read them, so by doing this we free builtin slots from the fixed
> > VUE
> > layout and we avoid the tests to crash in this scenario.
> > 
> > Of course, this is not a proper fix, we'd still run into problems
> > if someone
> > tries to use an explicit max location and read gl_ViewportIndex,
> > gl_LayerID or
> > gl_CullDistancein in the FS, but that would be a much less common
> > bug and we
> > can probably wait to see if anyone actually runs into that
> > situation in a real
> > world scenario before making the decision that more aggresive
> > changes are
> > required to support this without reverting 99df02ca26f61.
> > 
> > v2:
> > - Add a debug message when we skip clip distances (Ilia)
> > - we also need to account for this when we compute the urb setup
> >   for the fragment shader stage, so add a compiler util to compute
> >   the first slot that we need to read from the URB instead of
> >   replicating the logic in both places.
> > 
> > Suggested-by: Kenneth Graunke <kenneth at whitecape.org>
> > ---
> >  src/intel/compiler/brw_compiler.h             | 50
> > +++++++++++++++++++++++++++
> >  src/intel/compiler/brw_fs.cpp                 |  7 ++--
> >  src/mesa/drivers/dri/i965/genX_state_upload.c | 16 ++++-----
> >  3 files changed, 59 insertions(+), 14 deletions(-)
> > 
> > diff --git a/src/intel/compiler/brw_compiler.h
> > b/src/intel/compiler/brw_compiler.h
> > index 6753a8daf0..411919d829 100644
> > --- a/src/intel/compiler/brw_compiler.h
> > +++ b/src/intel/compiler/brw_compiler.h
> > @@ -25,6 +25,7 @@
> >  #define BRW_COMPILER_H
> >  
> >  #include <stdio.h>
> > +#include "common/gen_debug.h"
> >  #include "common/gen_device_info.h"
> >  #include "main/mtypes.h"
> >  #include "main/macros.h"
> > @@ -1222,6 +1223,55 @@ brw_stage_has_packed_dispatch(const struct
> > gen_device_info *devinfo,
> >     }
> >  }
> >  
> > +static inline int
> > +brw_compute_sf_first_urb_slot(uint64_t inputs_read,
> > +                              const struct brw_vue_map
> > *prev_stage_vue_map)
> > +{
> > +   /* If the fragment shader reads VARYING_SLOT_LAYER, then we
> > need to pass in
> > +    * the full vertex header.  Otherwise, we can program the SF to
> > start
> > +    * reading at an offset of 1 (2 varying slots) to skip
> > unnecessary data:
> > +    * - VARYING_SLOT_PSIZ and BRW_VARYING_SLOT_NDC on gen4-5
> > +    * - VARYING_SLOT_{PSIZ,LAYER} and VARYING_SLOT_POS on gen6+
> > +    */
> > +   bool include_vue_header =
> > +     inputs_read & (VARYING_BIT_LAYER | VARYING_BIT_VIEWPORT);
> > +
> > +   int first_slot =
> > +      include_vue_header ? 0 : 2 * BRW_SF_URB_ENTRY_READ_OFFSET;
> > +
> > +   /* If we are already skipping the first 2 slots for the VUE
> > header then we
> > +    * can also skip clip distances if they are located right after
> > the header
> > +    * and the FS doesn't read them. Only do this is there are any
> > user varyings
> > +    * though, since the hardware requires that we pass at least
> > one varying
> > +    * between stages.
> > +    */
> > +   uint64_t var_bits = inputs_read & (~(VARYING_BIT_VAR(0) - 1));
> > +   uint64_t clip_dist_bits = VARYING_BIT_CLIP_DIST0 |
> > VARYING_BIT_CLIP_DIST1;
> > +
> > +   if (!include_vue_header && var_bits &&
> > +       (inputs_read & clip_dist_bits) == 0 &&
> > +       (prev_stage_vue_map->slots_valid & clip_dist_bits)) {
> > +
> > +      uint32_t clip_dist0_slot =
> > +         prev_stage_vue_map-
> > >varying_to_slot[VARYING_SLOT_CLIP_DIST0];
> > +
> > +      uint32_t clip_dist1_slot =
> > +         prev_stage_vue_map-
> > >varying_to_slot[VARYING_SLOT_CLIP_DIST1];
> > +
> > +      if (clip_dist0_slot >= 2 && clip_dist0_slot <= 3 &&
> > +          clip_dist1_slot >= 2 && clip_dist1_slot <= 3) {
> > +         first_slot += 2 * BRW_SF_URB_ENTRY_READ_OFFSET;
> 
> Do we actually really care about clip distance,
> specifically?  Couldn't
> we basically just do:
> 
>    for (int i = 0; i < prev_stage_vue_map->num_slots; i++) {
>       int varying = prev_stage_vue_map->slot_to_varying[i];
>       if (varying > 0 && (inputs_read & BITFIELD64_BIT(varying)) !=
> 0)
>          return ROUND_DOWN_TO(i, 2);
>    }
> 
>    return 0;
> 
> to find the first slot occupied by something, or the header slot if
> there are no varyings read at all?  That would take care of skipping
> over the header, clip distance, gl_Color, and any other junk slots.

Yes, that is a lot more useful, I don't know why I focused so much in
clip distances specifically. I'll send a v3, thanks Ken!

> > +
> > +         if (unlikely(INTEL_DEBUG & DEBUG_URB)) {
> > +            fprintf(stderr, "SF unit skips unused clip distance
> > slots after "
> > +                    "the VUE header.\n");
> > +         }
> > +      }
> > +   }
> > +
> > +   return first_slot;
> > +}
> > +
> >  #ifdef __cplusplus
> >  } /* extern "C" */
> >  #endif
> > diff --git a/src/intel/compiler/brw_fs.cpp
> > b/src/intel/compiler/brw_fs.cpp
> > index b13b56cfba..5b444f7baa 100644
> > --- a/src/intel/compiler/brw_fs.cpp
> > +++ b/src/intel/compiler/brw_fs.cpp
> > @@ -1481,9 +1481,6 @@ fs_visitor::calculate_urb_setup()
> >              }
> >           }
> >        } else {
> > -         bool include_vue_header =
> > -            nir->info.inputs_read & (VARYING_BIT_LAYER |
> > VARYING_BIT_VIEWPORT);
> > -
> >           /* We have enough input varyings that the SF/SBE pipeline
> > stage can't
> >            * arbitrarily rearrange them to suit our whim; we have
> > to put them
> >            * in an order that matches the output of the previous
> > pipeline stage
> > @@ -1493,8 +1490,10 @@ fs_visitor::calculate_urb_setup()
> >           brw_compute_vue_map(devinfo, &prev_stage_vue_map,
> >                               key->input_slots_valid,
> >                               nir->info.separate_shader);
> > +
> >           int first_slot =
> > -            include_vue_header ? 0 : 2 *
> > BRW_SF_URB_ENTRY_READ_OFFSET;
> > +            brw_compute_sf_first_urb_slot(nir->info.inputs_read,
> > +                                          &prev_stage_vue_map);
> >  
> >           assert(prev_stage_vue_map.num_slots <= first_slot + 32);
> >           for (int slot = first_slot; slot <
> > prev_stage_vue_map.num_slots;
> > diff --git a/src/mesa/drivers/dri/i965/genX_state_upload.c
> > b/src/mesa/drivers/dri/i965/genX_state_upload.c
> > index 2a99376e3c..a3c1cf5166 100644
> > --- a/src/mesa/drivers/dri/i965/genX_state_upload.c
> > +++ b/src/mesa/drivers/dri/i965/genX_state_upload.c
> > @@ -1029,17 +1029,13 @@ genX(calculate_attr_overrides)(const struct
> > brw_context *brw,
> >  
> >     *point_sprite_enables = 0;
> >  
> > -   /* If the fragment shader reads VARYING_SLOT_LAYER, then we
> > need to pass in
> > -    * the full vertex header.  Otherwise, we can program the SF to
> > start
> > -    * reading at an offset of 1 (2 varying slots) to skip
> > unnecessary data:
> > -    * - VARYING_SLOT_PSIZ and BRW_VARYING_SLOT_NDC on gen4-5
> > -    * - VARYING_SLOT_{PSIZ,LAYER} and VARYING_SLOT_POS on gen6+
> > -    */
> > -
> > -   bool fs_needs_vue_header = fp->info.inputs_read &
> > -      (VARYING_BIT_LAYER | VARYING_BIT_VIEWPORT);
> > +   int first_slot =
> > +      brw_compute_sf_first_urb_slot(fp->info.inputs_read,
> > +                                    &brw->vue_map_geom_out);
> >  
> > -   *urb_entry_read_offset = fs_needs_vue_header ? 0 : 1;
> > +   /* There are 2 slots in each offset */
> > +   assert(first_slot % 2 == 0);
> > +   *urb_entry_read_offset = first_slot / 2;
> >  
> >     /* From the Ivybridge PRM, Vol 2 Part 1, 3DSTATE_SBE,
> >      * description of dw10 Point Sprite Texture Coordinate Enable:
> > 
> 
> 


More information about the mesa-dev mailing list