[Mesa-dev] [PATCH v2] i965: skip reading clip distances from the URB for the FS if possible
Iago Toral
itoral at igalia.com
Fri Sep 29 06:00:20 UTC 2017
On Thu, 2017-09-28 at 21:55 -0700, Kenneth Graunke wrote:
> On Thursday, September 28, 2017 3:33:12 AM PDT Iago Toral Quiroga
> wrote:
> > we can skip these slots when they are not read in the fragment
> > shader
> > and they are positioned right after a VUE header that we are
> > already
> > skipping. We also need to ensure that we are passing at least one
> > other
> > varying, since that is a hardware requirement.
> >
> > This helps alleviate a problem introduced with 99df02ca26f61 for
> > separate shader objects: without separate shader objects we assign
> > locations sequentially, however, since that commit we have changed
> > the
> > method for SSO so that the VUE slot assigned depends on the number
> > of
> > builtin slots plus the location assigned to the varying. This fixed
> > layout is intended to help SSO programs by avoiding on-the-fly
> > recompiles
> > when swapping out shaders, however, it also means that if a varying
> > uses
> > a large location number close to the maximum allowed by the SF/FS
> > units
> > (31), then the offset introduced by the number of builtin slots can
> > push
> > the location outside the range and trigger an assertion.
> >
> > This problem is affecting at least the following CTS tests for
> > enhanced layouts:
> >
> > KHR-GL45.enhanced_layouts.varying_array_components
> > KHR-GL45.enhanced_layouts.varying_array_locations
> > KHR-GL45.enhanced_layouts.varying_components
> > KHR-GL45.enhanced_layouts.varying_locations
> >
> > which use SSO and the the location layout qualifier to select such
> > location numbers explicitly.
> >
> > This change helps these tests because for SSO we always have to
> > include
> > VARYING_SLOT_CLIP_DIST{0,1} even if the fragment shader is very
> > unlikely
> > to read them, so by doing this we free builtin slots from the fixed
> > VUE
> > layout and we avoid the tests to crash in this scenario.
> >
> > Of course, this is not a proper fix, we'd still run into problems
> > if someone
> > tries to use an explicit max location and read gl_ViewportIndex,
> > gl_LayerID or
> > gl_CullDistancein in the FS, but that would be a much less common
> > bug and we
> > can probably wait to see if anyone actually runs into that
> > situation in a real
> > world scenario before making the decision that more aggresive
> > changes are
> > required to support this without reverting 99df02ca26f61.
> >
> > v2:
> > - Add a debug message when we skip clip distances (Ilia)
> > - we also need to account for this when we compute the urb setup
> > for the fragment shader stage, so add a compiler util to compute
> > the first slot that we need to read from the URB instead of
> > replicating the logic in both places.
> >
> > Suggested-by: Kenneth Graunke <kenneth at whitecape.org>
> > ---
> > src/intel/compiler/brw_compiler.h | 50
> > +++++++++++++++++++++++++++
> > src/intel/compiler/brw_fs.cpp | 7 ++--
> > src/mesa/drivers/dri/i965/genX_state_upload.c | 16 ++++-----
> > 3 files changed, 59 insertions(+), 14 deletions(-)
> >
> > diff --git a/src/intel/compiler/brw_compiler.h
> > b/src/intel/compiler/brw_compiler.h
> > index 6753a8daf0..411919d829 100644
> > --- a/src/intel/compiler/brw_compiler.h
> > +++ b/src/intel/compiler/brw_compiler.h
> > @@ -25,6 +25,7 @@
> > #define BRW_COMPILER_H
> >
> > #include <stdio.h>
> > +#include "common/gen_debug.h"
> > #include "common/gen_device_info.h"
> > #include "main/mtypes.h"
> > #include "main/macros.h"
> > @@ -1222,6 +1223,55 @@ brw_stage_has_packed_dispatch(const struct
> > gen_device_info *devinfo,
> > }
> > }
> >
> > +static inline int
> > +brw_compute_sf_first_urb_slot(uint64_t inputs_read,
> > + const struct brw_vue_map
> > *prev_stage_vue_map)
> > +{
> > + /* If the fragment shader reads VARYING_SLOT_LAYER, then we
> > need to pass in
> > + * the full vertex header. Otherwise, we can program the SF to
> > start
> > + * reading at an offset of 1 (2 varying slots) to skip
> > unnecessary data:
> > + * - VARYING_SLOT_PSIZ and BRW_VARYING_SLOT_NDC on gen4-5
> > + * - VARYING_SLOT_{PSIZ,LAYER} and VARYING_SLOT_POS on gen6+
> > + */
> > + bool include_vue_header =
> > + inputs_read & (VARYING_BIT_LAYER | VARYING_BIT_VIEWPORT);
> > +
> > + int first_slot =
> > + include_vue_header ? 0 : 2 * BRW_SF_URB_ENTRY_READ_OFFSET;
> > +
> > + /* If we are already skipping the first 2 slots for the VUE
> > header then we
> > + * can also skip clip distances if they are located right after
> > the header
> > + * and the FS doesn't read them. Only do this is there are any
> > user varyings
> > + * though, since the hardware requires that we pass at least
> > one varying
> > + * between stages.
> > + */
> > + uint64_t var_bits = inputs_read & (~(VARYING_BIT_VAR(0) - 1));
> > + uint64_t clip_dist_bits = VARYING_BIT_CLIP_DIST0 |
> > VARYING_BIT_CLIP_DIST1;
> > +
> > + if (!include_vue_header && var_bits &&
> > + (inputs_read & clip_dist_bits) == 0 &&
> > + (prev_stage_vue_map->slots_valid & clip_dist_bits)) {
> > +
> > + uint32_t clip_dist0_slot =
> > + prev_stage_vue_map-
> > >varying_to_slot[VARYING_SLOT_CLIP_DIST0];
> > +
> > + uint32_t clip_dist1_slot =
> > + prev_stage_vue_map-
> > >varying_to_slot[VARYING_SLOT_CLIP_DIST1];
> > +
> > + if (clip_dist0_slot >= 2 && clip_dist0_slot <= 3 &&
> > + clip_dist1_slot >= 2 && clip_dist1_slot <= 3) {
> > + first_slot += 2 * BRW_SF_URB_ENTRY_READ_OFFSET;
>
> Do we actually really care about clip distance,
> specifically? Couldn't
> we basically just do:
>
> for (int i = 0; i < prev_stage_vue_map->num_slots; i++) {
> int varying = prev_stage_vue_map->slot_to_varying[i];
> if (varying > 0 && (inputs_read & BITFIELD64_BIT(varying)) !=
> 0)
> return ROUND_DOWN_TO(i, 2);
> }
>
> return 0;
>
> to find the first slot occupied by something, or the header slot if
> there are no varyings read at all? That would take care of skipping
> over the header, clip distance, gl_Color, and any other junk slots.
Yes, that is a lot more useful, I don't know why I focused so much in
clip distances specifically. I'll send a v3, thanks Ken!
> > +
> > + if (unlikely(INTEL_DEBUG & DEBUG_URB)) {
> > + fprintf(stderr, "SF unit skips unused clip distance
> > slots after "
> > + "the VUE header.\n");
> > + }
> > + }
> > + }
> > +
> > + return first_slot;
> > +}
> > +
> > #ifdef __cplusplus
> > } /* extern "C" */
> > #endif
> > diff --git a/src/intel/compiler/brw_fs.cpp
> > b/src/intel/compiler/brw_fs.cpp
> > index b13b56cfba..5b444f7baa 100644
> > --- a/src/intel/compiler/brw_fs.cpp
> > +++ b/src/intel/compiler/brw_fs.cpp
> > @@ -1481,9 +1481,6 @@ fs_visitor::calculate_urb_setup()
> > }
> > }
> > } else {
> > - bool include_vue_header =
> > - nir->info.inputs_read & (VARYING_BIT_LAYER |
> > VARYING_BIT_VIEWPORT);
> > -
> > /* We have enough input varyings that the SF/SBE pipeline
> > stage can't
> > * arbitrarily rearrange them to suit our whim; we have
> > to put them
> > * in an order that matches the output of the previous
> > pipeline stage
> > @@ -1493,8 +1490,10 @@ fs_visitor::calculate_urb_setup()
> > brw_compute_vue_map(devinfo, &prev_stage_vue_map,
> > key->input_slots_valid,
> > nir->info.separate_shader);
> > +
> > int first_slot =
> > - include_vue_header ? 0 : 2 *
> > BRW_SF_URB_ENTRY_READ_OFFSET;
> > + brw_compute_sf_first_urb_slot(nir->info.inputs_read,
> > + &prev_stage_vue_map);
> >
> > assert(prev_stage_vue_map.num_slots <= first_slot + 32);
> > for (int slot = first_slot; slot <
> > prev_stage_vue_map.num_slots;
> > diff --git a/src/mesa/drivers/dri/i965/genX_state_upload.c
> > b/src/mesa/drivers/dri/i965/genX_state_upload.c
> > index 2a99376e3c..a3c1cf5166 100644
> > --- a/src/mesa/drivers/dri/i965/genX_state_upload.c
> > +++ b/src/mesa/drivers/dri/i965/genX_state_upload.c
> > @@ -1029,17 +1029,13 @@ genX(calculate_attr_overrides)(const struct
> > brw_context *brw,
> >
> > *point_sprite_enables = 0;
> >
> > - /* If the fragment shader reads VARYING_SLOT_LAYER, then we
> > need to pass in
> > - * the full vertex header. Otherwise, we can program the SF to
> > start
> > - * reading at an offset of 1 (2 varying slots) to skip
> > unnecessary data:
> > - * - VARYING_SLOT_PSIZ and BRW_VARYING_SLOT_NDC on gen4-5
> > - * - VARYING_SLOT_{PSIZ,LAYER} and VARYING_SLOT_POS on gen6+
> > - */
> > -
> > - bool fs_needs_vue_header = fp->info.inputs_read &
> > - (VARYING_BIT_LAYER | VARYING_BIT_VIEWPORT);
> > + int first_slot =
> > + brw_compute_sf_first_urb_slot(fp->info.inputs_read,
> > + &brw->vue_map_geom_out);
> >
> > - *urb_entry_read_offset = fs_needs_vue_header ? 0 : 1;
> > + /* There are 2 slots in each offset */
> > + assert(first_slot % 2 == 0);
> > + *urb_entry_read_offset = first_slot / 2;
> >
> > /* From the Ivybridge PRM, Vol 2 Part 1, 3DSTATE_SBE,
> > * description of dw10 Point Sprite Texture Coordinate Enable:
> >
>
>
More information about the mesa-dev
mailing list