[Mesa-dev] [PATCH v2] intel/compiler: Properly consider UBO loads that cross 32B boundaries.

Thu Jun 14 02:17:19 UTC 2018

On Wednesday, June 13, 2018 6:52:07 PM PDT Jason Ekstrand wrote:
> On Wed, Jun 13, 2018 at 6:14 PM, Kenneth Graunke <kenneth at whitecape.org>
> wrote:
> 
> > The UBO push analysis pass incorrectly assumed that all values would fit
> > within a 32B chunk, and only recorded a bit for the 32B chunk containing
> > the starting offset.
> >
> > For example, if a UBO contained the following, tightly packed:
> >
> >    vec4 a;  // [0, 16)
> >    float b; // [16, 20)
> >    vec4 c;  // [20, 36)
> >
> > then, c would start at offset 20 / 32 = 0 and end at 36 / 32 = 1,
> > which means that we ought to record two 32B chunks in the bitfield.
> >
> > Similarly, dvec4s would suffer from the same problem.
> >
> > v2: Rewrite the accounting, my calculations were wrong.
> >
> > Reviewed-by: Rafael Antognolli <rafael.antognolli at intel.com> [v1]
> > ---
> >  src/intel/compiler/brw_nir_analyze_ubo_ranges.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/src/intel/compiler/brw_nir_analyze_ubo_ranges.c
> > b/src/intel/compiler/brw_nir_analyze_ubo_ranges.c
> > index d58fe3dd2e3..31026ca65ba 100644
> > --- a/src/intel/compiler/brw_nir_analyze_ubo_ranges.c
> > +++ b/src/intel/compiler/brw_nir_analyze_ubo_ranges.c
> > @@ -141,10 +141,17 @@ analyze_ubos_block(struct ubo_analysis_state *state,
> > nir_block *block)
> >           if (offset >= 64)
> >              continue;
> >
> 
> This should take end into account, not just offset.

I disagree, actually.  This check is to avoid left shifting by a value
larger than 64, which is undefined behavior in C.  Shifting a larger
mask by 63 and having bits fall of the end (e.g. 3ull << 64) is well
defined behavior.

Of course, that means that we'll be pushing a partial value.  That
actually works fine today, surprisingly.  In my commit message's
packed <vec4, float, vec4> example, this bug would cause us to only
push .xyz of the last vec4.  The backend compiler correctly falls
back to pull loads for the .w component.

Note that this code doesn't handle the 3DSTATE_CONSTANT_XS restriction
that the sum of all four read lengths has to be <= 64.  It can't,
realistically, because it has no idea what GL is going to do with
regular push uniforms.  So, the backend has code to prune ranges to
fit in the restricted length - which can cut off parts of vectors - so
it already properly supports this case.

> > +         /* The value might span multiple 32-byte chunks. */
> > +         const int bytes = nir_intrinsic_dest_components(intrin) *
> > +                           (nir_dest_bit_size(intrin->dest) / 8);
> > +         const int start = ROUND_DOWN_TO(offset_const->u32[0], 32);
> > +         const int end = ALIGN(offset_const->u32[0] + bytes, 32);
> >
> 
> You could probably use / and DIV_ROUND_UP here too.  I'm not sure which is
> better TBH.
> 
> --Jason

I could, but I botched that several times.  This way rounds the start
down, and the end up, to the nearest 32B address...so we expand the
range to cover the right region, all in bytes.  Then we divide.

I figured that was simpler.

> > +         const int chunks = (end - start) / 32;
> > +
> >           /* TODO: should we count uses in loops as higher benefit? */
> >
> >           struct ubo_block_info *info = get_block_info(state, block);
> > -         info->offsets |= 1ull << offset;
> > +         info->offsets |= ((1ull << chunks) - 1) << offset;
> >           info->uses[offset]++;
> >        }
> >     }
> > --
> > 2.17.0
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part.
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20180613/901001b4/attachment.sig>