[Mesa-dev] [PATCH v2] intel/compiler: Properly consider UBO loads that cross 32B boundaries.

Jason Ekstrand jason at jlekstrand.net
Thu Jun 14 02:25:37 UTC 2018


On June 13, 2018 19:17:21 Kenneth Graunke <kenneth at whitecape.org> wrote:

> On Wednesday, June 13, 2018 6:52:07 PM PDT Jason Ekstrand wrote:
>> On Wed, Jun 13, 2018 at 6:14 PM, Kenneth Graunke <kenneth at whitecape.org>
>> wrote:
>>
>>> The UBO push analysis pass incorrectly assumed that all values would fit
>>> within a 32B chunk, and only recorded a bit for the 32B chunk containing
>>> the starting offset.
>>>
>>> For example, if a UBO contained the following, tightly packed:
>>>
>>> vec4 a;  // [0, 16)
>>> float b; // [16, 20)
>>> vec4 c;  // [20, 36)
>>>
>>> then, c would start in 32B chunk 20 / 32 = 0 and end in chunk
>>> 36 / 32 = 1, which means that we ought to record two 32B chunks
>>> in the bitfield.
>>>
>>> Similarly, dvec4s would suffer from the same problem.
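To make the chunk arithmetic above concrete, here is a minimal standalone
C sketch (illustrative values only, not the pass's actual code) reproducing
the example's numbers:

#include <stdio.h>

/* The commit message's example: a vec4 'c' at byte offset 20 occupies
 * bytes [20, 36), so it touches 32-byte chunks 0 and 1. */
int main(void)
{
   const int offset = 20;   /* byte offset of 'c' */
   const int bytes  = 16;   /* vec4 = 4 components * 4 bytes */

   const int first_chunk = offset / 32;               /* 20 / 32 = 0 */
   const int last_chunk  = (offset + bytes - 1) / 32; /* 35 / 32 = 1 */

   printf("chunks %d..%d\n", first_chunk, last_chunk);
   return 0;
}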
>>>
>>> v2: Rewrite the accounting, my calculations were wrong.
>>>
>>> Reviewed-by: Rafael Antognolli <rafael.antognolli at intel.com> [v1]
>>> ---
>>> src/intel/compiler/brw_nir_analyze_ubo_ranges.c | 9 ++++++++-
>>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/src/intel/compiler/brw_nir_analyze_ubo_ranges.c b/src/intel/compiler/brw_nir_analyze_ubo_ranges.c
>>> index d58fe3dd2e3..31026ca65ba 100644
>>> --- a/src/intel/compiler/brw_nir_analyze_ubo_ranges.c
>>> +++ b/src/intel/compiler/brw_nir_analyze_ubo_ranges.c
>>> @@ -141,10 +141,17 @@ analyze_ubos_block(struct ubo_analysis_state *state, nir_block *block)
>>>    if (offset >= 64)
>>>       continue;
>>
>> This should take end into account, not just offset.
>
> I disagree, actually.  This check is to avoid left shifting by a value
> of 64 or more, which is undefined behavior in C.  Shifting a larger
> mask by 63 and having bits fall off the end (e.g. 3ull << 63) is well
> defined behavior.
>
> Of course, that means that we'll be pushing a partial value.  That
> actually works fine today, surprisingly.  In my commit message's
> packed <vec4, float, vec4> example, this bug would cause us to only
> push .xyz of the last vec4.  The backend compiler correctly falls
> back to pull loads for the .w component.
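For the record, a tiny standalone C sketch (not Mesa code) of the
distinction: the shift *count* must stay below the type's width, but bits
shifted out the top of an unsigned value are simply discarded:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
   /* Well defined: the shift count (63) is less than the width of
    * uint64_t, so the high bit of the mask just falls off the end. */
   uint64_t ok = 3ull << 63;   /* == 0x8000000000000000 */

   /* Undefined behavior: shifting by >= the type's width, e.g.
    * 1ull << 64.  The `offset >= 64` check guards exactly this case. */

   printf("0x%016llx\n", (unsigned long long)ok);
   return 0;
}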

I'm convinced. Please add a comment to that effect.

>
> Note that this code doesn't handle the 3DSTATE_CONSTANT_XS restriction
> that the sum of all four read lengths has to be <= 64.  It can't,
> realistically, because it has no idea what GL is going to do with
> regular push uniforms.  So, the backend has code to prune ranges to
> fit in the restricted length - which can cut off parts of vectors - so
> it already properly supports this case.
>
>>> +         /* The value might span multiple 32-byte chunks. */
>>> +         const int bytes = nir_intrinsic_dest_components(intrin) *
>>> +                           (nir_dest_bit_size(intrin->dest) / 8);
>>> +         const int start = ROUND_DOWN_TO(offset_const->u32[0], 32);
>>> +         const int end = ALIGN(offset_const->u32[0] + bytes, 32);
>>
>> You could probably use / and DIV_ROUND_UP here too.  I'm not sure which is
>> better TBH.
>>
>> --Jason
>
> I could, but I botched that several times.  This way rounds the start
> down, and the end up, to the nearest 32B address...so we expand the
> range to cover the right region, all in bytes.  Then we divide.
>
> I figured that was simpler.
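For comparison, a standalone sketch of the two formulations, with
simplified power-of-two variants of Mesa's ROUND_DOWN_TO/ALIGN/DIV_ROUND_UP
macros inlined so it compiles on its own; both compute the same chunk count
for the commit message's example:

#include <stdio.h>

#define ROUND_DOWN_TO(x, align) ((x) & ~((align) - 1))
#define ALIGN(x, align)         (((x) + (align) - 1) & ~((align) - 1))
#define DIV_ROUND_UP(n, d)      (((n) + (d) - 1) / (d))

int main(void)
{
   const int offset = 20, bytes = 16;   /* the vec4 'c' example */

   /* v2 patch style: expand the byte range to 32B boundaries, then divide. */
   const int start = ROUND_DOWN_TO(offset, 32);   /* 0  */
   const int end   = ALIGN(offset + bytes, 32);   /* 64 */
   const int chunks_a = (end - start) / 32;       /* 2  */

   /* The suggested alternative: divide first, rounding the end up. */
   const int chunks_b = DIV_ROUND_UP(offset + bytes, 32) - offset / 32;

   printf("%d %d\n", chunks_a, chunks_b);         /* both print 2 */
   return 0;
}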

Fine by me.  R-b

--Jason

>
>>> +         const int chunks = (end - start) / 32;
>>> +
>>>    /* TODO: should we count uses in loops as higher benefit? */
>>>
>>>    struct ubo_block_info *info = get_block_info(state, block);
>>> -         info->offsets |= 1ull << offset;
>>> +         info->offsets |= ((1ull << chunks) - 1) << offset;
>>>    info->uses[offset]++;
>>> }
>>> }
>>> --
>>> 2.17.0
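As a sanity check on the new mask update, a small standalone sketch
(illustrative, not the pass itself) applying it to the commit message's
example, where the old code would have set only bit 0:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
   uint64_t offsets = 0;
   const int offset = 0;   /* first 32B chunk touched (20 / 32) */
   const int chunks = 2;   /* the vec4 spans chunks 0 and 1 */

   /* Old code set a single bit: offsets |= 1ull << offset;
    * The fix sets a run of `chunks` bits starting at `offset`. */
   offsets |= ((1ull << chunks) - 1) << offset;

   printf("0x%llx\n", (unsigned long long)offsets);  /* 0x3: chunks 0 and 1 */
   return 0;
}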
