[Mesa-dev] [PATCH v2 15/30] i965/fs: support doubles with UBO loads
Francisco Jerez
currojerez at riseup.net
Sat May 14 22:13:52 UTC 2016
Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> On 14/05/16 01:16, Francisco Jerez wrote:
>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>>
>>> From: Iago Toral Quiroga <itoral at igalia.com>
>>>
>>> UBO loads with constant offset use the UNIFORM_PULL_CONSTANT_LOAD
>>> instruction, which reads 16 bytes (a vec4) of data from memory. For dvec
>>> types this only provides components x and y. Thus, if we are reading
>>> more than 2 components we need to issue a second load at offset+16 to
>>> read the next 16-byte chunk with components z and w.
>>>
>>> UBO loads with non-constant offset emit a load for each component
>>> in the vector (and rely on CSE to remove redundant loads), so we only
>>> need to consider the size of the data type when computing the offset
>>> of each element in a vector.
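>>>
>>> For example, a dvec4 load from constant offset 32 becomes two loads:
>>> one at offset 32 returning components x and y, and a second one at
>>> offset 48 returning components z and w.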
>>>
>>> v2 (Sam):
>>> - Adapt the code to use component() (Curro).
>>>
>>> Signed-off-by: Samuel Iglesias Gonsálvez <siglesias at igalia.com>
>>> Reviewed-by: Kenneth Graunke <kenneth at whitecape.org>
>>> ---
>>> src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 52 +++++++++++++++++++++++++++-----
>>> 1 file changed, 45 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>> index 2d57fd3..02f1e81 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>> @@ -3362,6 +3362,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>> nir->info.num_ubos - 1);
>>> }
>>>
>>> + /* Number of 32-bit slots in the type */
>>> + unsigned type_slots = MAX2(1, type_sz(dest.type) / 4);
>>> +
>>> nir_const_value *const_offset = nir_src_as_const_value(instr->src[1]);
>>> if (const_offset == NULL) {
>>> fs_reg base_offset = retype(get_nir_src(instr->src[1]),
>>> @@ -3369,19 +3372,54 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>>
>>> for (int i = 0; i < instr->num_components; i++)
>>> VARYING_PULL_CONSTANT_LOAD(bld, offset(dest, bld, i), surf_index,
>>> - base_offset, i * 4);
>>> + base_offset, i * 4 * type_slots);
>>
>> Why not 'i * type_sz(...)'?  As before, it seems like type_slots is
>> just going to introduce rounding errors here for no benefit -- for
>> 64-bit types 'i * 4 * type_slots' equals 'i * type_sz(...)' anyway, and
>> for any future type narrower than 4 bytes the type_slots expression
>> would silently round the per-component stride up to 4 bytes.
>>
>
> Right, I will fix it.
>
>>> } else {
>>> + /* Even if we are loading doubles, a pull constant load will load
>>> + * a 32-bit vec4, so we should only reserve vgrf space for that. If we
>>> + * need to load a full dvec4 we will have to emit 2 loads. This is
>>> + * similar to demote_pull_constants(), except that in that case we
>>> + * see individual accesses to each component of the vector and then
>>> + * we let CSE deal with duplicate loads. Here we see a vector access
>>> + * and we have to split it if necessary.
>>> + */
>>> fs_reg packed_consts = vgrf(glsl_type::float_type);
>>> packed_consts.type = dest.type;
>>>
>>> - struct brw_reg const_offset_reg = brw_imm_ud(const_offset->u32[0] & ~15);
>>> - bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD, packed_consts,
>>> - surf_index, const_offset_reg);
>>> + unsigned const_offset_aligned = const_offset->u32[0] & ~15;
>>> +
>>> + /* A vec4 only contains half of a dvec4, so if we need more than 2
>>> + * components of a dvec4 we will have to issue another load for
>>> + * components z and w.
>>> + */
>>> + int num_components;
>>> + if (type_slots == 1)
>>> + num_components = instr->num_components;
>>> + else
>>> + num_components = MIN2(2, instr->num_components);
>>>
>>> - const fs_reg consts = byte_offset(packed_consts, const_offset->u32[0] % 16);
>>> + int remaining_components = instr->num_components;
>>> + while (remaining_components > 0) {
>>> + /* Read the vec4 from a 16-byte aligned offset */
>>> + struct brw_reg const_offset_reg = brw_imm_ud(const_offset_aligned);
>>> + bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD,
>>> + retype(packed_consts, BRW_REGISTER_TYPE_F),
>>> + surf_index, const_offset_reg);
>>>
>>> - for (unsigned i = 0; i < instr->num_components; i++)
>>> - bld.MOV(offset(dest, bld, i), component(consts, i));
>>> + const fs_reg consts = byte_offset(packed_consts, (const_offset->u32[0] % 16));
>>
>> This looks really fishy to me: if the initial offset is not 16B aligned
>> you'll apply the same sub-16B offset to the result of each of the
>> subsequent pull constant loads.
>
> This cannot happen thanks to the layout alignment rules, see below.
>
>> Also you don't seem to take into account whether the initial offset is
>> misaligned in the calculation of num_components -- if it is, it looks
>> like the first pull constant load could return fewer than
>> "num_components" usable components and you would end up reading past
>> the end of the return payload of the message.
>>
>
> This is not a problem thanks to std140 layout alignment rules. The only
> case where we get a 16B misalignment is when a bunch of scalars (or
> vec2s) are defined together.
>
> In the case of scalars, we are only going to pick one element per load,
> so there is no problem. In the case of vec2s, the second vec2 is at
> offset 8B and we only read two components in total, so there is no
> second pull constant load.
>
> When we are dealing with doubles it is the same but multiplied by two:
> the double case is like the vec2 one but reading a single double. dvec2
> is 16B aligned, dvec3 and dvec4 are 32B aligned, and everything else is
> 16B aligned (vec4 size).
>
> The std430 layout doesn't apply to UBOs, although it would not be a
> problem if it did in the future. In Mesa, the rest of the layouts
> (shared, packed) follow std140's alignment rules.
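>
> For example, with std140 a block like this (hypothetical declarations,
> byte offsets on the right) never makes us start a load at a sub-16B
> misaligned offset and read past its 16B chunk:
>
> | float  a;  // offset  0
> | float  b;  // offset  4  (misaligned, but only one component is read)
> | vec2   c;  // offset  8  (misaligned, but both components fit in the chunk)
> | dvec2  d;  // offset 16  (16B aligned)
> | dvec4  e;  // offset 32  (32B aligned, split into two 16B loads)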
>
Oh, wow, that's making quite some assumptions about the behavior of the
front-end. It sounds like we could use some assertions verifying that
the assumptions you've made about the upper layers don't accidentally
break, e.g. somewhere close to the num_components computation:
| /* The computation of num_components doesn't take into account
|  * misalignment, which should be okay according to std140 vector
|  * alignment rules.
|  */
| assert(const_offset->u32[0] % 16 +
|        type_sz(dest.type) * num_components <= 16);
and inside the remaining_components while loop:
| /* XXX - This doesn't update the sub-16B offset across iterations of
|  * the loop, which should work for std140 vector alignment rules.
|  */
| assert(dest_offset == 0 || const_offset->u32[0] % 16 == 0);
Honestly, none of these assumptions seems necessary or particularly
helpful for the logic above -- explaining and verifying them seems
almost more effort (and is certainly more confusing) than handling
arbitrary offsets consistently. But because Ken seems rather impatient
to get FP64 support upstream, I suggest you just add the two assertions
above and my:
Reviewed-by: Francisco Jerez <currojerez at riseup.net>
to this patch and push it. As a follow-up clean-up you'd probably be
able to both simplify the iteration logic and handle arbitrary alignment
with less than half the code by doing something like:
| for (unsigned c = 0; c < total_components;) {
|    const fs_reg packed_consts = bld.vgrf(BRW_REGISTER_TYPE_F);
|    const unsigned base = base_offset_from_the_intrinsic + c * type_size;
|
|    /* Number of usable components we get from the next 16B-aligned load. */
|    const unsigned count = MIN2(total_components - c,
|                                (16 - base % 16) / type_size);
|
|    bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD,
|             packed_consts, surf_index, brw_imm_ud(base & ~15));
|
|    const fs_reg consts = retype(byte_offset(packed_consts, base & 15),
|                                 dest.type);
|
|    for (unsigned d = 0; d < count; d++)
|       bld.MOV(offset(dest, bld, c + d), component(consts, d));
|
|    c += count;
| }
This gets rid of two out of three induction variables you use in the
while loop; the second restriction disappears because you always have
the *actual* offset in bytes of the next constant to fetch available;
and the first restriction is handled by a single '- base % 16' term in
the definition of 'count'.
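For instance, tracing the loop for a hypothetical dvec3 starting at byte
offset 8 (type_size == 8, total_components == 3, deliberately misaligned
to show the general case):

| c = 0:  base =  8, count = MIN2(3, (16 - 8) / 8) = 1  /* x from chunk 0   */
| c = 1:  base = 16, count = MIN2(2, (16 - 0) / 8) = 2  /* y, z from chunk 1 */
| c = 3:  loop terminates after two pull loads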
Another possibility to consider is to ditch all of the above and extend
the generator so it's able to fetch a whole 32B block of constants at a
time with a single message, which would let you handle this for 64-bit
types in exactly the same way that the back-end currently handles 32-bit
constant loads, with a single pull load per dvec4.
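On the NIR translation side that could look roughly like the sketch
below -- keep in mind that the extra message-size source for
FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD and the matching generator support
are hypothetical, neither exists today:

| /* Hypothetical: pass the fetch size in bytes as an extra source so the
|  * generator can emit a single 32B oword block read for 64-bit types.
|  */
| const unsigned size = ALIGN(const_offset->u32[0] % 16 +
|                             type_sz(dest.type) * instr->num_components, 16);
| bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD, packed_consts, surf_index,
|          brw_imm_ud(const_offset->u32[0] & ~15), brw_imm_ud(size));
|
| const fs_reg consts = retype(byte_offset(packed_consts,
|                                          const_offset->u32[0] % 16),
|                              dest.type);
| for (unsigned i = 0; i < instr->num_components; i++)
|    bld.MOV(offset(dest, bld, i), component(consts, i));

packed_consts would of course also need to be allocated enough VGRF
space for the larger response.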
> Sam
>
>>> + unsigned dest_offset = instr->num_components - remaining_components;
>>> +
>>> + for (int i = 0; i < num_components; i++)
>>> + bld.MOV(offset(dest, bld, i + dest_offset), component(consts, i));
>>> +
>>> + /* If this is a large enough 64-bit load, we will need to emit
>>> + * another message.
>>> + */
>>> + remaining_components -= num_components;
>>> + assert(remaining_components == 0 ||
>>> + (remaining_components <= 2 && type_slots == 2));
>>> + num_components = remaining_components;
>>> + const_offset_aligned += 16;
>>> + }
>>> }
>>> break;
>>> }
>>> --
>>> 2.5.0
>>>