[Mesa-dev] [PATCH v2 15/30] i965/fs: support doubles with UBO loads
Samuel Iglesias Gonsálvez
siglesias at igalia.com
Mon May 16 08:54:28 UTC 2016
On 15/05/16 00:13, Francisco Jerez wrote:
> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>
>> On 14/05/16 01:16, Francisco Jerez wrote:
>>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>>>
>>>> From: Iago Toral Quiroga <itoral at igalia.com>
>>>>
>>>> UBO loads with constant offset use the UNIFORM_PULL_CONSTANT_LOAD
>>>> instruction, which reads 16 bytes (a vec4) of data from memory. For dvec
>>>> types this only provides components x and y. Thus, if we are reading
>>>> more than 2 components we need to issue a second load at offset+16 to
>>>> read the next 16-byte chunk with components z and w.
>>>>
>>>> UBO loads with non-constant offset emit a load for each component
>>>> in the vector (and rely on CSE to remove redundant loads), so we only
>>>> need to consider the size of the data type when computing the offset
>>>> of each element in a vector.
>>>>
>>>> v2 (Sam):
>>>> - Adapt the code to use component() (Curro).
>>>>
>>>> Signed-off-by: Samuel Iglesias Gonsálvez <siglesias at igalia.com>
>>>> Reviewed-by: Kenneth Graunke <kenneth at whitecape.org>
>>>> ---
>>>> src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 52 +++++++++++++++++++++++++++-----
>>>> 1 file changed, 45 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>> index 2d57fd3..02f1e81 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>> @@ -3362,6 +3362,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>>> nir->info.num_ubos - 1);
>>>> }
>>>>
>>>> + /* Number of 32-bit slots in the type */
>>>> + unsigned type_slots = MAX2(1, type_sz(dest.type) / 4);
>>>> +
>>>> nir_const_value *const_offset = nir_src_as_const_value(instr->src[1]);
>>>> if (const_offset == NULL) {
>>>> fs_reg base_offset = retype(get_nir_src(instr->src[1]),
>>>> @@ -3369,19 +3372,54 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>>>
>>>> for (int i = 0; i < instr->num_components; i++)
>>>> VARYING_PULL_CONSTANT_LOAD(bld, offset(dest, bld, i), surf_index,
>>>> - base_offset, i * 4);
>>>> + base_offset, i * 4 * type_slots);
>>>
>>> Why not 'i * type_sz(...)'? As before, it seems like type_slots is just
>>> going to introduce rounding errors here for no benefit?
>>>
>>
>> Right, I will fix it.
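>>
>> That is, something like the following (just a sketch of the intended
>> change, untested):
>>
>>    for (int i = 0; i < instr->num_components; i++)
>>       VARYING_PULL_CONSTANT_LOAD(bld, offset(dest, bld, i), surf_index,
>>                                  base_offset, i * type_sz(dest.type));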
>>
>>>> } else {
>>>> + /* Even if we are loading doubles, a pull constant load will load
>>>> + * a 32-bit vec4, so we should only reserve vgrf space for that. If we
>>>> + * need to load a full dvec4 we will have to emit 2 loads. This is
>>>> + * similar to demote_pull_constants(), except that in that case we
>>>> + * see individual accesses to each component of the vector and then
>>>> + * we let CSE deal with duplicate loads. Here we see a vector access
>>>> + * and we have to split it if necessary.
>>>> + */
>>>> fs_reg packed_consts = vgrf(glsl_type::float_type);
>>>> packed_consts.type = dest.type;
>>>>
>>>> - struct brw_reg const_offset_reg = brw_imm_ud(const_offset->u32[0] & ~15);
>>>> - bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD, packed_consts,
>>>> - surf_index, const_offset_reg);
>>>> + unsigned const_offset_aligned = const_offset->u32[0] & ~15;
>>>> +
>>>> + /* A vec4 only contains half of a dvec4; if we need more than 2
>>>> + * components of a dvec4 we will have to issue another load for
>>>> + * components z and w.
>>>> + */
>>>> + int num_components;
>>>> + if (type_slots == 1)
>>>> + num_components = instr->num_components;
>>>> + else
>>>> + num_components = MIN2(2, instr->num_components);
>>>>
>>>> - const fs_reg consts = byte_offset(packed_consts, const_offset->u32[0] % 16);
>>>> + int remaining_components = instr->num_components;
>>>> + while (remaining_components > 0) {
>>>> + /* Read the vec4 from a 16-byte aligned offset */
>>>> + struct brw_reg const_offset_reg = brw_imm_ud(const_offset_aligned);
>>>> + bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD,
>>>> + retype(packed_consts, BRW_REGISTER_TYPE_F),
>>>> + surf_index, const_offset_reg);
>>>>
>>>> - for (unsigned i = 0; i < instr->num_components; i++)
>>>> - bld.MOV(offset(dest, bld, i), component(consts, i));
>>>> + const fs_reg consts = byte_offset(packed_consts, (const_offset->u32[0] % 16));
>>>
>>> This looks really fishy to me: if the initial offset is not 16B aligned
>>> you'll apply the same sub-16B offset to the result from each one of the
>>> subsequent pull constant loads.
>>
>> This cannot happen thanks to the layout alignment rules, see below.
>>
>>> Also you don't seem to take into
>>> account whether the initial offset is misaligned in the calculation of
>>> num_components -- if it is, it looks like the first pull constant load
>>> could return less than "num_components" usable components and you would
>>> end up reading past the end of the return payload of the message.
>>>
>>
>> This is not a problem thanks to std140 layout alignment rules. The only
>> case where we have a 16B misalignment is when we have a bunch of
>> scalars (or vec2s) defined together.
>>
>> In the case of scalars, we are going to pick only one element each
>> time, so there is no problem. In the case of vec2s, the second vec2
>> is at offset 8B and we only read two components in total, so there
>> isn't a second pull constant load.
>>
>> When we are dealing with doubles it is the same but multiplied by two:
>> the double case would be like the vec2 one but reading one double.
>> dvec2 is 16B aligned, dvec3 and dvec4 are 32B aligned, and everything
>> else is 16B aligned (vec4 size).
>>
>> The std430 layout doesn't apply to UBOs, although it would not be a
>> problem if it did in the future. In Mesa, the rest of the layouts
>> (shared, packed) follow std140's alignment rules.
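>>
>> For example, with a hypothetical block like this (offsets as given by
>> the std140 rules, just for illustration):
>>
>>    layout(std140) uniform U {
>>       double a;  /* offset 0 */
>>       double b;  /* offset 8: 16B-misaligned, but a single component */
>>       dvec2  c;  /* offset 16: 16B aligned */
>>       dvec4  d;  /* offset 32: 32B aligned, needs two 16B loads */
>>    };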
>>
> Oh, wow, that's making quite some assumptions about the behavior of the
> front-end. Sounds like we could use some assertions verifying that the
> assumptions you've made about the upper layers don't accidentally break,
> e.g. somewhere close to the num_components computation:
>
> | /* The computation of num_components doesn't take into account
> | * misalignment, which should be okay according to std140 vector
> | * alignment rules.
> | */
> | assert(const_offset->u32[0] % 16 +
> | type_sz(dest.type) * num_components <= 16);
>
> and inside the remaining_components while loop:
>
> | /* XXX - This doesn't update the sub-16B offset across iterations of
> | * the loop, which should work for std140 vector alignment rules.
> | */
> | assert(dest_offset == 0 || const_offset->u32[0] % 16 == 0);
>
> Honestly, none of these assumptions seem necessary or particularly
> helpful for the logic above -- Explaining and verifying them seems
> almost more effort (and is certainly more confusing) than handling
> arbitrary offsets consistently. But because Ken seems rather impatient
> to get FP64 support upstream, I suggest you just add the two
> assertions above and my:
>
> Reviewed-by: Francisco Jerez <currojerez at riseup.net>
>
Thanks a lot Curro, I added those asserts and your R-b.
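FWIW, they end up roughly like this in the final version (placement as
you suggested, though the exact context may differ slightly in the
pushed patch): the first one right after computing num_components:

   /* The computation of num_components doesn't take into account
    * misalignment, which should be okay according to std140 vector
    * alignment rules.
    */
   assert(const_offset->u32[0] % 16 +
          type_sz(dest.type) * num_components <= 16);

and the second one inside the while loop, after dest_offset is
computed:

   /* XXX - This doesn't update the sub-16B offset across iterations of
    * the loop, which should work for std140 vector alignment rules.
    */
   assert(dest_offset == 0 || const_offset->u32[0] % 16 == 0);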
> to this patch and push it. As a follow-up clean-up you'd probably be
> able to both simplify the iteration logic and handle arbitrary alignment
> with less than half the code by doing something like:
>
> | for (unsigned c = 0; c < total_components;) {
> | const fs_reg packed_consts = bld.vgrf(BRW_REGISTER_TYPE_F);
> | const unsigned base = base_offset_from_the_intrinsic + c * type_size;
> |
> | /* Number of usable components we get from the next 16B-aligned load. */
> | const unsigned count = min(total_components - c, (16 - base % 16) / type_size);
> |
> | bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD,
> | packed_consts, surf_index, brw_imm_ud(base & ~15));
> |
> | const fs_reg consts = retype(byte_offset(packed_consts, base & 15),
> | dest.type);
> |
> | for (unsigned d = 0; d < count; d++)
> | bld.MOV(offset(dest, bld, c + d), component(consts, d));
> |
> | c += count;
> | }
>
> This gets rid of two out of three induction variables you use in the
> while loop; the second restriction disappears because you always have
> the *actual* offset in bytes of the next constant to fetch available,
> and the first restriction is handled by a single '- base % 16' term in
> the definition of 'count'.
>
> Another possibility to consider is to ditch all the above and extend the
> generator so it's able to fetch a whole 32B block of constants at a time
> with a single message, which would let you handle this for 64-bit types
> in exactly the same way that the back-end currently handles 32-bit
> constant loads with a single pull load per dvec4.
>
OK, we will work on this in the follow-up patch series.
Thanks again!
Sam
>> Sam
>>
>>>> + unsigned dest_offset = instr->num_components - remaining_components;
>>>> +
>>>> + for (int i = 0; i < num_components; i++)
>>>> + bld.MOV(offset(dest, bld, i + dest_offset), component(consts, i));
>>>> +
>>>> + /* If this is a large enough 64-bit load, we will need to emit
>>>> + * another message.
>>>> + */
>>>> + remaining_components -= num_components;
>>>> + assert(remaining_components == 0 ||
>>>> + (remaining_components <= 2 && type_slots == 2));
>>>> + num_components = remaining_components;
>>>> + const_offset_aligned += 16;
>>>> + }
>>>> }
>>>> break;
>>>> }
>>>> --
>>>> 2.5.0
>>>>
>>>> _______________________________________________
>>>> mesa-dev mailing list
>>>> mesa-dev at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev