[Mesa-dev] [PATCH v2 15/30] i965/fs: support doubles with UBO loads
Samuel Iglesias Gonsálvez
siglesias at igalia.com
Mon May 16 08:54:28 UTC 2016
On 15/05/16 00:13, Francisco Jerez wrote:
> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>
>> On 14/05/16 01:16, Francisco Jerez wrote:
>>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>>>
>>>> From: Iago Toral Quiroga <itoral at igalia.com>
>>>>
>>>> UBO loads with constant offset use the UNIFORM_PULL_CONSTANT_LOAD
>>>> instruction, which reads 16 bytes (a vec4) of data from memory. For dvec
>>>> types this only provides components x and y. Thus, if we are reading
>>>> more than 2 components we need to issue a second load at offset+16 to
>>>> read the next 16-byte chunk with components z and w.
>>>>
>>>> UBO loads with non-constant offset emit a load for each component
>>>> in the vector (and rely on CSE to remove redundant loads), so we only
>>>> need to consider the size of the data type when computing the offset
>>>> of each element in a vector.
>>>>
>>>> v2 (Sam):
>>>> - Adapt the code to use component() (Curro).
>>>>
>>>> Signed-off-by: Samuel Iglesias Gonsálvez <siglesias at igalia.com>
>>>> Reviewed-by: Kenneth Graunke <kenneth at whitecape.org>
>>>> ---
>>>> src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 52 +++++++++++++++++++++++++++-----
>>>> 1 file changed, 45 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>> index 2d57fd3..02f1e81 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>> @@ -3362,6 +3362,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>>> nir->info.num_ubos - 1);
>>>> }
>>>>
>>>> + /* Number of 32-bit slots in the type */
>>>> + unsigned type_slots = MAX2(1, type_sz(dest.type) / 4);
>>>> +
>>>> nir_const_value *const_offset = nir_src_as_const_value(instr->src[1]);
>>>> if (const_offset == NULL) {
>>>> fs_reg base_offset = retype(get_nir_src(instr->src[1]),
>>>> @@ -3369,19 +3372,54 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>>>
>>>> for (int i = 0; i < instr->num_components; i++)
>>>> VARYING_PULL_CONSTANT_LOAD(bld, offset(dest, bld, i), surf_index,
>>>> - base_offset, i * 4);
>>>> + base_offset, i * 4 * type_slots);
>>>
>>> Why not 'i * type_sz(...)'? As before, it seems like type_slots is just
>>> going to introduce rounding errors here for no benefit?
>>>
>>
>> Right, I will fix it.
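>>
>> That is, something like the following (just a sketch of the intended
>> change, untested):
>>
>>    for (int i = 0; i < instr->num_components; i++)
>>       VARYING_PULL_CONSTANT_LOAD(bld, offset(dest, bld, i), surf_index,
>>                                  base_offset, i * type_sz(dest.type));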
>>
>>>> } else {
>>>> + /* Even if we are loading doubles, a pull constant load will load
>>>> + * a 32-bit vec4, so we should only reserve vgrf space for that. If we
>>>> + * need to load a full dvec4 we will have to emit 2 loads. This is
>>>> + * similar to demote_pull_constants(), except that in that case we
>>>> + * see individual accesses to each component of the vector and then
>>>> + * we let CSE deal with duplicate loads. Here we see a vector access
>>>> + * and we have to split it if necessary.
>>>> + */
>>>> fs_reg packed_consts = vgrf(glsl_type::float_type);
>>>> packed_consts.type = dest.type;
>>>>
>>>> - struct brw_reg const_offset_reg = brw_imm_ud(const_offset->u32[0] & ~15);
>>>> - bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD, packed_consts,
>>>> - surf_index, const_offset_reg);
>>>> + unsigned const_offset_aligned = const_offset->u32[0] & ~15;
>>>> +
>>>> + /* A vec4 only contains half of a dvec4; if we need more than 2
>>>> + * components of a dvec4 we will have to issue another load for
>>>> + * components z and w.
>>>> + */
>>>> + int num_components;
>>>> + if (type_slots == 1)
>>>> + num_components = instr->num_components;
>>>> + else
>>>> + num_components = MIN2(2, instr->num_components);
>>>>
>>>> - const fs_reg consts = byte_offset(packed_consts, const_offset->u32[0] % 16);
>>>> + int remaining_components = instr->num_components;
>>>> + while (remaining_components > 0) {
>>>> + /* Read the vec4 from a 16-byte aligned offset */
>>>> + struct brw_reg const_offset_reg = brw_imm_ud(const_offset_aligned);
>>>> + bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD,
>>>> + retype(packed_consts, BRW_REGISTER_TYPE_F),
>>>> + surf_index, const_offset_reg);
>>>>
>>>> - for (unsigned i = 0; i < instr->num_components; i++)
>>>> - bld.MOV(offset(dest, bld, i), component(consts, i));
>>>> + const fs_reg consts = byte_offset(packed_consts, (const_offset->u32[0] % 16));
>>>
>>> This looks really fishy to me: if the initial offset is not 16B aligned
>>> you'll apply the same sub-16B offset to the result from each one of the
>>> subsequent pull constant loads.
>>
>> This cannot happen thanks to the layout alignment rules, see below.
>>
>>> Also you don't seem to take into
>>> account whether the initial offset is misaligned in the calculation of
>>> num_components -- if it is, it looks like the first pull constant load
>>> could return less than "num_components" usable components and you would
>>> end up reading past the end of the return payload of the message.
>>>
>>
>> This is not a problem thanks to std140 layout alignment rules. The only
>> case where we have a 16B misalignment is when we have a bunch of
>> scalars (or vec2s) defined together.
>>
>> In the case of scalars, we are going to pick only one element each
>> time, so there is no problem. In the case of vec2s, the second vec2
>> is at offset 8B and we only read two components in total, so there
>> isn't a second pull constant load.
>>
>> When we are dealing with doubles it is the same but multiplied by two:
>> the double case would be like the vec2 one but reading one double.
>> dvec2 is 16B aligned, dvec3 and dvec4 are 32B aligned, and everything
>> else is 16B aligned (vec4 size).
>>
>> The std430 layout doesn't apply to UBOs, although it would not be a
>> problem if it did in the future. In Mesa, the rest of the layouts
>> (shared, packed) follow std140's alignment rules.
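>>
>> For example, with a hypothetical block like this (offsets as given by
>> the std140 rules, just for illustration):
>>
>>    layout(std140) uniform U {
>>       double a;  /* offset 0 */
>>       double b;  /* offset 8: 16B-misaligned, but a single component */
>>       dvec2  c;  /* offset 16: 16B aligned */
>>       dvec4  d;  /* offset 32: 32B aligned, needs two 16B loads */
>>    };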
>>
> Oh, wow, that's making quite some assumptions about the behavior of the
> front-end. Sounds like we could use some assertions verifying that the
> assumptions you've made about the upper layers don't accidentally break,
> e.g. somewhere close to the num_components computation:
>
> | /* The computation of num_components doesn't take into account
> | * misalignment, which should be okay according to std140 vector
> | * alignment rules.
> | */
> | assert(const_offset->u32[0] % 16 +
> | type_sz(dest.type) * num_components <= 16);
>
> and inside the remaining_components while loop:
>
> | /* XXX - This doesn't update the sub-16B offset across iterations of
> | * the loop, which should work for std140 vector alignment rules.
> | */
> | assert(dest_offset == 0 || const_offset->u32[0] % 16 == 0);
>
> Honestly, none of these assumptions seem necessary or particularly
> helpful for the logic above -- Explaining and verifying them seems
> almost more effort (and is certainly more confusing) than handling
> arbitrary offsets consistently. But because Ken seems rather impatient
> to get FP64 support upstream, I suggest you just add the two
> assertions above and my:
>
> Reviewed-by: Francisco Jerez <currojerez at riseup.net>
>
Thanks a lot Curro, I added those asserts and your R-b.
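FWIW, they end up roughly like this in the final version (placement as
you suggested, though the exact context may differ slightly in the
pushed patch): the first one right after computing num_components:

   /* The computation of num_components doesn't take into account
    * misalignment, which should be okay according to std140 vector
    * alignment rules.
    */
   assert(const_offset->u32[0] % 16 +
          type_sz(dest.type) * num_components <= 16);

and the second one inside the while loop, after dest_offset is
computed:

   /* XXX - This doesn't update the sub-16B offset across iterations of
    * the loop, which should work for std140 vector alignment rules.
    */
   assert(dest_offset == 0 || const_offset->u32[0] % 16 == 0);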
> to this patch and push it. As a follow-up clean-up you'd probably be
> able to both simplify the iteration logic and handle arbitrary alignment
> with less than half the code by doing something like:
>
> | for (unsigned c = 0; c < total_components;) {
> | const fs_reg packed_consts = bld.vgrf(BRW_REGISTER_TYPE_F);
> | const unsigned base = base_offset_from_the_intrinsic + c * type_size;
> |
> | /* Number of usable components we get from the next 16B-aligned load. */
> | const unsigned count = min(total_components - c, (16 - base % 16) / type_size);
> |
> | bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD,
> | packed_consts, surf_index, brw_imm_ud(base & ~15));
> |
> | const fs_reg consts = retype(byte_offset(packed_consts, base & 15),
> | dest.type);
> |
> | for (unsigned d = 0; d < count; d++)
> | bld.MOV(offset(dest, bld, c + d), component(consts, d));
> |
> | c += count;
> | }
>
> This gets rid of two out of three induction variables you use in the
> while loop; the second restriction disappears because you always have
> the *actual* offset in bytes of the next constant to fetch available,
> and the first restriction is handled by a single '- base % 16' term in
> the definition of 'count'.
>
> Another possibility to consider is to ditch all the above and extend the
> generator so it's able to fetch a whole 32B block of constants at a time
> with a single message, which would let you handle this for 64-bit types
> in exactly the same way that the back-end currently handles 32-bit
> constant loads with a single pull load per dvec4.
>
OK, we will work on this in the follow-up patch series.
Thanks again!
Sam
>> Sam
>>
>>>> + unsigned dest_offset = instr->num_components - remaining_components;
>>>> +
>>>> + for (int i = 0; i < num_components; i++)
>>>> + bld.MOV(offset(dest, bld, i + dest_offset), component(consts, i));
>>>> +
>>>> + /* If this is a large enough 64-bit load, we will need to emit
>>>> + * another message.
>>>> + */
>>>> + remaining_components -= num_components;
>>>> + assert(remaining_components == 0 ||
>>>> + (remaining_components <= 2 && type_slots == 2));
>>>> + num_components = remaining_components;
>>>> + const_offset_aligned += 16;
>>>> + }
>>>> }
>>>> break;
>>>> }
>>>> --
>>>> 2.5.0
>>>>
>>>> _______________________________________________
>>>> mesa-dev mailing list
>>>> mesa-dev at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev