[Mesa-dev] [PATCH v2 15/30] i965/fs: support doubles with UBO loads
Francisco Jerez
currojerez at riseup.net
Sat May 14 22:13:52 UTC 2016
Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> On 14/05/16 01:16, Francisco Jerez wrote:
>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>>
>>> From: Iago Toral Quiroga <itoral at igalia.com>
>>>
>>> UBO loads with constant offset use the UNIFORM_PULL_CONSTANT_LOAD
>>> instruction, which reads 16 bytes (a vec4) of data from memory. For dvec
>>> types this only provides components x and y. Thus, if we are reading
>>> more than 2 components we need to issue a second load at offset+16 to
>>> read the next 16-byte chunk with components z and w.
>>>
>>> UBO loads with non-constant offset emit a load for each component
>>> in the vector (and rely on CSE to remove redundant loads), so we only
>>> need to consider the size of the data type when computing the offset
>>> of each element in a vector.
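>>>
>>> For example, a dvec4 load from constant offset 32 becomes two loads:
>>> one at offset 32 returning components x and y, and a second one at
>>> offset 48 returning components z and w.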
>>>
>>> v2 (Sam):
>>> - Adapt the code to use component() (Curro).
>>>
>>> Signed-off-by: Samuel Iglesias Gonsálvez <siglesias at igalia.com>
>>> Reviewed-by: Kenneth Graunke <kenneth at whitecape.org>
>>> ---
>>> src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 52 +++++++++++++++++++++++++++-----
>>> 1 file changed, 45 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>> index 2d57fd3..02f1e81 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>> @@ -3362,6 +3362,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>> nir->info.num_ubos - 1);
>>> }
>>>
>>> + /* Number of 32-bit slots in the type */
>>> + unsigned type_slots = MAX2(1, type_sz(dest.type) / 4);
>>> +
>>> nir_const_value *const_offset = nir_src_as_const_value(instr->src[1]);
>>> if (const_offset == NULL) {
>>> fs_reg base_offset = retype(get_nir_src(instr->src[1]),
>>> @@ -3369,19 +3372,54 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>>
>>> for (int i = 0; i < instr->num_components; i++)
>>> VARYING_PULL_CONSTANT_LOAD(bld, offset(dest, bld, i), surf_index,
>>> - base_offset, i * 4);
>>> + base_offset, i * 4 * type_slots);
>>
>> Why not 'i * type_sz(...)'?  As before, it seems like type_slots is
>> just going to introduce rounding errors here for no benefit -- for
>> 64-bit types 'i * 4 * type_slots' equals 'i * type_sz(...)' anyway, and
>> for any future type narrower than 4 bytes the type_slots expression
>> would silently round the per-component stride up to 4 bytes.
>>
>
> Right, I will fix it.
>
>>> } else {
>>> + /* Even if we are loading doubles, a pull constant load will load
>>> + * a 32-bit vec4, so we should only reserve vgrf space for that. If we
>>> + * need to load a full dvec4 we will have to emit 2 loads. This is
>>> + * similar to demote_pull_constants(), except that in that case we
>>> + * see individual accesses to each component of the vector and then
>>> + * we let CSE deal with duplicate loads. Here we see a vector access
>>> + * and we have to split it if necessary.
>>> + */
>>> fs_reg packed_consts = vgrf(glsl_type::float_type);
>>> packed_consts.type = dest.type;
>>>
>>> - struct brw_reg const_offset_reg = brw_imm_ud(const_offset->u32[0] & ~15);
>>> - bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD, packed_consts,
>>> - surf_index, const_offset_reg);
>>> + unsigned const_offset_aligned = const_offset->u32[0] & ~15;
>>> +
>>> + /* A vec4 only contains half of a dvec4, so if we need more than 2
>>> + * components of a dvec4 we will have to issue another load for
>>> + * components z and w.
>>> + */
>>> + int num_components;
>>> + if (type_slots == 1)
>>> + num_components = instr->num_components;
>>> + else
>>> + num_components = MIN2(2, instr->num_components);
>>>
>>> - const fs_reg consts = byte_offset(packed_consts, const_offset->u32[0] % 16);
>>> + int remaining_components = instr->num_components;
>>> + while (remaining_components > 0) {
>>> + /* Read the vec4 from a 16-byte aligned offset */
>>> + struct brw_reg const_offset_reg = brw_imm_ud(const_offset_aligned);
>>> + bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD,
>>> + retype(packed_consts, BRW_REGISTER_TYPE_F),
>>> + surf_index, const_offset_reg);
>>>
>>> - for (unsigned i = 0; i < instr->num_components; i++)
>>> - bld.MOV(offset(dest, bld, i), component(consts, i));
>>> + const fs_reg consts = byte_offset(packed_consts, (const_offset->u32[0] % 16));
>>
>> This looks really fishy to me: if the initial offset is not 16B aligned
>> you'll apply the same sub-16B offset to the result of each of the
>> subsequent pull constant loads.
>
> This cannot happen thanks to the layout alignment rules, see below.
>
>> Also you don't seem to take into account whether the initial offset is
>> misaligned in the calculation of num_components -- if it is, it looks
>> like the first pull constant load could return fewer than
>> "num_components" usable components and you would end up reading past
>> the end of the return payload of the message.
>>
>
> This is not a problem thanks to std140 layout alignment rules. The only
> case where we get a 16B misalignment is when a bunch of scalars (or
> vec2s) are defined together.
>
> In the case of scalars, we are only going to pick one element per load,
> so there is no problem. In the case of vec2s, the second vec2 is at
> offset 8B and we only read two components in total, so there is no
> second pull constant load.
>
> When we are dealing with doubles it is the same but multiplied by two:
> the double case is like the vec2 one but reading a single double. dvec2
> is 16B aligned, dvec3 and dvec4 are 32B aligned, and everything else is
> 16B aligned (vec4 size).
>
> The std430 layout doesn't apply to UBOs, although it would not be a
> problem if it did in the future. In Mesa, the rest of the layouts
> (shared, packed) follow std140's alignment rules.
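>
> For example, with std140 a block like this (hypothetical declarations,
> byte offsets on the right) never makes us start a load at a sub-16B
> misaligned offset and read past its 16B chunk:
>
> | float  a;  // offset  0
> | float  b;  // offset  4  (misaligned, but only one component is read)
> | vec2   c;  // offset  8  (misaligned, but both components fit in the chunk)
> | dvec2  d;  // offset 16  (16B aligned)
> | dvec4  e;  // offset 32  (32B aligned, split into two 16B loads)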
>
Oh, wow, that's making quite some assumptions about the behavior of the
front-end. It sounds like we could use some assertions verifying that
the assumptions you've made about the upper layers don't accidentally
break, e.g. somewhere close to the num_components computation:
| /* The computation of num_components doesn't take into account
|  * misalignment, which should be okay according to std140 vector
|  * alignment rules.
|  */
| assert(const_offset->u32[0] % 16 +
|        type_sz(dest.type) * num_components <= 16);
and inside the remaining_components while loop:
| /* XXX - This doesn't update the sub-16B offset across iterations of
|  * the loop, which should work for std140 vector alignment rules.
|  */
| assert(dest_offset == 0 || const_offset->u32[0] % 16 == 0);
Honestly, none of these assumptions seems necessary or particularly
helpful for the logic above -- explaining and verifying them seems
almost more effort (and is certainly more confusing) than handling
arbitrary offsets consistently. But because Ken seems rather impatient
to get FP64 support upstream, I suggest you just add the two assertions
above and my:
Reviewed-by: Francisco Jerez <currojerez at riseup.net>
to this patch and push it. As a follow-up clean-up you'd probably be
able to both simplify the iteration logic and handle arbitrary alignment
with less than half the code by doing something like:
| for (unsigned c = 0; c < total_components;) {
|    const fs_reg packed_consts = bld.vgrf(BRW_REGISTER_TYPE_F);
|    const unsigned base = base_offset_from_the_intrinsic + c * type_size;
|
|    /* Number of usable components we get from the next 16B-aligned load. */
|    const unsigned count = MIN2(total_components - c,
|                                (16 - base % 16) / type_size);
|
|    bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD,
|             packed_consts, surf_index, brw_imm_ud(base & ~15));
|
|    const fs_reg consts = retype(byte_offset(packed_consts, base & 15),
|                                 dest.type);
|
|    for (unsigned d = 0; d < count; d++)
|       bld.MOV(offset(dest, bld, c + d), component(consts, d));
|
|    c += count;
| }
This gets rid of two out of three induction variables you use in the
while loop; the second restriction disappears because you always have
the *actual* offset in bytes of the next constant to fetch available;
and the first restriction is handled by a single '- base % 16' term in
the definition of 'count'.
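For instance, tracing the loop for a hypothetical dvec3 starting at byte
offset 8 (type_size == 8, total_components == 3, deliberately misaligned
to show the general case):

| c = 0:  base =  8, count = MIN2(3, (16 - 8) / 8) = 1  /* x from chunk 0   */
| c = 1:  base = 16, count = MIN2(2, (16 - 0) / 8) = 2  /* y, z from chunk 1 */
| c = 3:  loop terminates after two pull loads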
Another possibility to consider is to ditch all of the above and extend
the generator so it's able to fetch a whole 32B block of constants at a
time with a single message, which would let you handle this for 64-bit
types in exactly the same way that the back-end currently handles 32-bit
constant loads, with a single pull load per dvec4.
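On the NIR translation side that could look roughly like the sketch
below -- keep in mind that the extra message-size source for
FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD and the matching generator support
are hypothetical, neither exists today:

| /* Hypothetical: pass the fetch size in bytes as an extra source so the
|  * generator can emit a single 32B oword block read for 64-bit types.
|  */
| const unsigned size = ALIGN(const_offset->u32[0] % 16 +
|                             type_sz(dest.type) * instr->num_components, 16);
| bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD, packed_consts, surf_index,
|          brw_imm_ud(const_offset->u32[0] & ~15), brw_imm_ud(size));
|
| const fs_reg consts = retype(byte_offset(packed_consts,
|                                          const_offset->u32[0] % 16),
|                              dest.type);
| for (unsigned i = 0; i < instr->num_components; i++)
|    bld.MOV(offset(dest, bld, i), component(consts, i));

packed_consts would of course also need to be allocated enough VGRF
space for the larger response.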
> Sam
>
>>> + unsigned dest_offset = instr->num_components - remaining_components;
>>> +
>>> + for (int i = 0; i < num_components; i++)
>>> + bld.MOV(offset(dest, bld, i + dest_offset), component(consts, i));
>>> +
>>> + /* If this is a large enough 64-bit load, we will need to emit
>>> + * another message.
>>> + */
>>> + remaining_components -= num_components;
>>> + assert(remaining_components == 0 ||
>>> + (remaining_components <= 2 && type_slots == 2));
>>> + num_components = remaining_components;
>>> + const_offset_aligned += 16;
>>> + }
>>> }
>>> break;
>>> }
>>> --
>>> 2.5.0
>>>