[Mesa-dev] [PATCH v2 15/30] i965/fs: support doubles with UBO loads

Sat May 14 10:46:35 UTC 2016

On 14/05/16 01:16, Francisco Jerez wrote:
> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> 
>> From: Iago Toral Quiroga <itoral at igalia.com>
>>
>> UBO loads with constant offset use the UNIFORM_PULL_CONSTANT_LOAD
>> instruction, which reads 16 bytes (a vec4) of data from memory. For dvec
>> types this only provides components x and y. Thus, if we are reading
>> more than 2 components we need to issue a second load at offset+16 to
>> read the next 16-byte chunk with components w and z.
>>
>> UBO loads with non-constant offset emit a load for each component
>> in the vector (and rely in CSE to fix redundant loads), so we only
>> need to consider the size of the data type when computing the offset
>> of each element in a vector.
>>
>> v2 (Sam):
>> - Adapt the code to use component() (Curro).
>>
>> Signed-off-by: Samuel Iglesias Gonsálvez <siglesias at igalia.com>
>> Reviewed-by: Kenneth Graunke <kenneth at whitecape.org>
>> ---
>>  src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 52 +++++++++++++++++++++++++++-----
>>  1 file changed, 45 insertions(+), 7 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>> index 2d57fd3..02f1e81 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>> @@ -3362,6 +3362,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>                                 nir->info.num_ubos - 1);
>>        }
>>  
>> +      /* Number of 32-bit slots in the type */
>> +      unsigned type_slots = MAX2(1, type_sz(dest.type) / 4);
>> +
>>        nir_const_value *const_offset = nir_src_as_const_value(instr->src[1]);
>>        if (const_offset == NULL) {
>>           fs_reg base_offset = retype(get_nir_src(instr->src[1]),
>> @@ -3369,19 +3372,54 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>  
>>           for (int i = 0; i < instr->num_components; i++)
>>              VARYING_PULL_CONSTANT_LOAD(bld, offset(dest, bld, i), surf_index,
>> -                                       base_offset, i * 4);
>> +                                       base_offset, i * 4 * type_slots);
> 
> Why not 'i * type_sz(...)'?  As before it seems like type_slots is just
> going to introduce rounding errors here for no benefit?
> 

Right, I will fix it.

>>        } else {
>> +         /* Even if we are loading doubles, a pull constant load will load
>> +          * a 32-bit vec4, so should only reserve vgrf space for that. If we
>> +          * need to load a full dvec4 we will have to emit 2 loads. This is
>> +          * similar to demote_pull_constants(), except that in that case we
>> +          * see individual accesses to each component of the vector and then
>> +          * we let CSE deal with duplicate loads. Here we see a vector access
>> +          * and we have to split it if necessary.
>> +          */
>>           fs_reg packed_consts = vgrf(glsl_type::float_type);
>>           packed_consts.type = dest.type;
>>  
>> -         struct brw_reg const_offset_reg = brw_imm_ud(const_offset->u32[0] & ~15);
>> -         bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD, packed_consts,
>> -                  surf_index, const_offset_reg);
>> +         unsigned const_offset_aligned = const_offset->u32[0] & ~15;
>> +
>> +         /* A vec4 only contains half of a dvec4, if we need more than 2
>> +          * components of a dvec4 we will have to issue another load for
>> +          * components z and w
>> +          */
>> +         int num_components;
>> +         if (type_slots == 1)
>> +            num_components = instr->num_components;
>> +         else
>> +            num_components = MIN2(2, instr->num_components);
>>
>> -         const fs_reg consts = byte_offset(packed_consts, const_offset->u32[0] % 16);
>> +         int remaining_components = instr->num_components;
>> +         while (remaining_components > 0) {
>> +            /* Read the vec4 from a 16-byte aligned offset */
>> +            struct brw_reg const_offset_reg = brw_imm_ud(const_offset_aligned);
>> +            bld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD,
>> +                     retype(packed_consts, BRW_REGISTER_TYPE_F),
>> +                     surf_index, const_offset_reg);
>>  
>> -         for (unsigned i = 0; i < instr->num_components; i++)
>> -            bld.MOV(offset(dest, bld, i), component(consts, i));
>> +            const fs_reg consts = byte_offset(packed_consts, (const_offset->u32[0] % 16));
> 
> This looks really fishy to me, if the initial offset is not 16B aligned
> you'll apply the same sub-16B offset to the result from each one of the
> subsequent pull constant loads.

This cannot happen thanks to the layout alignment rules, see below.

> Also you don't seem to take into
> account whether the initial offset is misaligned in the calculation of
> num_components -- If it is it looks like the first pull constant load
> could return less than "num_components" usable components and you would
> end up reading past the end of the return payload of the message.
> 

This is not a problem thanks to std140 layout alignment rules. The only
case we have a 16B misalignment is when we have a bunch of scalars (or
vec2s) defined together.

In case of the scalars, we are going to pick only one element in each
case, so there is no problem. In case of the vec2s, the second vec2 is
at offset 8B and we only read two components in total, then there isn't
a second pull constant load.

When we are dealing with doubles is the same but multiply by two: the
double case would be like the vec2 one but reading one double. dvec2 is
16B aligned. dvec3 and dvec4 are 32B aligned. Everything else is aligned
16B (vec4 size).

std430 layout doesn't apply to UBOs, although it is not a problem if it
does in the future. In Mesa, the rest of layouts (shared, packed) follow
std140's alignment rules.

Sam

>> +            unsigned dest_offset = instr->num_components - remaining_components;
>> +
>> +            for (int i = 0; i < num_components; i++)
>> +               bld.MOV(offset(dest, bld, i + dest_offset), component(consts, i));
>> +
>> +            /* If this is a large enough 64-bit load, we will need to emit
>> +             * another message
>> +             */
>> +            remaining_components -= num_components;
>> +            assert(remaining_components == 0 ||
>> +                   (remaining_components <= 2 && type_slots == 2));
>> +            num_components = remaining_components;
>> +            const_offset_aligned += 16;
>> +         }
>>        }
>>        break;
>>     }
>> -- 
>> 2.5.0
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev