[Mesa-dev] [PATCH 09/12] i965/fs: make SIMD-splitting respect the original stride/offset

Mon Aug 17 01:27:00 PDT 2015

On Mon, Aug 17, 2015 at 12:09 AM, Pohjolainen, Topi
<topi.pohjolainen at intel.com> wrote:
> On Fri, Aug 14, 2015 at 03:30:18PM -0700, Connor Abbott wrote:
>> In some cases, we need to emit ALU instructions with a certain stride
>> due to a HW limitation. When splitting that instruction, we need to
>> respect the original stride when creating the temporaries we load from
>> and store into. Otherwise, we'll reintroduce the problem we were trying
>> to work around.
>>
>> Signed-off-by: Connor Abbott <connor.w.abbott at intel.com>
>> ---
>>  src/mesa/drivers/dri/i965/brw_fs.cpp | 14 ++++++++++----
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> index 812648f..386e9a2 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> @@ -2370,12 +2370,14 @@ fs_visitor::opt_register_renaming()
>>
>>        if (depth == 0 &&
>>            inst->dst.file == GRF &&
>> -          alloc.sizes[inst->dst.reg] == inst->exec_size / 8 &&
>> +          alloc.sizes[inst->dst.reg] ==
>> +            inst->dst.component_size(inst->exec_size) &&
>>            !inst->is_partial_write()) {
>>           if (remap[dst] == -1) {
>>              remap[dst] = dst;
>>           } else {
>> -            remap[dst] = alloc.allocate(inst->exec_size / 8);
>> +            remap[dst] =
>> +               alloc.allocate(inst->dst.component_size(inst->exec_size));
>>              inst->dst.reg = remap[dst];
>>              progress = true;
>>           }
>> @@ -4334,6 +4336,8 @@ fs_visitor::lower_simd_width()
>>                     * temporary passed as source to the lowered instruction.
>>                     */
>>                    split_inst.src[j] = lbld.vgrf(inst->src[j].type, src_size);
>> +                  split_inst.src[j].subreg_offset = inst->src[j].subreg_offset;
>
> The fixes for the wider component size (64-bits) and the stride are clearly
> needed for doubles. I'm wondering though why the sub-register offset hasn't
> caused us any problems before. That change is not needed just for doubles,
> is it?

I think we haven't noticed this before for two reasons:

1. It usually isn't needed for correctness.
2. We don't really use subreg_offset that much anyways.

The splitting code is moving the inputs into a bunch of temporary
registers, doing the split operation into a new bunch of temporaries,
and then moving those temporaries back into the destination register.
It's copying over the stride and subreg_offset fields from the
original inputs and outputs, so when all is said and done, it's
producing the same result and using the same sources. The difference
is that the split instructions themselves weren't copying over those
fields, meaning that they were always operating on packed registers
(stride of 1 and subreg_offset of 0). This is normally ok, except
where, due to a restriction, the split instruction *must* have the
same stride/offset as the original instruction. For fp64, DF->F
conversion must preserve the stride since we have to use a stride of 2
for the output. The only thing I can think of that has to preserve the
subreg_offset is math instructions, since the source offset has to be
the same as the destination offset, but that should be a corner case
and I'm not sure whether we'll ever get in a situation where a math
instruction has a non-zero subreg_offset.

But although I don't think it's necessary for correctness, but I think
it should help with register coalescing. For example, take
pack/unpackDouble2x32. I've implemented it in terms of two strided
MOV's, which need to be split by this pass. Say we have something
like:

uint hi = unpackDouble2x32(input).y;

in SIMD16 mode, this turns into:

mov(16) hi.1<2>:UD input:DF

But this needs to be split, and after the splitting pass (without
fixing up the subreg_offsest like this patch does) and lowering away
all the LOAD_PAYLOAD's, it gets turned into something like:

mov(8) temp1:DF input:DF 1Q
mov(8) temp2:DF input+1.0:DF 2Q
mov(8) temp3<2>:UD temp1:DF 1Q
mov(8) temp4<2>:UD temp2:DF 2Q
mov(8) hi+0.1<2>:UD temp3<2>:UD WE_all //The WE_all is due to a hack
in the SIMD lowering code
mov(8) hi+1.1<2>:UD temp4<2>:UD WE_all

Copy prop should have no problem cleaning up the first two MOV's, but
the last two are more of an issue -- we currently don't have a pass
that can detect that you can offset all reads/writes of temp3 and
temp4 by 1 and coalesce temp3 and temp4 into hi and hi+1, and it would
be really hard to write something like that without SSA. But if we
pass through the offset while splitting it up, then things match up:

mov(8) temp1:DF input:DF 1Q
mov(8) temp2:DF input+1.0:DF 2Q
mov(8) temp3+0.1<2>:UD temp1:DF 1Q
mov(8) temp4+0.1<2>:UD temp2:DF 2Q
mov(8) hi+0.1<2>:UD temp3+0.1<2>:UD WE_all
mov(8) hi+1.1<2>:UD temp4+0.1<2>:UD WE_all

and we should be able to coalesce away all the extra MOV's.

I should probably put some of this in the commit message...

Also, I realized that this patch should really be split in two, since
the fixes to register renaming are independent of this -- whoops!

>
>> +                  split_inst.src[j].stride = inst->src[j].stride;
>>                    emit_transpose(lbld.group(copy_width, 0),
>>                                   split_inst.src[j], &src, 1, src_size, n);
>>                 }
>> @@ -4343,8 +4347,10 @@ fs_visitor::lower_simd_width()
>>                 /* Allocate enough space to hold the result of the lowered
>>                  * instruction and fix up the number of registers written.
>>                  */
>> -               split_inst.dst = dsts[i] =
>> -                  lbld.vgrf(inst->dst.type, dst_size);
>> +               fs_reg dst = lbld.vgrf(inst->dst.type, dst_size);
>> +               dst.stride = inst->dst.stride;
>> +               dst.subreg_offset = inst->dst.subreg_offset;
>> +               split_inst.dst = dsts[i] = dst;
>>                 split_inst.regs_written =
>>                    DIV_ROUND_UP(inst->regs_written * lower_width,
>>                                 inst->exec_size);
>> --
>> 2.4.3
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev