[Mesa-dev] [PATCH 09/12] i965/fs: make SIMD-splitting respect the original stride/offset

Mon Aug 17 01:33:17 PDT 2015

On Mon, Aug 17, 2015 at 1:27 AM, Connor Abbott <cwabbott0 at gmail.com> wrote:
> On Mon, Aug 17, 2015 at 12:09 AM, Pohjolainen, Topi
> <topi.pohjolainen at intel.com> wrote:
>> On Fri, Aug 14, 2015 at 03:30:18PM -0700, Connor Abbott wrote:
>>> In some cases, we need to emit ALU instructions with a certain stride
>>> due to a HW limitation. When splitting that instruction, we need to
>>> respect the original stride when creating the temporaries we load from
>>> and store into. Otherwise, we'll reintroduce the problem we were trying
>>> to work around.
>>>
>>> Signed-off-by: Connor Abbott <connor.w.abbott at intel.com>
>>> ---
>>>  src/mesa/drivers/dri/i965/brw_fs.cpp | 14 ++++++++++----
>>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> index 812648f..386e9a2 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> @@ -2370,12 +2370,14 @@ fs_visitor::opt_register_renaming()
>>>
>>>        if (depth == 0 &&
>>>            inst->dst.file == GRF &&
>>> -          alloc.sizes[inst->dst.reg] == inst->exec_size / 8 &&
>>> +          alloc.sizes[inst->dst.reg] ==
>>> +            inst->dst.component_size(inst->exec_size) &&
>>>            !inst->is_partial_write()) {
>>>           if (remap[dst] == -1) {
>>>              remap[dst] = dst;
>>>           } else {
>>> -            remap[dst] = alloc.allocate(inst->exec_size / 8);
>>> +            remap[dst] =
>>> +               alloc.allocate(inst->dst.component_size(inst->exec_size));
>>>              inst->dst.reg = remap[dst];
>>>              progress = true;
>>>           }
>>> @@ -4334,6 +4336,8 @@ fs_visitor::lower_simd_width()
>>>                     * temporary passed as source to the lowered instruction.
>>>                     */
>>>                    split_inst.src[j] = lbld.vgrf(inst->src[j].type, src_size);
>>> +                  split_inst.src[j].subreg_offset = inst->src[j].subreg_offset;
>>
>> The fixes for the wider component size (64-bits) and the stride are clearly
>> needed for doubles. I'm wondering though why the sub-register offset hasn't
>> caused us any problems before. That change is not needed just for doubles,
>> is it?
>
> I think we haven't noticed this before for two reasons:
>
> 1. It usually isn't needed for correctness.
> 2. We don't really use subreg_offset that much anyways.
>
> The splitting code is moving the inputs into a bunch of temporary
> registers, doing the split operation into a new bunch of temporaries,
> and then moving those temporaries back into the destination register.
> It's copying over the stride and subreg_offset fields from the
> original inputs and outputs, so when all is said and done, it's
> producing the same result and using the same sources. The difference
> is that the split instructions themselves weren't copying over those
> fields, meaning that they were always operating on packed registers
> (stride of 1 and subreg_offset of 0). This is normally ok, except
> where, due to a restriction, the split instruction *must* have the
> same stride/offset as the original instruction. For fp64, DF->F
> conversion must preserve the stride since we have to use a stride of 2
> for the output. The only thing I can think of that has to preserve the
> subreg_offset is math instructions, since the source offset has to be
> the same as the destination offset, but that should be a corner case
> and I'm not sure whether we'll ever get in a situation where a math
> instruction has a non-zero subreg_offset.

Actually, if we don't respect the subreg_offset, then both of the
lowered instructions will have an offset of 0, which is totally
fine... so I guess the subreg_offset part is solely about helping
optimizations (and consistency).

>
> But although I don't think it's necessary for correctness, but I think
> it should help with register coalescing. For example, take
> pack/unpackDouble2x32. I've implemented it in terms of two strided
> MOV's, which need to be split by this pass. Say we have something
> like:
>
> uint hi = unpackDouble2x32(input).y;
>
> in SIMD16 mode, this turns into:
>
> mov(16) hi.1<2>:UD input:DF
>
> But this needs to be split, and after the splitting pass (without
> fixing up the subreg_offsest like this patch does) and lowering away
> all the LOAD_PAYLOAD's, it gets turned into something like:
>
> mov(8) temp1:DF input:DF 1Q
> mov(8) temp2:DF input+1.0:DF 2Q
> mov(8) temp3<2>:UD temp1:DF 1Q
> mov(8) temp4<2>:UD temp2:DF 2Q
> mov(8) hi+0.1<2>:UD temp3<2>:UD WE_all //The WE_all is due to a hack
> in the SIMD lowering code
> mov(8) hi+1.1<2>:UD temp4<2>:UD WE_all
>
> Copy prop should have no problem cleaning up the first two MOV's, but
> the last two are more of an issue -- we currently don't have a pass
> that can detect that you can offset all reads/writes of temp3 and
> temp4 by 1 and coalesce temp3 and temp4 into hi and hi+1, and it would
> be really hard to write something like that without SSA. But if we
> pass through the offset while splitting it up, then things match up:
>
> mov(8) temp1:DF input:DF 1Q
> mov(8) temp2:DF input+1.0:DF 2Q
> mov(8) temp3+0.1<2>:UD temp1:DF 1Q
> mov(8) temp4+0.1<2>:UD temp2:DF 2Q
> mov(8) hi+0.1<2>:UD temp3+0.1<2>:UD WE_all
> mov(8) hi+1.1<2>:UD temp4+0.1<2>:UD WE_all
>
> and we should be able to coalesce away all the extra MOV's.
>
> I should probably put some of this in the commit message...
>
> Also, I realized that this patch should really be split in two, since
> the fixes to register renaming are independent of this -- whoops!
>
>>
>>> +                  split_inst.src[j].stride = inst->src[j].stride;
>>>                    emit_transpose(lbld.group(copy_width, 0),
>>>                                   split_inst.src[j], &src, 1, src_size, n);
>>>                 }
>>> @@ -4343,8 +4347,10 @@ fs_visitor::lower_simd_width()
>>>                 /* Allocate enough space to hold the result of the lowered
>>>                  * instruction and fix up the number of registers written.
>>>                  */
>>> -               split_inst.dst = dsts[i] =
>>> -                  lbld.vgrf(inst->dst.type, dst_size);
>>> +               fs_reg dst = lbld.vgrf(inst->dst.type, dst_size);
>>> +               dst.stride = inst->dst.stride;
>>> +               dst.subreg_offset = inst->dst.subreg_offset;
>>> +               split_inst.dst = dsts[i] = dst;
>>>                 split_inst.regs_written =
>>>                    DIV_ROUND_UP(inst->regs_written * lower_width,
>>>                                 inst->exec_size);
>>> --
>>> 2.4.3
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev