[Mesa-dev] [PATCH 09/12] i965/fs: make SIMD-splitting respect the original stride/offset

Francisco Jerez currojerez at riseup.net
Mon Aug 17 02:41:12 PDT 2015

Connor Abbott <cwabbott0 at gmail.com> writes:

> On Mon, Aug 17, 2015 at 12:09 AM, Pohjolainen, Topi
> <topi.pohjolainen at intel.com> wrote:
>> On Fri, Aug 14, 2015 at 03:30:18PM -0700, Connor Abbott wrote:
>>> In some cases, we need to emit ALU instructions with a certain stride
>>> due to a HW limitation. When splitting that instruction, we need to
>>> respect the original stride when creating the temporaries we load from
>>> and store into. Otherwise, we'll reintroduce the problem we were trying
>>> to work around.
>>> Signed-off-by: Connor Abbott <connor.w.abbott at intel.com>
>>> ---
>>>  src/mesa/drivers/dri/i965/brw_fs.cpp | 14 ++++++++++----
>>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> index 812648f..386e9a2 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> @@ -2370,12 +2370,14 @@ fs_visitor::opt_register_renaming()
>>>        if (depth == 0 &&
>>>            inst->dst.file == GRF &&
>>> -          alloc.sizes[inst->dst.reg] == inst->exec_size / 8 &&
>>> +          alloc.sizes[inst->dst.reg] ==
>>> +            inst->dst.component_size(inst->exec_size) &&
>>>            !inst->is_partial_write()) {
>>>           if (remap[dst] == -1) {
>>>              remap[dst] = dst;
>>>           } else {
>>> -            remap[dst] = alloc.allocate(inst->exec_size / 8);
>>> +            remap[dst] =
>>> +               alloc.allocate(inst->dst.component_size(inst->exec_size));
>>>              inst->dst.reg = remap[dst];
>>>              progress = true;
>>>           }
>>> @@ -4334,6 +4336,8 @@ fs_visitor::lower_simd_width()
>>>                     * temporary passed as source to the lowered instruction.
>>>                     */
>>>                    split_inst.src[j] = lbld.vgrf(inst->src[j].type, src_size);
>>> +                  split_inst.src[j].subreg_offset = inst->src[j].subreg_offset;
>> The fixes for the wider component size (64-bits) and the stride are clearly
>> needed for doubles. I'm wondering though why the sub-register offset hasn't
>> caused us any problems before. That change is not needed just for doubles,
>> is it?
> I think we haven't noticed this before for two reasons:
> 1. It usually isn't needed for correctness.
> 2. We don't really use subreg_offset that much anyway.
> The splitting code is moving the inputs into a bunch of temporary
> registers, doing the split operation into a new bunch of temporaries,
> and then moving those temporaries back into the destination register.
> It's copying over the stride and subreg_offset fields from the
> original inputs and outputs, so when all is said and done, it's
> producing the same result and using the same sources. The difference
> is that the split instructions themselves weren't copying over those
> fields, meaning that they were always operating on packed registers
> (stride of 1 and subreg_offset of 0). This is normally ok, except
> where, due to a restriction, the split instruction *must* have the
> same stride/offset as the original instruction. For fp64, DF->F
> conversion must preserve the stride since we have to use a stride of 2
> for the output. The only thing I can think of that has to preserve the
> subreg_offset is math instructions, since the source offset has to be
> the same as the destination offset, but that should be a corner case
> and I'm not sure whether we'll ever get in a situation where a math
> instruction has a non-zero subreg_offset.
The SIMD lowering pass always uses a suboffset of zero for all
non-scalar sources of the lowered instructions.  I'm not aware of any
case in which that could be a problem.

> But although I don't think it's necessary for correctness, I think it
> should help with register coalescing. For example, take
> pack/unpackDouble2x32. I've implemented it in terms of two strided
> MOV's, which need to be split by this pass. Say we have something
> like:
> uint hi = unpackDouble2x32(input).y;
> in SIMD16 mode, this turns into:
> mov(16) hi.1<2>:UD input:DF
I don't think this is a valid implementation of unpackDouble2x32.  
The partial write and offset destination shouldn't be necessary, and as
you've already noticed they will prevent a number of optimization passes
from doing anything useful with your instruction.

I think you'd want something like (pre-SIMD lowering):

mov(16) lo<1>:UD input.0<16;8,2>:UD
mov(16) hi<1>:UD input.1<16;8,2>:UD
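For reference, a standalone model of what those two region reads compute
(plain Python, not driver code -- just the lo/hi split semantics of
unpackDouble2x32, assuming IEEE-754 doubles and little-endian packing):

```python
import struct

def unpack_double_2x32(d):
    """Scalar model of unpackDouble2x32: view one DF value as two UD words.

    The <16;8,2>:UD regions above read the same data: a stride-2 UD walk
    over the 64-bit elements, with suboffset 0 picking the low word and
    suboffset 1 the high word (little-endian).
    """
    raw = struct.pack('<d', d)          # the 8 bytes of the double
    lo, hi = struct.unpack('<II', raw)  # two 32-bit words, low word first
    return lo, hi

lo, hi = unpack_double_2x32(1.0)
# 1.0 is 0x3FF0000000000000, so lo == 0 and hi == 0x3FF00000
```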

> But this needs to be split, and after the splitting pass (without
> fixing up the subreg_offset like this patch does) and lowering away
> all the LOAD_PAYLOAD's, it gets turned into something like:
> mov(8) temp1:DF input:DF 1Q
> mov(8) temp2:DF input+1.0:DF 2Q
> mov(8) temp3<2>:UD temp1:DF 1Q
> mov(8) temp4<2>:UD temp2:DF 2Q
> mov(8) hi+0.1<2>:UD temp3<2>:UD WE_all //The WE_all is due to a hack
> in the SIMD lowering code
> mov(8) hi+1.1<2>:UD temp4<2>:UD WE_all
> Copy prop should have no problem cleaning up the first two MOV's, but
> the last two are more of an issue -- we currently don't have a pass
> that can detect that you can offset all reads/writes of temp3 and
> temp4 by 1 and coalesce temp3 and temp4 into hi and hi+1, and it would
> be really hard to write something like that without SSA. But if we
> pass through the offset while splitting it up, then things match up:
> mov(8) temp1:DF input:DF 1Q
> mov(8) temp2:DF input+1.0:DF 2Q
> mov(8) temp3+0.1<2>:UD temp1:DF 1Q
> mov(8) temp4+0.1<2>:UD temp2:DF 2Q
> mov(8) hi+0.1<2>:UD temp3+0.1<2>:UD WE_all
> mov(8) hi+1.1<2>:UD temp4+0.1<2>:UD WE_all
> and we should be able to coalesce away all the extra MOV's.

I doubt this will be any easier to simplify than your first snippet.
You still have strided and offset writes of temp3-4 and hi, so AFAIK
both copy propagation and register coalesce are going to freak out in
both cases.  IMO we want to avoid partial writes as much as possible,
and introduce them as late as possible in the optimization loop.  I
think it would be easiest if the visitor wouldn't emit any partial
writes at all -- even, say, on CHV/BXT with the register alignment
limitation on instructions with 64-bit execution types in place.  Later
on we could run a register alignment lowering pass that would legalize
the strides where they aren't supported natively by the hardware, that
way you would:

 - Be able to implement these restrictions in one place rather than
   having to take them into account anywhere you emit or manipulate
   instructions of 64-bit execution type.

 - Emit partial writes in cases where you hit some hardware limitation
   *only* -- as you're doing it here, the SIMD lowering pass will emit
   partial writes for all instructions that have a strided source (and
   emit additional and most often unnecessary partial writes for
   instructions that are already a partial write).  This breaks the
   assumption of this pass that the zipping and unzipping copies are
   cheap to copy-propagate from, and will definitely increase the
   instruction count when this lowering pass is used, regardless of
   whether the lowered instructions actually have any of the hardware
   limitations you are trying to work around.
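To make the idea concrete, here's a toy model of such a legalization pass
(hypothetical Python, nothing to do with the actual fs_visitor data
structures; "Inst" and the restriction predicate are made up for
illustration): instructions whose destination stride the hardware can't
do natively get a packed temporary plus a strided fix-up MOV, and
everything else passes through untouched:

```python
# Toy model of a late stride-legalization lowering pass.  "Inst" is a
# made-up stand-in for an IR instruction; real fs_inst has far more state.
from collections import namedtuple

Inst = namedtuple('Inst', ['op', 'type_sz', 'dst', 'dst_stride'])

def is_legal(inst):
    # Hypothetical restriction: 64-bit execution types must write packed
    # destinations; narrower types may be strided.
    return inst.type_sz < 8 or inst.dst_stride == 1

def legalize_strides(insts):
    """Rewrite illegal destination strides late, in one place.

    An illegal instruction writes a packed temporary instead, followed by
    a strided fix-up MOV of 32-bit type (which the restriction above
    allows).  Legal instructions are left alone, so no spurious partial
    writes are introduced for them.
    """
    out = []
    for i, inst in enumerate(insts):
        if is_legal(inst):
            out.append(inst)
        else:
            tmp = 'tmp%d' % i
            out.append(inst._replace(dst=tmp, dst_stride=1))   # packed write
            out.append(Inst('mov', 4, inst.dst, inst.dst_stride))  # fix-up
    return out

prog = [Inst('add', 4, 'a', 1),   # 32-bit, packed: already legal
        Inst('mul', 8, 'b', 2)]   # 64-bit, strided: needs lowering
legal = legalize_strides(prog)
# only the 64-bit strided write grows into two instructions
```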

And BTW, this patch doesn't take into account the larger number of VGRFs
you would need to hold the temporaries in strided and offset form.
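To put numbers on that: assuming component_size() scales roughly like
MAX2(width * stride, 1) * type_sz (a guess at the formula -- check the
real helper), keeping a SIMD8 temporary at stride 2 doubles its VGRF
allocation:

```python
def component_size(width, stride, type_size):
    # Rough model of fs_reg::component_size(): bytes spanned by the
    # region, scaling with the stride (an assumption, not the exact source).
    return max(width * stride, 1) * type_size

REG_SIZE = 32  # bytes in one GRF

packed  = component_size(8, 1, 4)   # SIMD8 packed UD temporary: 1 GRF
strided = component_size(8, 2, 4)   # same temporary kept at stride 2: 2 GRFs
```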

> I should probably put some of this in the commit message...
> Also, I realized that this patch should really be split in two, since
> the fixes to register renaming are independent of this -- whoops!
>>> +                  split_inst.src[j].stride = inst->src[j].stride;
>>>                    emit_transpose(lbld.group(copy_width, 0),
>>>                                   split_inst.src[j], &src, 1, src_size, n);
>>>                 }
>>> @@ -4343,8 +4347,10 @@ fs_visitor::lower_simd_width()
>>>                 /* Allocate enough space to hold the result of the lowered
>>>                  * instruction and fix up the number of registers written.
>>>                  */
>>> -               split_inst.dst = dsts[i] =
>>> -                  lbld.vgrf(inst->dst.type, dst_size);
>>> +               fs_reg dst = lbld.vgrf(inst->dst.type, dst_size);
>>> +               dst.stride = inst->dst.stride;
>>> +               dst.subreg_offset = inst->dst.subreg_offset;
>>> +               split_inst.dst = dsts[i] = dst;
>>>                 split_inst.regs_written =
>>>                    DIV_ROUND_UP(inst->regs_written * lower_width,
>>>                                 inst->exec_size);
>>> --
>>> 2.4.3
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev