[Mesa-dev] [PATCH 09/12] i965/fs: make SIMD-splitting respect the original stride/offset

Connor Abbott cwabbott0 at gmail.com
Mon Aug 17 09:47:53 PDT 2015


On Mon, Aug 17, 2015 at 2:41 AM, Francisco Jerez <currojerez at riseup.net> wrote:
> Connor Abbott <cwabbott0 at gmail.com> writes:
>
>> On Mon, Aug 17, 2015 at 12:09 AM, Pohjolainen, Topi
>> <topi.pohjolainen at intel.com> wrote:
>>> On Fri, Aug 14, 2015 at 03:30:18PM -0700, Connor Abbott wrote:
>>>> In some cases, we need to emit ALU instructions with a certain stride
>>>> due to a HW limitation. When splitting that instruction, we need to
>>>> respect the original stride when creating the temporaries we load from
>>>> and store into. Otherwise, we'll reintroduce the problem we were trying
>>>> to work around.
>>>>
>>>> Signed-off-by: Connor Abbott <connor.w.abbott at intel.com>
>>>> ---
>>>>  src/mesa/drivers/dri/i965/brw_fs.cpp | 14 ++++++++++----
>>>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>> index 812648f..386e9a2 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>> @@ -2370,12 +2370,14 @@ fs_visitor::opt_register_renaming()
>>>>
>>>>        if (depth == 0 &&
>>>>            inst->dst.file == GRF &&
>>>> -          alloc.sizes[inst->dst.reg] == inst->exec_size / 8 &&
>>>> +          alloc.sizes[inst->dst.reg] ==
>>>> +            inst->dst.component_size(inst->exec_size) &&
>>>>            !inst->is_partial_write()) {
>>>>           if (remap[dst] == -1) {
>>>>              remap[dst] = dst;
>>>>           } else {
>>>> -            remap[dst] = alloc.allocate(inst->exec_size / 8);
>>>> +            remap[dst] =
>>>> +               alloc.allocate(inst->dst.component_size(inst->exec_size));
>>>>              inst->dst.reg = remap[dst];
>>>>              progress = true;
>>>>           }
>>>> @@ -4334,6 +4336,8 @@ fs_visitor::lower_simd_width()
>>>>                     * temporary passed as source to the lowered instruction.
>>>>                     */
>>>>                    split_inst.src[j] = lbld.vgrf(inst->src[j].type, src_size);
>>>> +                  split_inst.src[j].subreg_offset = inst->src[j].subreg_offset;
>>>
>>> The fixes for the wider component size (64-bits) and the stride are clearly
>>> needed for doubles. I'm wondering though why the sub-register offset hasn't
>>> caused us any problems before. That change is not needed just for doubles,
>>> is it?
>>
>> I think we haven't noticed this before for two reasons:
>>
>> 1. It usually isn't needed for correctness.
>> 2. We don't really use subreg_offset that much anyways.
>>
>> The splitting code is moving the inputs into a bunch of temporary
>> registers, doing the split operation into a new bunch of temporaries,
>> and then moving those temporaries back into the destination register.
>> It's copying over the stride and subreg_offset fields from the
>> original inputs and outputs, so when all is said and done, it's
>> producing the same result and using the same sources. The difference
>> is that the split instructions themselves weren't copying over those
>> fields, meaning that they were always operating on packed registers
>> (stride of 1 and subreg_offset of 0). This is normally ok, except
>> where, due to a restriction, the split instruction *must* have the
>> same stride/offset as the original instruction. For fp64, DF->F
>> conversion must preserve the stride since we have to use a stride of 2
>> for the output. The only thing I can think of that has to preserve the
>> subreg_offset is math instructions, since the source offset has to be
>> the same as the destination offset, but that should be a corner case
>> and I'm not sure whether we'll ever get in a situation where a math
>> instruction has a non-zero subreg_offset.
>>
> The SIMD lowering pass always uses a suboffset of zero for all
> non-scalar sources of the lowered instructions.  I'm not aware of any
> case in which that could be a problem.
>
>> Although I don't think it's necessary for correctness, I think
>> it should help with register coalescing. For example, take
>> pack/unpackDouble2x32. I've implemented it in terms of two strided
>> MOV's, which need to be split by this pass. Say we have something
>> like:
>>
>> uint hi = unpackDouble2x32(input).y;
>>
>> in SIMD16 mode, this turns into:
>>
>> mov(16) hi.1<2>:UD input:DF
>>
> I don't think this is a valid implementation of unpackDouble2x32.
> The partial write and offset destination shouldn't be necessary and as
> you've already noticed it will prevent a number of optimization passes
> from doing anything useful with your instruction.

Right, I had my example backwards. It's packDouble2x32 that you have
to emit a MOV like that for. It would be something like:

mov(16) output.1<2>:UD input_y:UD
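In the visitor that would come from something roughly like this (just
a sketch -- output/input_x/input_y are placeholder names, the point is
only the stride/subreg_offset handling):

   /* packDouble2x32: write x into the low and y into the high dword
    * of each 64-bit result slot, i.e. two strided UD MOVs. */
   fs_reg dst_ud = retype(output, BRW_REGISTER_TYPE_UD);
   dst_ud.stride = 2;
   bld.MOV(dst_ud, input_x);     /* output.0<2>:UD = x */
   dst_ud.subreg_offset = 4;     /* one UD in, i.e. the ".1" above */
   bld.MOV(dst_ud, input_y);     /* output.1<2>:UD = y */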

>
> I think you'd want something like (pre-SIMD lowering):
>
> mov(16) lo<1>:UD input.0<16;8,2>:UD
> mov(16) hi<1>:UD input.1<16;8,2>:UD
>
>> But this needs to be split, and after the splitting pass (without
>> fixing up the subreg_offset like this patch does) and lowering away
>> all the LOAD_PAYLOAD's, it gets turned into something like:
>>
>> mov(8) temp1:DF input:DF 1Q
>> mov(8) temp2:DF input+1.0:DF 2Q
>> mov(8) temp3<2>:UD temp1:DF 1Q
>> mov(8) temp4<2>:UD temp2:DF 2Q
>> mov(8) hi+0.1<2>:UD temp3<2>:UD WE_all
>> mov(8) hi+1.1<2>:UD temp4<2>:UD WE_all
>> (the WE_all is due to a hack in the SIMD lowering code)
>>
>> Copy prop should have no problem cleaning up the first two MOV's, but
>> the last two are more of an issue -- we currently don't have a pass
>> that can detect that you can offset all reads/writes of temp3 and
>> temp4 by 1 and coalesce temp3 and temp4 into hi and hi+1, and it would
>> be really hard to write something like that without SSA. But if we
>> pass through the offset while splitting it up, then things match up:
>>
>> mov(8) temp1:DF input:DF 1Q
>> mov(8) temp2:DF input+1.0:DF 2Q
>> mov(8) temp3+0.1<2>:UD temp1:DF 1Q
>> mov(8) temp4+0.1<2>:UD temp2:DF 2Q
>> mov(8) hi+0.1<2>:UD temp3+0.1<2>:UD WE_all
>> mov(8) hi+1.1<2>:UD temp4+0.1<2>:UD WE_all
>>
>> and we should be able to coalesce away all the extra MOV's.
>>
>
> I doubt this will be any easier to simplify than your first snippet.
> You still have strided and offset writes of temp3-4 and hi, so AFAIK
> both copy propagation and register coalesce are going to freak out in
> both cases.

It should be quite easy to handle this in copy propagation, though --
allow partial writes (but consider them to write the whole register,
for simplicity) and only propagate when the stride/offset match. That
effectively does the harder first part of coalescing away the MOV,
changing the offset and stride of temp3, for you.
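Concretely, the extra check in try_copy_propagate() could be something
like this (just a sketch against the acp_entry layout in
brw_fs_copy_propagation.cpp, untested):

   /* Only propagate through a partially-written copy when the copy's
    * destination region matches the source region we'd be rewriting,
    * so the regioning stays the same after propagation. */
   if (entry->dst.stride != inst->src[arg].stride ||
       entry->dst.subreg_offset != inst->src[arg].subreg_offset)
      return false;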

> IMO we want to avoid partial writes as much as possible,
> and introduce them as late as possible in the optimization loop.  I
> think it would be easiest if the visitor wouldn't emit any partial
> writes at all -- even, say, on CHV/BXT with the register alignment
> limitation on instructions with 64-bit execution types in place.  Later
> on we could run a register alignment lowering pass that would legalize
> the strides where they aren't supported natively by the hardware, that
> way you would:
>
>  - Be able to implement these restrictions in one place rather than
>    having to take them into account anywhere you emit or manipulate
>    instructions of 64-bit execution type.
>
>  - Emit partial writes in cases where you hit some hardware limitation
>    *only* -- As you're doing it here the SIMD lowering pass will emit
>    partial writes for all instructions that have a strided source (and
>    emit additional and most often unnecessary partial writes for
>    instructions that are already a partial write).  This breaks the
>    assumption of this pass that the zipping and unzipping copies are
>    cheap to copy-propagate from, and will definitely increase the
>    instruction count when this lowering pass is used, regardless of
>    whether the lowered instructions actually have any of the hardware
>    limitations you are trying to work around.

I think we can just drop the part that sets the source stride. AFAIK
none of the double operations I'm emitting need a source with a
special stride due to HW limitations; I only put it there for
consistency. I don't think it's actually used, and as you said it's
probably harmful. We still need to keep it for the destinations,
since we're going to emit partial writes there anyway, so we might as
well emit partial writes that we have a chance to coalesce. We could
add a special lowering pass for DF->F, since that's the only
operation that requires special handling wrt strides, but that still
doesn't solve the coalescing problem with packDouble2x32. It also
won't solve most of the problems you mentioned: when we do the
lowering to create the partial write, we're still going to emit a
second MOV, and we're still going to want to coalesce it away
wherever we can, so copy propagation still needs to understand all
the stride restrictions, and those restrictions are where most of the
fixes were. Besides this code, I didn't have to fix anything else to
handle the fact that a DF->F move needs a destination stride of 2 --
nothing else touches it -- so the only places where I had to take the
stride into account when manipulating instructions of 64-bit type
were here and when emitting the special MOV for DF->F. So in the end,
adding a stride legalization pass would be more code than what I have
now, for not much benefit.
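For reference, the shape I'd expect such a legalization pass to produce
for a DF->F conversion is roughly (sketch only, not actual output):

mov(8) temp<2>:F  src:DF     // HW wants a destination stride of 2 here
mov(8) dst<1>:F   temp<2>:F  // the second MOV, packing back to stride 1

and that second MOV is exactly what we'd then want to coalesce away.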

>
> And BTW, this patch doesn't take into account the larger number of VGRFs
> you would need to hold the temporaries in strided and offset form.

It does for destinations but not sources, since we were already using
regs_written for destinations and over-allocating anyways.
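Accounting for the sources would mean over-allocating the temporaries
by the stride, something like (sketch only):

   split_inst.src[j] = lbld.vgrf(inst->src[j].type,
                                 src_size * MAX2(inst->src[j].stride, 1));

though as I said above I'll probably just drop the source-stride part
instead.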

>
>> I should probably put some of this in the commit message...
>>
>> Also, I realized that this patch should really be split in two, since
>> the fixes to register renaming are independent of this -- whoops!
>>
>>>
>>>> +                  split_inst.src[j].stride = inst->src[j].stride;
>>>>                    emit_transpose(lbld.group(copy_width, 0),
>>>>                                   split_inst.src[j], &src, 1, src_size, n);
>>>>                 }
>>>> @@ -4343,8 +4347,10 @@ fs_visitor::lower_simd_width()
>>>>                 /* Allocate enough space to hold the result of the lowered
>>>>                  * instruction and fix up the number of registers written.
>>>>                  */
>>>> -               split_inst.dst = dsts[i] =
>>>> -                  lbld.vgrf(inst->dst.type, dst_size);
>>>> +               fs_reg dst = lbld.vgrf(inst->dst.type, dst_size);
>>>> +               dst.stride = inst->dst.stride;
>>>> +               dst.subreg_offset = inst->dst.subreg_offset;
>>>> +               split_inst.dst = dsts[i] = dst;
>>>>                 split_inst.regs_written =
>>>>                    DIV_ROUND_UP(inst->regs_written * lower_width,
>>>>                                 inst->exec_size);
>>>> --
>>>> 2.4.3
>>>>

