[Mesa-dev] [PATCH 09/12] i965/fs: make SIMD-splitting respect the original stride/offset

Connor Abbott cwabbott0 at gmail.com
Tue Aug 18 09:53:07 PDT 2015


On Tue, Aug 18, 2015 at 2:29 AM, Francisco Jerez <currojerez at riseup.net> wrote:
> Connor Abbott <cwabbott0 at gmail.com> writes:
>
>> On Mon, Aug 17, 2015 at 2:41 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>> Connor Abbott <cwabbott0 at gmail.com> writes:
>>>
>>>> On Mon, Aug 17, 2015 at 12:09 AM, Pohjolainen, Topi
>>>> <topi.pohjolainen at intel.com> wrote:
>>>>> On Fri, Aug 14, 2015 at 03:30:18PM -0700, Connor Abbott wrote:
>>>>>> In some cases, we need to emit ALU instructions with a certain stride
>>>>>> due to a HW limitation. When splitting that instruction, we need to
>>>>>> respect the original stride when creating the temporaries we load from
>>>>>> and store into. Otherwise, we'll reintroduce the problem we were trying
>>>>>> to work around.
>>>>>>
>>>>>> Signed-off-by: Connor Abbott <connor.w.abbott at intel.com>
>>>>>> ---
>>>>>>  src/mesa/drivers/dri/i965/brw_fs.cpp | 14 ++++++++++----
>>>>>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>> index 812648f..386e9a2 100644
>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>> @@ -2370,12 +2370,14 @@ fs_visitor::opt_register_renaming()
>>>>>>
>>>>>>        if (depth == 0 &&
>>>>>>            inst->dst.file == GRF &&
>>>>>> -          alloc.sizes[inst->dst.reg] == inst->exec_size / 8 &&
>>>>>> +          alloc.sizes[inst->dst.reg] ==
>>>>>> +            inst->dst.component_size(inst->exec_size) &&
>>>>>>            !inst->is_partial_write()) {
>>>>>>           if (remap[dst] == -1) {
>>>>>>              remap[dst] = dst;
>>>>>>           } else {
>>>>>> -            remap[dst] = alloc.allocate(inst->exec_size / 8);
>>>>>> +            remap[dst] =
>>>>>> +               alloc.allocate(inst->dst.component_size(inst->exec_size));
>>>>>>              inst->dst.reg = remap[dst];
>>>>>>              progress = true;
>>>>>>           }
>>>>>> @@ -4334,6 +4336,8 @@ fs_visitor::lower_simd_width()
>>>>>>                     * temporary passed as source to the lowered instruction.
>>>>>>                     */
>>>>>>                    split_inst.src[j] = lbld.vgrf(inst->src[j].type, src_size);
>>>>>> +                  split_inst.src[j].subreg_offset = inst->src[j].subreg_offset;
>>>>>
>>>>> The fixes for the wider component size (64-bits) and the stride are clearly
>>>>> needed for doubles. I'm wondering though why the sub-register offset hasn't
>>>>> caused us any problems before. That change is not needed just for doubles,
>>>>> is it?
>>>>
>>>> I think we haven't noticed this before for two reasons:
>>>>
>>>> 1. It usually isn't needed for correctness.
>>>> 2. We don't really use subreg_offset that much anyways.
>>>>
>>>> The splitting code is moving the inputs into a bunch of temporary
>>>> registers, doing the split operation into a new bunch of temporaries,
>>>> and then moving those temporaries back into the destination register.
>>>> It's copying over the stride and subreg_offset fields from the
>>>> original inputs and outputs, so when all is said and done, it's
>>>> producing the same result and using the same sources. The difference
>>>> is that the split instructions themselves weren't copying over those
>>>> fields, meaning that they were always operating on packed registers
>>>> (stride of 1 and subreg_offset of 0). This is normally ok, except
>>>> where, due to a restriction, the split instruction *must* have the
>>>> same stride/offset as the original instruction. For fp64, DF->F
>>>> conversion must preserve the stride since we have to use a stride of 2
>>>> for the output. The only thing I can think of that has to preserve the
>>>> subreg_offset is math instructions, since the source offset has to be
>>>> the same as the destination offset, but that should be a corner case
>>>> and I'm not sure whether we'll ever get in a situation where a math
>>>> instruction has a non-zero subreg_offset.
>>>>
>>> The SIMD lowering pass always uses a suboffset of zero for all
>>> non-scalar sources of the lowered instructions.  I'm not aware of any
>>> case in which that could be a problem.
>>>
>>>> But although I don't think it's necessary for correctness, I think it
>>>> should help with register coalescing. For example, take
>>>> pack/unpackDouble2x32. I've implemented it in terms of two strided
>>>> MOV's, which need to be split by this pass. Say we have something
>>>> like:
>>>>
>>>> uint hi = unpackDouble2x32(input).y;
>>>>
>>>> in SIMD16 mode, this turns into:
>>>>
>>>> mov(16) hi.1<2>:UD input:DF
>>>>
>>> I don't think this is a valid implementation of unpackDouble2x32.
>>> The partial write and offset destination shouldn't be necessary and as
>>> you've already noticed it will prevent a number of optimization passes
>>> from doing anything useful with your instruction.
>>
>> Right, I had my example backwards. It's packDouble2x32 that you have
>> to emit a MOV like that for. It would be something like:
>>
>> mov(16) output.1<2>:UD input_y:UD
>>
>
> And how exactly does it help the optimizer to introduce additional
> partial writes?  AFAICT this should already be simplified without any
> changes to the current copy propagation pass:
>
> | mov(8) temp1:UD input:UD 1Q
> | mov(8) temp2:UD input+1.0:UD 2Q
> | mov(8) temp3<1>:UD temp1:UD 1Q
> | mov(8) temp4<1>:UD temp2:UD 2Q
> | mov(8) output+0.1<2>:UD temp3:UD WE_all
> | mov(8) output+1.1<2>:UD temp4:UD WE_all
>
> into:
>
> | mov(8) output+0.1<2>:UD input:UD WE_all
> | mov(8) output+1.1<2>:UD input+1.0:UD WE_all

I think you're right, since we happen to be dealing with MOV's that
can be coalesced away.

>
> However what you seem to be suggesting:
>
> | mov(8) temp1:UD input:UD 1Q
> | mov(8) temp2:UD input+1.0:UD 2Q
> | mov(8) temp3+0.1<2>:UD temp1:UD 1Q
> | mov(8) temp4+0.1<2>:UD temp2:UD 2Q
> | mov(8) output+0.1<2>:UD temp3+0.1<2>:UD WE_all
> | mov(8) output+1.1<2>:UD temp4+0.1<2>:UD WE_all
>
> Is unlikely to be simplified further than:
>
> | mov(8) temp3+0.1<2>:UD input:UD 1Q
> | mov(8) temp4+0.1<2>:UD input+1.0:UD 2Q
> | mov(8) output+0.1<2>:UD temp3+0.1<2>:UD WE_all
> | mov(8) output+1.1<2>:UD temp4+0.1<2>:UD WE_all
>
> without introducing a whole new class of copy propagation via partial
> writes, which is going to be dubiously useful in the long run if we're
> planning to migrate the compiler back-end to SSA eventually.
>
> I think ideally you would implement packDouble2x32 using an SSA-friendly
> pseudo-instruction, like:
>
> | PACK(16) dst:UQ src0:UD src1:UD
>
> It would be cool if it could take as many sources as required to fill
> the destination type, e.g. it would be useful for the image_load_store
> to be able to do:
>
> | PACK(16) dst:UD src0:UB src1:UB src2:UB src3:UB
>
> It could probably replace the similar but less general VEC4 instruction.
> This is currently handled in the image packing and unpacking code using
> bitwise arithmetic precisely to avoid partial writes.

Yeah, we'll definitely need something like that when we get to
SSA-ifying things. I don't see any particular benefit it gives us until
then, though, and you haven't pointed out any. There are so many things
we want/need to change with the move to SSA:

- Fixing up the frontend to do predicated things with select
instructions, and coalescing those away.
- Doing SSA with flag registers, so that we can more freely move CMP
instructions around and take advantage of all the flag registers
instead of hardcoding one for discard.
- Putting the type, quarter control, and exec_all on the SSA
destination and validating that they're consistent, so that figuring
out interference is easier and we can have a LOAD_PAYLOAD whose
sources have different quarter controls.
- Handling strides like you suggest, where there are no partial
writes and no strides, and perhaps only limited partial reads through
retyping sources to a smaller type (it's not clear whether having this
as a dedicated opcode is easier), with a separate pass that figures
out what the stride should be based on HW restrictions.
- Doing SSA-based register allocation and somehow dealing with
incompatible exec_mask's, and as a consequence, actually doing real
spilling based on register pressure.
- Probably something I forgot.

and so many of them are interconnected and won't even make sense until
we have SSA that it's not always useful to think about the IR we want
to have in the distant future in order to decide what to do today.
We've talked about scrapping FS and completely re-writing it, and even
if we don't, there's going to be a *massive* amount of churn. I think
you've managed to convince me that having a separate lowering pass
that deals with the DF->F workaround and dropping this patch is
useful, since it avoids having to do copy propagation with arbitrary
strides/offsets, but unless there's some other practical benefit, I'm
not inclined to add more stuff that will only be useful for SSA. If
the SSA-based thing is going to be useful today, then sure, let's do
it, but otherwise it's just adding dead code.
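
For the record, the semantics of the PACK idea are easy enough to write
down outside the IR. Here's a rough, self-contained C++ sketch (plain
host code, not Mesa code; the function name is made up) of what the
pseudo-opcode would mean for packDouble2x32, and of the two strided
MOVs -- dst+0.0<2>:UD and dst+0.1<2>:UD -- it could be lowered to:

#include <cstdint>

static void pack_double_2x32(uint64_t *dst, const uint32_t *src0,
                             const uint32_t *src1, unsigned width)
{
   uint32_t *dst_ud = reinterpret_cast<uint32_t *>(dst);

   /* mov(width) dst+0.0<2>:UD src0:UD -- stride 2, subreg_offset 0 */
   for (unsigned c = 0; c < width; c++)
      dst_ud[2 * c + 0] = src0[c];

   /* mov(width) dst+0.1<2>:UD src1:UD -- stride 2, subreg_offset 4 bytes */
   for (unsigned c = 0; c < width; c++)
      dst_ud[2 * c + 1] = src1[c];
}

On a little-endian host each 64-bit slot ends up holding
(src1 << 32) | src0 for its channel, which is exactly what
packDouble2x32 asks for; the per-loop comments show the regioning the
two lowered MOVs would use.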


>
>>>
>>> I think you'd want something like (pre-SIMD lowering):
>>>
>>> mov(16) lo<1>:UD input.0<16;8,2>:UD
>>> mov(16) hi<1>:UD input.1<16;8,2>:UD
>>>
>>>> But this needs to be split, and after the splitting pass (without
>>>> fixing up the subreg_offsest like this patch does) and lowering away
>>>> all the LOAD_PAYLOAD's, it gets turned into something like:
>>>>
>>>> mov(8) temp1:DF input:DF 1Q
>>>> mov(8) temp2:DF input+1.0:DF 2Q
>>>> mov(8) temp3<2>:UD temp1:DF 1Q
>>>> mov(8) temp4<2>:UD temp2:DF 2Q
>>>> mov(8) hi+0.1<2>:UD temp3<2>:UD WE_all // the WE_all is due to a hack
>>>>                                        // in the SIMD lowering code
>>>> mov(8) hi+1.1<2>:UD temp4<2>:UD WE_all
>>>>
>>>> Copy prop should have no problem cleaning up the first two MOV's, but
>>>> the last two are more of an issue -- we currently don't have a pass
>>>> that can detect that you can offset all reads/writes of temp3 and
>>>> temp4 by 1 and coalesce temp3 and temp4 into hi and hi+1, and it would
>>>> be really hard to write something like that without SSA. But if we
>>>> pass through the offset while splitting it up, then things match up:
>>>>
>>>> mov(8) temp1:DF input:DF 1Q
>>>> mov(8) temp2:DF input+1.0:DF 2Q
>>>> mov(8) temp3+0.1<2>:UD temp1:DF 1Q
>>>> mov(8) temp4+0.1<2>:UD temp2:DF 2Q
>>>> mov(8) hi+0.1<2>:UD temp3+0.1<2>:UD WE_all
>>>> mov(8) hi+1.1<2>:UD temp4+0.1<2>:UD WE_all
>>>>
>>>> and we should be able to coalesce away all the extra MOV's.
>>>>
>>>
>>> I doubt this will be any easier to simplify than your first snippet.
>>> You still have strided and offset writes of temp3-4 and hi, so AFAIK
>>> both copy propagation and register coalesce are going to freak out in
>>> both cases.
>>
>> It should be quite easy to handle this in copy propagation, though --
>> allow partial writes (although consider them to write to the whole
>> register for simplicity) and only propagate if stride/offset match.
>> It's effectively doing the (more difficult) first part of coalescing
>> away the MOV, changing the offset and stride of temp3, for you.
>>
> I don't think it's about it being difficult to implement...  In fact a
> couple of weeks ago I changed the copy propagation pass to handle
> arbitrary strided copies.  But after some thought I concluded it wasn't
> worth sending for review because we probably want to eliminate most
> cases of partial writes from the back-end in the long run anyway, so it
> seemed like unnecessary complexity.
>
>>> IMO we want to avoid partial writes as much as possible,
>>> and introduce them as late as possible in the optimization loop.  I
>>> think it would be easiest if the visitor wouldn't emit any partial
>>> writes at all -- even, say, on CHV/BXT with the register alignment
>>> limitation on instructions with 64-bit execution types in place.  Later
>>> on we could run a register alignment lowering pass that would legalize
>>> the strides where they aren't supported natively by the hardware, that
>>> way you would:
>>>
>>>  - Be able to implement these restrictions in one place rather than
>>>    having to take them into account anywhere you emit or manipulate
>>>    instructions of 64-bit execution type.
>>>
>>>  - Emit partial writes in cases where you hit some hardware limitation
>>>    *only* -- As you're doing it here the SIMD lowering pass will emit
>>>    partial writes for all instructions that have a strided source (and
>>>    emit additional and most often unnecessary partial writes for
>>>    instructions that are already a partial write).  This breaks the
>>>    assumption of this pass that the zipping and unzipping copies are
>>>    cheap to copy-propagate from, and will definitely increase the
>>>    instruction count when this lowering pass is used, regardless of
>>>    whether the lowered instructions actually have any of the hardware
>>>    limitations you are trying to work around.
>>
>> I think we can just drop the part that sets the source stride. AFAIK
>> there aren't any double operations I'm emitting that need a source
>> with a special stride due to HW limitations; I just put it there for
>> consistency. I don't think it's actually used, and as you said it's
>> probably harmful. We still need to keep it around for the
>> destinations, since there we're going to emit partial writes anyways,
>> so we might as well emit partial writes that we have a chance to
>> coalesce. We could add a special lowering pass for DF->F, since that's
>> the only operation that requires special handling wrt strides, but
>> that still doesn't solve the coalescing problem with packDouble2x32.
>
> As I said earlier this should probably be simplified by copy
> propagation.
>
>> It also won't solve most of the problems you mentioned, since when we
>> do the lowering to create the partial write, we're still going to emit
>> a second MOV, and we're still going to want to coalesce it away for
>> everything where we can, so copy propagation still needs to understand
>> all the stride restrictions, and those restrictions are where most of
>> the fixes were.
>
> But you do realize that it's only because of your proposed "fix" making
> copy propagation propagate via partial writes that it's then going to
> undo your lowered strides, which is what then forces you to make it aware
> of these hardware restrictions -- otherwise, if you just implemented them
> in a lowering pass and then ran copy propagation, it wouldn't even
> consider the partial writes used to legalize the strides as valid
> propagation candidates AFAIK.

No, we still need to make copy propagation aware of these
restrictions, even with the lowering pass. For example, say you have
something like:

MOV(8) temp:F input:DF
SQRT(8) out1:F temp:F //some math instruction that uses temp
ADD(8) out2:F temp:F foo:F //some normal instruction that uses temp

This is going to get legalized to:

MOV(8) temp2<2>:F input:DF
MOV(8) temp:F temp2<2>:F
SQRT(8) out1:F temp:F //some math instruction that uses temp
ADD(8) out2:F temp:F foo:F //some normal instruction that uses temp

Now, we want to copy-propagate into the normal instruction but not the
math instruction. Either we could make copy-propagation aware of this,
like my patch does, or we could add yet another pass that legalizes
math instructions' strides (and sometimes produces code that could be
CSE'd). But you can't do both in the same pass, since it would then
fight with copy propagation. Given the two options, fixing copy
propagation seems easier to me.
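
To make that concrete, here's a rough, self-contained sketch (plain
C++, not the actual brw_fs_copy_propagation.cpp logic; the struct and
helper names are made up) of the kind of check I have in mind:
propagate the strided value into ordinary instructions, but refuse when
the consumer is a math instruction whose source region would no longer
match its destination region:

struct region {
   unsigned stride;        /* in components */
   unsigned subreg_offset; /* in bytes */
};

/* Would a value living in region "value" be acceptable as a source of the
 * instruction consuming it?  Ordinary ALU instructions are assumed fine here
 * with any stride/offset (a simplification); math instructions need the
 * source region to line up with the destination region. */
static bool
can_propagate_strided_value(const region &value, bool use_is_math,
                            const region &use_dst)
{
   if (!use_is_math)
      return true;

   return value.stride == use_dst.stride &&
          value.subreg_offset == use_dst.subreg_offset;
}

In the example above, temp2's region is stride 2 / offset 0; the SQRT
writes a packed out1, so the check refuses to propagate temp2 into it,
while propagating into the ADD is fine.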

>
>> I didn't have to fix anything to understand that a DF->F move needs to
>> have a destination stride of 2, besides this code -- nothing else
>> touches it, so the places where I needed to take stride into account
>> when manipulating instructions of 64-bit type were here and when
>> emitting the special MOV for DF->F. So in the end, adding a stride
>> legalization pass would be more code than what I have now, and for not
>> much benefit.
>>
>>>
>>> And BTW, this patch doesn't take into account the larger number of VGRFs
>>> you would need to hold the temporaries in strided and offset form.
>>
>> It does for destinations but not sources, since we were already using
>> regs_written for destinations and over-allocating anyways.
>>
> No, it's not, it allocates just enough space for the number of vector
> components times the lowered width.
>
>>>
>>>> I should probably put some of this in the commit message...
>>>>
>>>> Also, I realized that this patch should really be split in two, since
>>>> the fixes to register renaming are independent of this -- whoops!
>>>>
>>>>>
>>>>>> +                  split_inst.src[j].stride = inst->src[j].stride;
>>>>>>                    emit_transpose(lbld.group(copy_width, 0),
>>>>>>                                   split_inst.src[j], &src, 1, src_size, n);
>>>>>>                 }
>>>>>> @@ -4343,8 +4347,10 @@ fs_visitor::lower_simd_width()
>>>>>>                 /* Allocate enough space to hold the result of the lowered
>>>>>>                  * instruction and fix up the number of registers written.
>>>>>>                  */
>>>>>> -               split_inst.dst = dsts[i] =
>>>>>> -                  lbld.vgrf(inst->dst.type, dst_size);
>>>>>> +               fs_reg dst = lbld.vgrf(inst->dst.type, dst_size);
>>>>>> +               dst.stride = inst->dst.stride;
>>>>>> +               dst.subreg_offset = inst->dst.subreg_offset;
>>>>>> +               split_inst.dst = dsts[i] = dst;
>>>>>>                 split_inst.regs_written =
>>>>>>                    DIV_ROUND_UP(inst->regs_written * lower_width,
>>>>>>                                 inst->exec_size);
>>>>>> --
>>>>>> 2.4.3
>>>>>>