[Mesa-dev] [PATCH 09/12] i965/fs: make SIMD-splitting respect the original stride/offset

Francisco Jerez currojerez at riseup.net
Tue Aug 18 02:29:18 PDT 2015


Connor Abbott <cwabbott0 at gmail.com> writes:

> On Mon, Aug 17, 2015 at 2:41 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Connor Abbott <cwabbott0 at gmail.com> writes:
>>
>>> On Mon, Aug 17, 2015 at 12:09 AM, Pohjolainen, Topi
>>> <topi.pohjolainen at intel.com> wrote:
>>>> On Fri, Aug 14, 2015 at 03:30:18PM -0700, Connor Abbott wrote:
>>>>> In some cases, we need to emit ALU instructions with a certain stride
>>>>> due to a HW limitation. When splitting that instruction, we need to
>>>>> respect the original stride when creating the temporaries we load from
>>>>> and store into. Otherwise, we'll reintroduce the problem we were trying
>>>>> to work around.
>>>>>
>>>>> Signed-off-by: Connor Abbott <connor.w.abbott at intel.com>
>>>>> ---
>>>>>  src/mesa/drivers/dri/i965/brw_fs.cpp | 14 ++++++++++----
>>>>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> index 812648f..386e9a2 100644
>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> @@ -2370,12 +2370,14 @@ fs_visitor::opt_register_renaming()
>>>>>
>>>>>        if (depth == 0 &&
>>>>>            inst->dst.file == GRF &&
>>>>> -          alloc.sizes[inst->dst.reg] == inst->exec_size / 8 &&
>>>>> +          alloc.sizes[inst->dst.reg] ==
>>>>> +            inst->dst.component_size(inst->exec_size) &&
>>>>>            !inst->is_partial_write()) {
>>>>>           if (remap[dst] == -1) {
>>>>>              remap[dst] = dst;
>>>>>           } else {
>>>>> -            remap[dst] = alloc.allocate(inst->exec_size / 8);
>>>>> +            remap[dst] =
>>>>> +               alloc.allocate(inst->dst.component_size(inst->exec_size));
>>>>>              inst->dst.reg = remap[dst];
>>>>>              progress = true;
>>>>>           }
>>>>> @@ -4334,6 +4336,8 @@ fs_visitor::lower_simd_width()
>>>>>                     * temporary passed as source to the lowered instruction.
>>>>>                     */
>>>>>                    split_inst.src[j] = lbld.vgrf(inst->src[j].type, src_size);
>>>>> +                  split_inst.src[j].subreg_offset = inst->src[j].subreg_offset;
>>>>
>>>> The fixes for the wider component size (64-bits) and the stride are clearly
>>>> needed for doubles. I'm wondering though why the sub-register offset hasn't
>>>> caused us any problems before. That change is not needed just for doubles,
>>>> is it?
>>>
>>> I think we haven't noticed this before for two reasons:
>>>
>>> 1. It usually isn't needed for correctness.
>>> 2. We don't really use subreg_offset that much anyways.
>>>
>>> The splitting code is moving the inputs into a bunch of temporary
>>> registers, doing the split operation into a new bunch of temporaries,
>>> and then moving those temporaries back into the destination register.
>>> It's copying over the stride and subreg_offset fields from the
>>> original inputs and outputs, so when all is said and done, it's
>>> producing the same result and using the same sources. The difference
>>> is that the split instructions themselves weren't copying over those
>>> fields, meaning that they were always operating on packed registers
>>> (stride of 1 and subreg_offset of 0). This is normally ok, except
>>> where, due to a restriction, the split instruction *must* have the
>>> same stride/offset as the original instruction. For fp64, DF->F
>>> conversion must preserve the stride since we have to use a stride of 2
>>> for the output. The only thing I can think of that has to preserve the
>>> subreg_offset is math instructions, since the source offset has to be
>>> the same as the destination offset, but that should be a corner case
>>> and I'm not sure whether we'll ever get in a situation where a math
>>> instruction has a non-zero subreg_offset.
>>>
>> The SIMD lowering pass always uses a suboffset of zero for all
>> non-scalar sources of the lowered instructions.  I'm not aware of any
>> case in which that could be a problem.
>>
>>> Although I don't think it's necessary for correctness, I think
>>> it should help with register coalescing. For example, take
>>> pack/unpackDouble2x32. I've implemented it in terms of two strided
>>> MOV's, which need to be split by this pass. Say we have something
>>> like:
>>>
>>> uint hi = unpackDouble2x32(input).y;
>>>
>>> in SIMD16 mode, this turns into:
>>>
>>> mov(16) hi.1<2>:UD input:DF
>>>
>> I don't think this is a valid implementation of unpackDouble2x32.
>> The partial write and offset destination shouldn't be necessary and as
>> you've already noticed it will prevent a number of optimization passes
>> from doing anything useful with your instruction.
>
> Right, I had my example backwards. It's packDouble2x32 that you have
> to emit a MOV like that for. It would be something like:
>
> mov(16) output.1<2>:UD input_y:UD
>

And how exactly does it help the optimizer to introduce additional
partial writes?  AFAICT this should already be simplified without any
changes to the current copy propagation pass:

| mov(8) temp1:UD input:UD 1Q
| mov(8) temp2:UD input+1.0:UD 2Q
| mov(8) temp3<1>:UD temp1:UD 1Q
| mov(8) temp4<1>:UD temp2:UD 2Q
| mov(8) output+0.1<2>:UD temp3:UD WE_all
| mov(8) output+1.1<2>:UD temp4:UD WE_all

into:

| mov(8) output+0.1<2>:UD input:UD WE_all
| mov(8) output+1.1<2>:UD input+1.0:UD WE_all

However, what you seem to be suggesting:

| mov(8) temp1:UD input:UD 1Q
| mov(8) temp2:UD input+1.0:UD 2Q
| mov(8) temp3+0.1<2>:UD temp1:UD 1Q
| mov(8) temp4+0.1<2>:UD temp2:UD 2Q
| mov(8) output+0.1<2>:UD temp3+0.1<2>:UD WE_all
| mov(8) output+1.1<2>:UD temp4+0.1<2>:UD WE_all

is unlikely to be simplified further than:

| mov(8) temp3+0.1<2>:UD input:UD 1Q
| mov(8) temp4+0.1<2>:UD input+1.0:UD 2Q
| mov(8) output+0.1<2>:UD temp3+0.1<2>:UD WE_all
| mov(8) output+1.1<2>:UD temp4+0.1<2>:UD WE_all

without introducing a whole new class of copy propagation via partial
writes, which is of dubious usefulness in the long run if we're
planning to migrate the compiler back-end to SSA eventually.

I think ideally you would implement packDouble2x32 using an SSA-friendly
pseudo-instruction, like:

| PACK(16) dst:UQ src0:UD src1:UD

It would be cool if it could take as many sources as required to fill
the destination type, e.g. it would be useful for the image_load_store
code to be able to do:

| PACK(16) dst:UD src0:UB src1:UB src2:UB src3:UB

It could probably replace the similar but less general VEC4 instruction.
Packing is currently handled in the image packing and unpacking code
using bitwise arithmetic, precisely to avoid partial writes.

>>
>> I think you'd want something like (pre-SIMD lowering):
>>
>> mov(16) lo<1>:UD input.0<16;8,2>:UD
>> mov(16) hi<1>:UD input.1<16;8,2>:UD
>>
>>> But this needs to be split, and after the splitting pass (without
>>> fixing up the subreg_offset like this patch does) and lowering away
>>> all the LOAD_PAYLOAD's, it gets turned into something like:
>>>
>>> mov(8) temp1:DF input:DF 1Q
>>> mov(8) temp2:DF input+1.0:DF 2Q
>>> mov(8) temp3<2>:UD temp1:DF 1Q
>>> mov(8) temp4<2>:UD temp2:DF 2Q
>>> mov(8) hi+0.1<2>:UD temp3<2>:UD WE_all
>>> mov(8) hi+1.1<2>:UD temp4<2>:UD WE_all
>>> (the WE_all on the last two MOVs is due to a hack in the SIMD
>>> lowering code)
>>>
>>> Copy prop should have no problem cleaning up the first two MOV's, but
>>> the last two are more of an issue -- we currently don't have a pass
>>> that can detect that you can offset all reads/writes of temp3 and
>>> temp4 by 1 and coalesce temp3 and temp4 into hi and hi+1, and it would
>>> be really hard to write something like that without SSA. But if we
>>> pass through the offset while splitting it up, then things match up:
>>>
>>> mov(8) temp1:DF input:DF 1Q
>>> mov(8) temp2:DF input+1.0:DF 2Q
>>> mov(8) temp3+0.1<2>:UD temp1:DF 1Q
>>> mov(8) temp4+0.1<2>:UD temp2:DF 2Q
>>> mov(8) hi+0.1<2>:UD temp3+0.1<2>:UD WE_all
>>> mov(8) hi+1.1<2>:UD temp4+0.1<2>:UD WE_all
>>>
>>> and we should be able to coalesce away all the extra MOV's.
>>>
>>
>> I doubt this will be any easier to simplify than your first snippet.
>> You still have strided and offset writes of temp3-4 and hi, so AFAIK
>> both copy propagation and register coalesce are going to freak out in
>> both cases.
>
> It should be quite easy to handle this in copy propagation, though --
> allow partial writes (although consider them to write to the whole
> register for simplicity) and only propagate if stride/offset match.
> It's effectively doing the (more difficult) first part of coalescing
> away the MOV, changing the offset and stride of temp3, for you.
>
I don't think it's about it being difficult to implement...  In fact a
couple of weeks ago I changed the copy propagation pass to handle
arbitrary strided copies.  But after some thought I concluded it wasn't
worth sending for review because we probably want to eliminate most
cases of partial writes from the back-end in the long run anyway, so it
seemed like unnecessary complexity.
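
(For reference, the kind of legality check involved would be roughly
the following -- a simplified sketch rather than the actual patch, and
can_propagate_partial_copy() is a made-up helper; the real pass in
brw_fs_copy_propagation.cpp has many more conditions:)

| /* Treat the copy as writing its whole destination register for
|  * dataflow purposes, but only substitute its source into a reader
|  * whose region matches the copy's destination region exactly, so
|  * the substitution doesn't change which channels are read.
|  */
| static bool
| can_propagate_partial_copy(const fs_inst *copy, const fs_inst *reader,
|                            unsigned arg)
| {
|    return reader->src[arg].reg == copy->dst.reg &&
|           reader->src[arg].stride == copy->dst.stride &&
|           reader->src[arg].subreg_offset == copy->dst.subreg_offset;
| }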

>> IMO we want to avoid partial writes as much as possible,
>> and introduce them as late as possible in the optimization loop.  I
>> think it would be easiest if the visitor wouldn't emit any partial
>> writes at all -- even, say, on CHV/BXT with the register alignment
>> limitation on instructions with 64-bit execution types in place.  Later
>> on we could run a register alignment lowering pass that would legalize
>> the strides where they aren't supported natively by the hardware, that
>> way you would:
>>
>>  - Be able to implement these restrictions in one place rather than
>>    having to take them into account anywhere you emit or manipulate
>>    instructions of 64-bit execution type.
>>
>>  - Emit partial writes in cases where you hit some hardware limitation
>>    *only* -- As you're doing it here the SIMD lowering pass will emit
>>    partial writes for all instructions that have a strided source (and
>>    emit additional and most often unnecessary partial writes for
>>    instructions that are already a partial write).  This breaks the
>>    assumption of this pass that the zipping and unzipping copies are
>>    cheap to copy-propagate from, and will definitely increase the
>>    instruction count when this lowering pass is used, regardless of
>>    whether the lowered instructions actually have any of the hardware
>>    limitations you are trying to work around.
>
> I think we can just drop the part that sets the source stride. AFAIK
> there aren't any double operations I'm emitting that need a source
> with a special stride due to HW limitations; I just put it there for
> consistency. I don't think it's actually used, and as you said it's
> probably harmful. We still need to keep it around for the
> destinations, since there we're going to emit partial writes anyways,
> so we might as well emit partial writes that we have a chance to
> coalesce. We could add a special lowering pass for DF->F, since that's
> the only operation that requires special handling wrt strides, but
> that still doesn't solve the coalescing problem with packDouble2x32.

As I said earlier this should probably be simplified by copy
propagation.

> It also won't solve most of the problems you mentioned: when we do
> the lowering to create the partial write, we're still going to emit
> a second MOV, and we're still going to want to coalesce it away
> wherever we can, so copy propagation still needs to understand all
> the stride restrictions -- and those restrictions are where most of
> the fixes were.

But you do realize that it's only because your proposed "fix" lets
copy propagation propagate via partial writes that it would then undo
your lowered strides, which is what forces you to make it aware of
these hardware restrictions in the first place -- otherwise, if you
just implemented them in a lowering pass and then ran copy
propagation, it wouldn't even consider the partial writes used to
legalize the strides as valid propagation candidates AFAIK.
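
To be concrete, such a legalization pass could be as simple as the
following sketch (lower_strides() and required_dst_stride() are
made-up names -- the per-generation rules, e.g. the stride of 2 for
DF->F conversion or the CHV/BXT alignment restrictions on 64-bit
execution types, would live in the latter):

| bool
| fs_visitor::lower_strides()
| {
|    bool progress = false;
|
|    foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
|       const unsigned stride = required_dst_stride(devinfo, inst);
|       if (stride == 0 || inst->dst.stride == stride)
|          continue;
|
|       const fs_builder ibld(this, block, inst);
|
|       /* Redirect the instruction into a strided temporary and copy
|        * the result back to the original packed destination
|        * afterwards.  This is the only place where a partial write
|        * gets introduced.
|        */
|       fs_reg tmp = ibld.vgrf(inst->dst.type, stride);
|       tmp.stride = stride;
|
|       ibld.at(block, inst->next).MOV(inst->dst, tmp);
|       inst->dst = tmp;
|       progress = true;
|    }
|
|    return progress;
| }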

> I didn't have to fix anything to understand that a DF->F move needs to
> have a destination stride of 2, besides this code -- nothing else
> touches it, so the places where I needed to take stride into account
> when manipulating instructions of 64-bit type were here and when
> emitting the special MOV for DF->F. So in the end, adding a stride
> legalization pass would be more code than what I have now, and for not
> much benefit.
>
>>
>> And BTW, this patch doesn't take into account the larger number of VGRFs
>> you would need to hold the temporaries in strided and offset form.
>
> It does for destinations but not sources, since we were already using
> regs_written for destinations and over-allocating anyways.
>
No, it doesn't -- it allocates just enough space for the number of
vector components times the lowered width.
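
(i.e. if the temporaries are going to carry the original stride and
suboffset, their allocation would have to account for them too --
something like this sketch, where alloc_strided_vgrf() is a made-up
helper:)

| /* component_size(width) is MAX2(width * stride, 1) * type_sz(type),
|  * so it accounts for the stride; the suboffset can additionally
|  * push the last component past a register boundary, hence the
|  * extra term.
|  */
| fs_reg
| fs_visitor::alloc_strided_vgrf(const fs_reg &model, unsigned components,
|                                unsigned width)
| {
|    const unsigned size =
|       DIV_ROUND_UP(components * model.component_size(width) +
|                    model.subreg_offset, REG_SIZE);
|
|    fs_reg reg(GRF, alloc.allocate(size), model.type);
|    reg.stride = model.stride;
|    reg.subreg_offset = model.subreg_offset;
|    return reg;
| }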

>>
>>> I should probably put some of this in the commit message...
>>>
>>> Also, I realized that this patch should really be split in two, since
>>> the fixes to register renaming are independent of this -- whoops!
>>>
>>>>
>>>>> +                  split_inst.src[j].stride = inst->src[j].stride;
>>>>>                    emit_transpose(lbld.group(copy_width, 0),
>>>>>                                   split_inst.src[j], &src, 1, src_size, n);
>>>>>                 }
>>>>> @@ -4343,8 +4347,10 @@ fs_visitor::lower_simd_width()
>>>>>                 /* Allocate enough space to hold the result of the lowered
>>>>>                  * instruction and fix up the number of registers written.
>>>>>                  */
>>>>> -               split_inst.dst = dsts[i] =
>>>>> -                  lbld.vgrf(inst->dst.type, dst_size);
>>>>> +               fs_reg dst = lbld.vgrf(inst->dst.type, dst_size);
>>>>> +               dst.stride = inst->dst.stride;
>>>>> +               dst.subreg_offset = inst->dst.subreg_offset;
>>>>> +               split_inst.dst = dsts[i] = dst;
>>>>>                 split_inst.regs_written =
>>>>>                    DIV_ROUND_UP(inst->regs_written * lower_width,
>>>>>                                 inst->exec_size);
>>>>> --
>>>>> 2.4.3
>>>>>