[Mesa-dev] [PATCH 09/12] i965/fs: make SIMD-splitting respect the original stride/offset

Francisco Jerez currojerez at riseup.net
Tue Aug 18 10:53:16 PDT 2015

Connor Abbott <cwabbott0 at gmail.com> writes:

> On Tue, Aug 18, 2015 at 2:29 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Connor Abbott <cwabbott0 at gmail.com> writes:
>>> On Mon, Aug 17, 2015 at 2:41 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>> Connor Abbott <cwabbott0 at gmail.com> writes:
>>>>> On Mon, Aug 17, 2015 at 12:09 AM, Pohjolainen, Topi
>>>>> <topi.pohjolainen at intel.com> wrote:
>>>>>> On Fri, Aug 14, 2015 at 03:30:18PM -0700, Connor Abbott wrote:
>>>>>>> In some cases, we need to emit ALU instructions with a certain stride
>>>>>>> due to a HW limitation. When splitting that instruction, we need to
>>>>>>> respect the original stride when creating the temporaries we load from
>>>>>>> and store into. Otherwise, we'll reintroduce the problem we were trying
>>>>>>> to work around.
>>>>>>> Signed-off-by: Connor Abbott <connor.w.abbott at intel.com>
>>>>>>> ---
>>>>>>>  src/mesa/drivers/dri/i965/brw_fs.cpp | 14 ++++++++++----
>>>>>>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>> index 812648f..386e9a2 100644
>>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>> @@ -2370,12 +2370,14 @@ fs_visitor::opt_register_renaming()
>>>>>>>        if (depth == 0 &&
>>>>>>>            inst->dst.file == GRF &&
>>>>>>> -          alloc.sizes[inst->dst.reg] == inst->exec_size / 8 &&
>>>>>>> +          alloc.sizes[inst->dst.reg] ==
>>>>>>> +            inst->dst.component_size(inst->exec_size) &&
>>>>>>>            !inst->is_partial_write()) {
>>>>>>>           if (remap[dst] == -1) {
>>>>>>>              remap[dst] = dst;
>>>>>>>           } else {
>>>>>>> -            remap[dst] = alloc.allocate(inst->exec_size / 8);
>>>>>>> +            remap[dst] =
>>>>>>> +               alloc.allocate(inst->dst.component_size(inst->exec_size));
>>>>>>>              inst->dst.reg = remap[dst];
>>>>>>>              progress = true;
>>>>>>>           }
>>>>>>> @@ -4334,6 +4336,8 @@ fs_visitor::lower_simd_width()
>>>>>>>                     * temporary passed as source to the lowered instruction.
>>>>>>>                     */
>>>>>>>                    split_inst.src[j] = lbld.vgrf(inst->src[j].type, src_size);
>>>>>>> +                  split_inst.src[j].subreg_offset = inst->src[j].subreg_offset;
>>>>>> The fixes for the wider component size (64-bits) and the stride are clearly
>>>>>> needed for doubles. I'm wondering though why the sub-register offset hasn't
>>>>>> caused us any problems before. That change is not needed just for doubles,
>>>>>> is it?
>>>>> I think we haven't noticed this before for two reasons:
>>>>> 1. It usually isn't needed for correctness.
>>>>> 2. We don't really use subreg_offset that much anyways.
>>>>> The splitting code is moving the inputs into a bunch of temporary
>>>>> registers, doing the split operation into a new bunch of temporaries,
>>>>> and then moving those temporaries back into the destination register.
>>>>> It's copying over the stride and subreg_offset fields from the
>>>>> original inputs and outputs, so when all is said and done, it's
>>>>> producing the same result and using the same sources. The difference
>>>>> is that the split instructions themselves weren't copying over those
>>>>> fields, meaning that they were always operating on packed registers
>>>>> (stride of 1 and subreg_offset of 0). This is normally ok, except
>>>>> where, due to a restriction, the split instruction *must* have the
>>>>> same stride/offset as the original instruction. For fp64, DF->F
>>>>> conversion must preserve the stride since we have to use a stride of 2
>>>>> for the output. The only thing I can think of that has to preserve the
>>>>> subreg_offset is math instructions, since the source offset has to be
>>>>> the same as the destination offset, but that should be a corner case
>>>>> and I'm not sure whether we'll ever get in a situation where a math
>>>>> instruction has a non-zero subreg_offset.
>>>> The SIMD lowering pass always uses a suboffset of zero for all
>>>> non-scalar sources of the lowered instructions.  I'm not aware of any
>>>> case in which that could be a problem.
>>>>> But although I don't think it's necessary for correctness, but I think
>>>>> it should help with register coalescing. For example, take
>>>>> pack/unpackDouble2x32. I've implemented it in terms of two strided
>>>>> MOV's, which need to be split by this pass. Say we have something
>>>>> like:
>>>>> uint hi = unpackDouble2x32(input).y;
>>>>> in SIMD16 mode, this turns into:
>>>>> mov(16) hi.1<2>:UD input:DF
>>>> I don't think this is a valid implementation of unpackDouble2x32.
>>>> The partial write and offset destination shouldn't be necessary and as
>>>> you've already noticed it will prevent a number of optimization passes
>>>> From doing anything useful with your instruction.
>>> Right, I had my example backwards. It's packDouble2x32 that you have
>>> to emit a MOV like that for. It would be something like:
>>> mov(16) output.1<2>:UD input_y:UD
>> And how exactly does it help the optimizer to introduce additional
>> partial writes?  AFAICT this should already be simplified without any
>> changes to the current copy propagation pass:
>> | mov(8) temp1:UD input:UD 1Q
>> | mov(8) temp2:UD input+1.0:UD 2Q
>> | mov(8) temp3<1>:UD temp1:UD 1Q
>> | mov(8) temp4<1>:UD temp2:UD 2Q
>> | mov(8) output+0.1<2>:UD temp3:UD WE_all
>> | mov(8) output+1.1<2>:UD temp4:UD WE_all
>> into:
>> | mov(8) output+0.1<2>:UD input:UD WE_all
>> | mov(8) output+1.1<2>:UD input+1.0:UD WE_all
> I think you're right, since we happen to be dealing with MOV's that
> can be coalesed away.
>> However what you seem to be suggesting:
>> | mov(8) temp1:UD input:UD 1Q
>> | mov(8) temp2:UD input+1.0:UD 2Q
>> | mov(8) temp3+0.1<2>:UD temp1:UD 1Q
>> | mov(8) temp4+0.1<2>:UD temp2:UD 2Q
>> | mov(8) output+0.1<2>:UD temp3+0.1<2>:UD WE_all
>> | mov(8) output+1.1<2>:UD temp4+0.1<2>:UD WE_all
>> Is unlikely to be simplified further than:
>> | mov(8) temp3+0.1<2>:UD input:UD 1Q
>> | mov(8) temp4+0.1<2>:UD input+1.0:UD 2Q
>> | mov(8) output+0.1<2>:UD temp3+0.1<2>:UD WE_all
>> | mov(8) output+1.1<2>:UD temp4+0.1<2>:UD WE_all
>> without introducing a whole new class of copy propagation via partial
>> writes, which is going to be dubiously useful in the long run if we're
>> planning to migrate the compiler back-end to SSA eventually.
>> I think ideally you would implement packDouble2x32 using an SSA-frendly
>> pseudo-instruction, like:
>> | PACK(16) dst:UQ src0:UD src1:UD
>> It would be cool if it could take as many sources as required to fill
>> the destination type, e.g. it would be useful for the image_load_store
>> to be able to do:
>> | PACK(16) dst:UD src0:UB src1:UB src2:UB src3:UB
>> It could probably replace the similar but less general VEC4 instruction.
>> This is currently handled in the image packing and unpacking code using
>> bitwise arithmetic precisely to avoid partial writes.
> Yeah, we'll definitely need something like that when we get to
> SSA-ifying things. I don't see any particular benefits that it gives
> us until then, though, and you haven't pointed out any. There are so
> many things we want/need to change with the move to SSA:
> - Fixing up the frontend to do predicated things with select
> instructions, and coalescing those away.
> - Doing SSA with flag registers, so that we can more freely move CMP
> instructions around and take advantage of all the flag registers
> instead of hardcoding one for discard.
> - Having the type, quarter control, and exec_all be on the SSA
> destination and validating that things are consistent so that figuring
> out interference is easier and we can have a LOAD_PAYLOAD that has
> different sources with different quarter controls.
> - Handling strides like what you suggest, where there are no partial
> writes and no strides and perhaps only limited partial reads through
> retyping sources to a smaller type (not clear whether having this as a
> dedicated opcode is easier), and there's a separate pass that figures
> out what the stride should be based on HW restrictions.
> - Doing SSA-based register allocation and somehow dealing with
> incompatible exec_mask's, and as a consequence, actually doing real
> spilling based on register pressure.
> - Probably something I forgot.
> and so many of them are interconnected and won't even make sense until
> we have SSA, that it's not always useful to think about the IR we want
> to have in the distant future in order to decide what to do today.
> We've talked about scrapping FS and completely re-writing it, and even
> if we don't, there's going to be a *massive* amount of churn. I think
> you've managed to convince me that having a separate lowering pass
> that deals with the DF->F workaround and dropping this patch is
> useful, since it avoids having to do copy propagation with arbitrary
> strides/offsets, but unless there's some other practical benefit, I'm
> not inclined to add more stuff that will only be useful for SSA. If
> the SSA-based thing is going to be useful today, then sure, let's do
> it, but otherwise it's just adding dead code.

I thought that making the FS back-end more SSA-friendly incrementally
was the main point of e.g. the LOAD_PAYLOAD instruction.  Such a PACK
instruction would be immediately useful for the image load/store
implementation, for your implementation of the double-packing built-ins,
for the FS implementation of some other GLSL packing built-ins, and for
the VEC4 back-end (in fact the VEC4 back-end already has a similar
instruction but it's somewhat less general than what I described).
Regardless of the back-end IR being SSA-form or not it would likely make
things easier to CSE than a sequence of partial writes.

Anyway I wasn't suggesting that you should necessarily re-implement your
DF packing built-ins in terms of such an instruction, it's just
something that has been on my wishlist for quite a while and I was
hoping you'd like the idea enough to go ahead and implement it. ;)

>>>> I think you'd want something like (pre-SIMD lowering):
>>>> mov(16) lo<1>:UD input.0<16;8,2>:UD
>>>> mov(16) hi<1>:UD input.1<16;8,2>:UD
>>>>> But this needs to be split, and after the splitting pass (without
>>>>> fixing up the subreg_offsest like this patch does) and lowering away
>>>>> all the LOAD_PAYLOAD's, it gets turned into something like:
>>>>> mov(8) temp1:DF input:DF 1Q
>>>>> mov(8) temp2:DF input+1.0:DF 2Q
>>>>> mov(8) temp3<2>:UD temp1:DF 1Q
>>>>> mov(8) temp4<2>:UD temp2:DF 2Q
>>>>> mov(8) hi+0.1<2>:UD temp3<2>:UD WE_all //The WE_all is due to a hack
>>>>> in the SIMD lowering code
>>>>> mov(8) hi+1.1<2>:UD temp4<2>:UD WE_all
>>>>> Copy prop should have no problem cleaning up the first two MOV's, but
>>>>> the last two are more of an issue -- we currently don't have a pass
>>>>> that can detect that you can offset all reads/writes of temp3 and
>>>>> temp4 by 1 and coalesce temp3 and temp4 into hi and hi+1, and it would
>>>>> be really hard to write something like that without SSA. But if we
>>>>> pass through the offset while splitting it up, then things match up:
>>>>> mov(8) temp1:DF input:DF 1Q
>>>>> mov(8) temp2:DF input+1.0:DF 2Q
>>>>> mov(8) temp3+0.1<2>:UD temp1:DF 1Q
>>>>> mov(8) temp4+0.1<2>:UD temp2:DF 2Q
>>>>> mov(8) hi+0.1<2>:UD temp3+0.1<2>:UD WE_all
>>>>> mov(8) hi+1.1<2>:UD temp4+0.1<2>:UD WE_all
>>>>> and we should be able to coalesce away all the extra MOV's.
>>>> I doubt this will be any easier to simplify than your first snippet.
>>>> You still have strided and offset writes of temp3-4 and hi, so AFAIK
>>>> both copy propagation and register coalesce are going to freak out in
>>>> both cases.
>>> It should be quite easy to handle this in copy propagation, though --
>>> allow partial writes (although consider them to write to the whole
>>> register for simplicity) and only propagate if stride/offset match.
>>> It's effectively doing the (more difficult) first part of coalescing
>>> away the MOV, changing the offset and stride of temp3, for you.
>> I don't think it's about it being difficult to implement...  In fact a
>> couple of weeks ago I changed the copy propagation pass to handle
>> arbitrary strided copies.  But after some thought I concluded it wasn't
>> worth sending for review because we probably want to eliminate most
>> cases of partial writes from the back-end in the long run anyway, so it
>> seemed like unnecessary complexity.
>>>> IMO we want to avoid partial writes as much as possible,
>>>> and introduce them as late as possible in the optimization loop.  I
>>>> think it would be easiest if the visitor wouldn't emit any partial
>>>> writes at all -- even, say, on CHV/BXT with the register alignment
>>>> limitation on instructions with 64-bit execution types in place.  Later
>>>> on we could run a register alignment lowering pass that would legalize
>>>> the strides where they aren't supported natively by the hardware, that
>>>> way you would:
>>>>  - Be able to implement these restrictions in one place rather than
>>>>    having to take them into account anywhere you emit or manipulate
>>>>    instructions of 64-bit execution type.
>>>>  - Emit partial writes in cases where you hit some hardware limitation
>>>>    *only* -- As you're doing it here the SIMD lowering pass will emit
>>>>    partial writes for all instructions that have a strided source (and
>>>>    emit additional and most often unnecessary partial writes for
>>>>    instructions that are already a partial write).  This breaks the
>>>>    assumption of this pass that the zipping and unzipping copies are
>>>>    cheap to copy-propagate from, and will definitely increase the
>>>>    instruction count when this lowering pass is used, regardless of
>>>>    whether the lowered instructions actually have any of the hardware
>>>>    limitations you are trying to work around.
>>> I think we can just drop the part that sets the source stride. AFAIK
>>> there aren't any double operations I'm emitting that need a source
>>> with a special stride due to HW limitations; I just put it there for
>>> consistency. I don't think it's actually used, and as you said it's
>>> probably harmful. We still need to keep it around for the
>>> destinations, since there we're going to emit partial writes anyways,
>>> so we might as well emit partial writes that we have a chance to
>>> coalesce. We could add a special lowering pass for DF->F, since that's
>>> the only operation that requires special handling wrt strides, but
>>> that still doesn't solve the coalescing problem with packDouble2x32.
>> As I said earlier this should probably be simplified by copy
>> propagation.
>>> It also won't solve most of the problems you mentioned, since when we
>>> do the lowering to create the partial write, we're still going to emit
>>> a second MOV, and we're still going to want to coalesce it away for
>>> everything where we can, so copy propagation still needs to understand
>>> all the stride restrictions, and those restrictions are where most of
>>> the fixes were.
>> But you do realize that it's only because of your proposed "fix" for
>> copy propagation to propagate via partial writes that it's then going to
>> un-do your lowered strides what then brings you to have to make it aware
>> of these hardware restrictions -- Otherwise if you just implemented them
>> in a lowering pass and then ran copy propagation it wouldn't even
>> consider the partial writes used to legalize the strides as valid
>> propagation candidates AFAIK.
> No, we still need to make copy progagation aware of these
> restrictions, even with the lowering pass. For example, say you have
> something like:
> MOV(8) temp:F input:DF
> SQRT(8) out1:F temp:F //some math instruction that uses temp
> ADD(8) out2:F temp:F foo:F //some normal instruction that uses temp
> This is going to get legalized to:
> MOV(8) temp2<2>:F input:DF
> MOV(8) temp:F temp2<2>:F
> SQRT(8) out1:F temp:F //some math instruction that uses temp
> ADD(8) out2:F temp:F foo:F //some normal instruction that uses temp
> Now, we want to copy-propagate into the normal instruction but not the
> math instruction. Either we could make copy-propagation aware of this,
> like my patch does, or we could add yet another pass that legalizes
> math instructions' strides (and sometimes produces code that could be
> CSE'd). But you can't do both in the same pass, since it would then
> fight with copy propagation. Given the two options, fixing copy
> propagation seems easier to me.

Ah, right, I see why it would be useful to fix the copy propagation pass
in addition.  I guess that to avoid duplicating the logic between copy
propagation and the stride lowering pass it would make sense to
implement your can_take_stride() helper as an instruction query kind of
like backend_instruction::can_do_source_mods().

>>> I didn't have to fix anything to understand that a DF->F move needs to
>>> have a destination stride of 2, besides this code -- nothing else
>>> touches it, so the places where I needed to take stride into account
>>> when manipulating instructions of 64-bit type were here and when
>>> emitting the special MOV for DF->F. So in the end, adding a stride
>>> legalization pass would be more code than what I have now, and for not
>>> much benefit.
>>>> And BTW, this patch doesn't take into account the larger number of VGRFs
>>>> you would need to hold the temporaries in strided and offset form.
>>> It does for destinations but not sources, since we were already using
>>> regs_written for destinations and over-allocating anyways.
>> No, it's not, it allocates just enough space for the number of vector
>> components times the lowered width.
>>>>> I should probably put some of this in the commit message...
>>>>> Also, I realized that this patch should really be split in two, since
>>>>> the fixes to register renaming are independent of this -- whoops!
>>>>>>> +                  split_inst.src[j].stride = inst->src[j].stride;
>>>>>>>                    emit_transpose(lbld.group(copy_width, 0),
>>>>>>>                                   split_inst.src[j], &src, 1, src_size, n);
>>>>>>>                 }
>>>>>>> @@ -4343,8 +4347,10 @@ fs_visitor::lower_simd_width()
>>>>>>>                 /* Allocate enough space to hold the result of the lowered
>>>>>>>                  * instruction and fix up the number of registers written.
>>>>>>>                  */
>>>>>>> -               split_inst.dst = dsts[i] =
>>>>>>> -                  lbld.vgrf(inst->dst.type, dst_size);
>>>>>>> +               fs_reg dst = lbld.vgrf(inst->dst.type, dst_size);
>>>>>>> +               dst.stride = inst->dst.stride;
>>>>>>> +               dst.subreg_offset = inst->dst.subreg_offset;
>>>>>>> +               split_inst.dst = dsts[i] = dst;
>>>>>>>                 split_inst.regs_written =
>>>>>>>                    DIV_ROUND_UP(inst->regs_written * lower_width,
>>>>>>>                                 inst->exec_size);
>>>>>>> --
>>>>>>> 2.4.3
>>>>>>> _______________________________________________
>>>>>>> mesa-dev mailing list
>>>>>>> mesa-dev at lists.freedesktop.org
>>>>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>> _______________________________________________
>>>>> mesa-dev mailing list
>>>>> mesa-dev at lists.freedesktop.org
>>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150818/eebd26a1/attachment-0001.sig>

More information about the mesa-dev mailing list