[Mesa-dev] [PATCH 09/12] i965/fs: make SIMD-splitting respect the original stride/offset
Connor Abbott
cwabbott0 at gmail.com
Mon Aug 17 01:33:17 PDT 2015
On Mon, Aug 17, 2015 at 1:27 AM, Connor Abbott <cwabbott0 at gmail.com> wrote:
> On Mon, Aug 17, 2015 at 12:09 AM, Pohjolainen, Topi
> <topi.pohjolainen at intel.com> wrote:
>> On Fri, Aug 14, 2015 at 03:30:18PM -0700, Connor Abbott wrote:
>>> In some cases, we need to emit ALU instructions with a certain stride
>>> due to a HW limitation. When splitting that instruction, we need to
>>> respect the original stride when creating the temporaries we load from
>>> and store into. Otherwise, we'll reintroduce the problem we were trying
>>> to work around.
>>>
>>> Signed-off-by: Connor Abbott <connor.w.abbott at intel.com>
>>> ---
>>> src/mesa/drivers/dri/i965/brw_fs.cpp | 14 ++++++++++----
>>> 1 file changed, 10 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> index 812648f..386e9a2 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> @@ -2370,12 +2370,14 @@ fs_visitor::opt_register_renaming()
>>>
>>> if (depth == 0 &&
>>> inst->dst.file == GRF &&
>>> - alloc.sizes[inst->dst.reg] == inst->exec_size / 8 &&
>>> + alloc.sizes[inst->dst.reg] ==
>>> + inst->dst.component_size(inst->exec_size) &&
>>> !inst->is_partial_write()) {
>>> if (remap[dst] == -1) {
>>> remap[dst] = dst;
>>> } else {
>>> - remap[dst] = alloc.allocate(inst->exec_size / 8);
>>> + remap[dst] =
>>> + alloc.allocate(inst->dst.component_size(inst->exec_size));
>>> inst->dst.reg = remap[dst];
>>> progress = true;
>>> }
>>> @@ -4334,6 +4336,8 @@ fs_visitor::lower_simd_width()
>>> * temporary passed as source to the lowered instruction.
>>> */
>>> split_inst.src[j] = lbld.vgrf(inst->src[j].type, src_size);
>>> + split_inst.src[j].subreg_offset = inst->src[j].subreg_offset;
>>
>> The fixes for the wider component size (64-bits) and the stride are clearly
>> needed for doubles. I'm wondering though why the sub-register offset hasn't
>> caused us any problems before. That change is not needed just for doubles,
>> is it?
>
> I think we haven't noticed this before for two reasons:
>
> 1. It usually isn't needed for correctness.
> 2. We don't really use subreg_offset that much anyways.
>
> The splitting code is moving the inputs into a bunch of temporary
> registers, doing the split operation into a new bunch of temporaries,
> and then moving those temporaries back into the destination register.
> It's copying over the stride and subreg_offset fields from the
> original inputs and outputs, so when all is said and done, it's
> producing the same result and using the same sources. The difference
> is that the split instructions themselves weren't copying over those
> fields, meaning that they were always operating on packed registers
> (stride of 1 and subreg_offset of 0). This is normally ok, except
> where, due to a restriction, the split instruction *must* have the
> same stride/offset as the original instruction. For fp64, DF->F
> conversion must preserve the stride since we have to use a stride of 2
> for the output. The only thing I can think of that has to preserve the
> subreg_offset is math instructions, since the source offset has to be
> the same as the destination offset, but that should be a corner case
> and I'm not sure whether we'll ever get in a situation where a math
> instruction has a non-zero subreg_offset.
Actually, if we don't respect the subreg_offset, then both of the
lowered instructions will have an offset of 0, which is totally
fine... so I guess the subreg_offset part is solely about helping
optimizations (and consistency).
>
> But although I don't think it's necessary for correctness, but I think
> it should help with register coalescing. For example, take
> pack/unpackDouble2x32. I've implemented it in terms of two strided
> MOV's, which need to be split by this pass. Say we have something
> like:
>
> uint hi = unpackDouble2x32(input).y;
>
> in SIMD16 mode, this turns into:
>
> mov(16) hi.1<2>:UD input:DF
>
> But this needs to be split, and after the splitting pass (without
> fixing up the subreg_offsest like this patch does) and lowering away
> all the LOAD_PAYLOAD's, it gets turned into something like:
>
> mov(8) temp1:DF input:DF 1Q
> mov(8) temp2:DF input+1.0:DF 2Q
> mov(8) temp3<2>:UD temp1:DF 1Q
> mov(8) temp4<2>:UD temp2:DF 2Q
> mov(8) hi+0.1<2>:UD temp3<2>:UD WE_all //The WE_all is due to a hack
> in the SIMD lowering code
> mov(8) hi+1.1<2>:UD temp4<2>:UD WE_all
>
> Copy prop should have no problem cleaning up the first two MOV's, but
> the last two are more of an issue -- we currently don't have a pass
> that can detect that you can offset all reads/writes of temp3 and
> temp4 by 1 and coalesce temp3 and temp4 into hi and hi+1, and it would
> be really hard to write something like that without SSA. But if we
> pass through the offset while splitting it up, then things match up:
>
> mov(8) temp1:DF input:DF 1Q
> mov(8) temp2:DF input+1.0:DF 2Q
> mov(8) temp3+0.1<2>:UD temp1:DF 1Q
> mov(8) temp4+0.1<2>:UD temp2:DF 2Q
> mov(8) hi+0.1<2>:UD temp3+0.1<2>:UD WE_all
> mov(8) hi+1.1<2>:UD temp4+0.1<2>:UD WE_all
>
> and we should be able to coalesce away all the extra MOV's.
>
> I should probably put some of this in the commit message...
>
> Also, I realized that this patch should really be split in two, since
> the fixes to register renaming are independent of this -- whoops!
>
>>
>>> + split_inst.src[j].stride = inst->src[j].stride;
>>> emit_transpose(lbld.group(copy_width, 0),
>>> split_inst.src[j], &src, 1, src_size, n);
>>> }
>>> @@ -4343,8 +4347,10 @@ fs_visitor::lower_simd_width()
>>> /* Allocate enough space to hold the result of the lowered
>>> * instruction and fix up the number of registers written.
>>> */
>>> - split_inst.dst = dsts[i] =
>>> - lbld.vgrf(inst->dst.type, dst_size);
>>> + fs_reg dst = lbld.vgrf(inst->dst.type, dst_size);
>>> + dst.stride = inst->dst.stride;
>>> + dst.subreg_offset = inst->dst.subreg_offset;
>>> + split_inst.dst = dsts[i] = dst;
>>> split_inst.regs_written =
>>> DIV_ROUND_UP(inst->regs_written * lower_width,
>>> inst->exec_size);
>>> --
>>> 2.4.3
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
More information about the mesa-dev
mailing list