[Mesa-dev] [PATCH v3 16/24] i965/vec4: fix SIMD-width lowering for VEC4_OPCODE_FROM_DOUBLE in IVB/BYT

Wed Mar 8 19:30:21 UTC 2017

Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:

> On 07/03/17 22:24, Francisco Jerez wrote:
>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>> 
>>> On 04/03/17 01:04, Francisco Jerez wrote:
>>>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>>>>
>>>>> From: "Juan A. Suarez Romero" <jasuarez at igalia.com>
>>>>>
>>>>> When splitting VEC4_OPCODE_FROM_DOUBLE in Ivybridge/Baytrail, the second
>>>>> part should use a temporal register, and then move the values to the
>>>>> second half of the original destination, so we get all the results in the
>>>>> same register.
>>>>>
>>>>> v2:
>>>>> - Fix typos (Matt).
>>>>> ---
>>>>>  src/mesa/drivers/dri/i965/brw_vec4.cpp           | 17 +++++++++++++----
>>>>>  src/mesa/drivers/dri/i965/brw_vec4_generator.cpp |  1 +
>>>>>  2 files changed, 14 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>> index 64b435f3ec4..adcde085305 100644
>>>>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>> @@ -2191,9 +2191,15 @@ vec4_visitor::lower_simd_width()
>>>>>           linst->group = channel_offset;
>>>>>           linst->size_written = size_written;
>>>>>  
>>>>> +         /* When splitting VEC4_OPCODE_FROM_DOUBLE on Ivybridge, the second part
>>>>> +          * should use in a temporal register. Later we will move the values
>>>>> +          * to the second half of the original destination, so we get all the
>>>>> +          * results in the same register. We use d2f_pass to detect this case.
>>>>> +          */
>>>>> +         bool d2f_pass = (inst->opcode == VEC4_OPCODE_FROM_DOUBLE && n > 0);
>>>>>           /* Compute split dst region */
>>>>>           dst_reg dst;
>>>>> -         if (needs_temp) {
>>>>> +         if (needs_temp || d2f_pass) {
>>>>>              unsigned num_regs = DIV_ROUND_UP(size_written, REG_SIZE);
>>>>>              dst = retype(dst_reg(VGRF, alloc.allocate(num_regs)),
>>>>>                           inst->dst.type);
>>>>> @@ -2226,9 +2232,12 @@ vec4_visitor::lower_simd_width()
>>>>>           /* If we used a temporary to store the result of the split
>>>>>            * instruction, copy the result to the original destination
>>>>>            */
>>>>> -         if (needs_temp) {
>>>>> -            vec4_instruction *mov =
>>>>> -               MOV(offset(inst->dst, lowered_width, n), src_reg(dst));
>>>>> +         if (needs_temp || d2f_pass) {
>>>>> +            vec4_instruction *mov;
>>>>> +            if (d2f_pass)
>>>>> +               mov = MOV(horiz_offset(inst->dst, n * type_sz(inst->dst.type)), src_reg(dst));
>>>>
>>>> I have no idea how this could possibly work...  horiz_offset() expects a
>>>> number of scalar components, not bytes.  Anyway I have a hunch this is
>>>> trying to workaround the bug I pointed out in PATCH 15...
>>>>
>>>
>>> This worked by luck. We want an horizontal offset of 'lowered_width'
>>> components per split. As inst->dst.type is F, type_sz(inst->dst.type) =
>>> 4 and we set the maximum 'lowered_width' value to 4, it matches in the
>>> cases we tested... however, this is a bug.
>>>
>>> Actually it works too if we use offset() instead, so this hunk is not
>>> needed.
>>>
>> 
>> This patch as a whole makes no sense to me...  It's working around the
>> dubious behavior of VEC4_OPCODE_FROM_DOUBLE which doesn't behave like
>> any other arithmetic instruction of the back-end because it corrupts
>> data beyond its destination region, which means any reasoning done about
>> it within the optimizer is suspect of leading to data corruption in the
>> same way that you got the SIMD lowering pass to inadvertently cause
>> corruption.  The problem IMO is that you're pretending that something
>> that doesn't behave like a back-end instruction is an instruction.  I
>> suggest you split the datatype conversion from the gathering/scattering
>> of 32-bit fields as separate instructions -- Turns out that you already
>> have the latter implemented as VEC4_OPCODE_PICK/SET_LOW_32BIT.
>> 
>> Incidentally doing things that way would allow better performance
>> because the two operations would become visible to the scheduler and an
>> EU pipeline stall could potentially be avoided (though performance is by
>> no means my primary motivation for NAKing this patch, but rather
>> preserving our sanity in the long run...).
>> 
>
> I am wondering if we can just fix the data corruption produced by
> VEC4_OPCODE_FROM_DOUBLE in the generator, instead of splitting the
> instruction. This has the advantage that we have the result of the
> opcode already shrank instead of needing VEC4_OPCODE_PICK_LOW_32BIT, so
> if any lowering/optimization emits a new VEC4_OPCODE_FROM_DOUBLE, we
> don't need to remember to emit VEC4_OPCODE_PICK_LOW_32BIT together with it.
>

Lowering/optimization passes aren't going to emit a new
VEC4_OPCODE_FROM_DOUBLE opcode unless one was already there, in which
case there will certainly be an accompanying VEC4_OPCODE_PICK_LOW_32BIT
as long as the NIR translation pass is emitting valid code.  You
shouldn't need to introduce any additional complexity into the optimizer
in order to fix this.

> This would be the diff (against v3, not v4):
>
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
> b/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
> index f6034bc..dc5f437 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
> @@ -1973,7 +1973,9 @@ generate_code(struct brw_codegen *p,
>
>           dst.hstride = BRW_HORIZONTAL_STRIDE_1;
>           dst.width = BRW_WIDTH_8;
> +         brw_set_default_exec_size(p, cvt(inst->exec_size) - 1);
>           brw_MOV(p, dst, dst_as_src);
> +         brw_set_default_exec_size(p, cvt(doubled_exec_size) - 1);
>
>           brw_set_default_access_mode(p, BRW_ALIGN_16);
>           break;
>

This is fixing *one* problem but the MOV emitted immediately before that
hunk will still corrupt data beyond the 32B bounds of the align16 float
destination region.

> However, the EU pipeline stall issue you mention would be still there.
>
> Nevertheless, I am OK with your suggestion, no strong opinion against
> it. So, while waiting for your answer, I am going to implement your
> suggestion and add it to v4 unless you prefer this one.
>
> Sam
>
>>> Sam
>>>
>>>>> +            else
>>>>> +               mov = MOV(offset(inst->dst, lowered_width, n), src_reg(dst));
>>>>>              mov->exec_size = lowered_width;
>>>>>              mov->group = channel_offset;
>>>>>              mov->size_written = size_written;
>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp b/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
>>>>> index 7fa1afc9073..b570792badd 100644
>>>>> --- a/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
>>>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
>>>>> @@ -1532,6 +1532,7 @@ generate_code(struct brw_codegen *p,
>>>>>               is_ivb_df);
>>>>>  
>>>>>        assert(inst->group % 8 == 0 ||
>>>>> +             (inst->exec_size == 4 && inst->group % 4 == 0) ||
>>>>>               inst->dst.type == BRW_REGISTER_TYPE_DF ||
>>>>>               inst->src[0].type == BRW_REGISTER_TYPE_DF ||
>>>>>               inst->src[1].type == BRW_REGISTER_TYPE_DF ||
>>>>> -- 
>>>>> 2.11.0
>>>>>
>>>>> _______________________________________________
>>>>> mesa-dev mailing list
>>>>> mesa-dev at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20170308/e1f59d47/attachment.sig>