[Mesa-dev] [PATCH v3 16/24] i965/vec4: fix SIMD-width lowering for VEC4_OPCODE_FROM_DOUBLE in IVB/BYT

Thu Mar 9 07:25:25 UTC 2017

On 08/03/17 20:30, Francisco Jerez wrote:
> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> 
>> On 07/03/17 22:24, Francisco Jerez wrote:
>>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>>>
>>>> On 04/03/17 01:04, Francisco Jerez wrote:
>>>>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>>>>>
>>>>>> From: "Juan A. Suarez Romero" <jasuarez at igalia.com>
>>>>>>
>>>>>> When splitting VEC4_OPCODE_FROM_DOUBLE in Ivybridge/Baytrail, the second
>>>>>> part should use a temporal register, and then move the values to the
>>>>>> second half of the original destination, so we get all the results in the
>>>>>> same register.
>>>>>>
>>>>>> v2:
>>>>>> - Fix typos (Matt).
>>>>>> ---
>>>>>>  src/mesa/drivers/dri/i965/brw_vec4.cpp           | 17 +++++++++++++----
>>>>>>  src/mesa/drivers/dri/i965/brw_vec4_generator.cpp |  1 +
>>>>>>  2 files changed, 14 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>>> index 64b435f3ec4..adcde085305 100644
>>>>>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>>>> @@ -2191,9 +2191,15 @@ vec4_visitor::lower_simd_width()
>>>>>>           linst->group = channel_offset;
>>>>>>           linst->size_written = size_written;
>>>>>>  
>>>>>> +         /* When splitting VEC4_OPCODE_FROM_DOUBLE on Ivybridge, the second part
>>>>>> +          * should use in a temporal register. Later we will move the values
>>>>>> +          * to the second half of the original destination, so we get all the
>>>>>> +          * results in the same register. We use d2f_pass to detect this case.
>>>>>> +          */
>>>>>> +         bool d2f_pass = (inst->opcode == VEC4_OPCODE_FROM_DOUBLE && n > 0);
>>>>>>           /* Compute split dst region */
>>>>>>           dst_reg dst;
>>>>>> -         if (needs_temp) {
>>>>>> +         if (needs_temp || d2f_pass) {
>>>>>>              unsigned num_regs = DIV_ROUND_UP(size_written, REG_SIZE);
>>>>>>              dst = retype(dst_reg(VGRF, alloc.allocate(num_regs)),
>>>>>>                           inst->dst.type);
>>>>>> @@ -2226,9 +2232,12 @@ vec4_visitor::lower_simd_width()
>>>>>>           /* If we used a temporary to store the result of the split
>>>>>>            * instruction, copy the result to the original destination
>>>>>>            */
>>>>>> -         if (needs_temp) {
>>>>>> -            vec4_instruction *mov =
>>>>>> -               MOV(offset(inst->dst, lowered_width, n), src_reg(dst));
>>>>>> +         if (needs_temp || d2f_pass) {
>>>>>> +            vec4_instruction *mov;
>>>>>> +            if (d2f_pass)
>>>>>> +               mov = MOV(horiz_offset(inst->dst, n * type_sz(inst->dst.type)), src_reg(dst));
>>>>>
>>>>> I have no idea how this could possibly work...  horiz_offset() expects a
>>>>> number of scalar components, not bytes.  Anyway I have a hunch this is
>>>>> trying to workaround the bug I pointed out in PATCH 15...
>>>>>
>>>>
>>>> This worked by luck. We want an horizontal offset of 'lowered_width'
>>>> components per split. As inst->dst.type is F, type_sz(inst->dst.type) =
>>>> 4 and we set the maximum 'lowered_width' value to 4, it matches in the
>>>> cases we tested... however, this is a bug.
>>>>
>>>> Actually it works too if we use offset() instead, so this hunk is not
>>>> needed.
>>>>
>>>
>>> This patch as a whole makes no sense to me...  It's working around the
>>> dubious behavior of VEC4_OPCODE_FROM_DOUBLE which doesn't behave like
>>> any other arithmetic instruction of the back-end because it corrupts
>>> data beyond its destination region, which means any reasoning done about
>>> it within the optimizer is suspect of leading to data corruption in the
>>> same way that you got the SIMD lowering pass to inadvertently cause
>>> corruption.  The problem IMO is that you're pretending that something
>>> that doesn't behave like a back-end instruction is an instruction.  I
>>> suggest you split the datatype conversion from the gathering/scattering
>>> of 32-bit fields as separate instructions -- Turns out that you already
>>> have the latter implemented as VEC4_OPCODE_PICK/SET_LOW_32BIT.
>>>
>>> Incidentally doing things that way would allow better performance
>>> because the two operations would become visible to the scheduler and an
>>> EU pipeline stall could potentially be avoided (though performance is by
>>> no means my primary motivation for NAKing this patch, but rather
>>> preserving our sanity in the long run...).
>>>
>>
>> I am wondering if we can just fix the data corruption produced by
>> VEC4_OPCODE_FROM_DOUBLE in the generator, instead of splitting the
>> instruction. This has the advantage that we have the result of the
>> opcode already shrank instead of needing VEC4_OPCODE_PICK_LOW_32BIT, so
>> if any lowering/optimization emits a new VEC4_OPCODE_FROM_DOUBLE, we
>> don't need to remember to emit VEC4_OPCODE_PICK_LOW_32BIT together with it.
>>
> 
> Lowering/optimization passes aren't going to emit a new
> VEC4_OPCODE_FROM_DOUBLE opcode unless one was already there, in which
> case there will certainly be an accompanying VEC4_OPCODE_PICK_LOW_32BIT
> as long as the NIR translation pass is emitting valid code.  You
> shouldn't need to introduce any additional complexity into the optimizer
> in order to fix this.
> 

Actually we are emitting a new VEC4_OPCODE_FROM_DOUBLE opcode in patch
19, so we need to update it too.

If you can confirm the HW behavior mentioned in that patch in the
simulator (I can send you privately an email with a branch and a test
file to check, if you want), we can decide if that was the proper fix or
not.

>> This would be the diff (against v3, not v4):
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
>> b/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
>> index f6034bc..dc5f437 100644
>> --- a/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
>> @@ -1973,7 +1973,9 @@ generate_code(struct brw_codegen *p,
>>
>>           dst.hstride = BRW_HORIZONTAL_STRIDE_1;
>>           dst.width = BRW_WIDTH_8;
>> +         brw_set_default_exec_size(p, cvt(inst->exec_size) - 1);
>>           brw_MOV(p, dst, dst_as_src);
>> +         brw_set_default_exec_size(p, cvt(doubled_exec_size) - 1);
>>
>>           brw_set_default_access_mode(p, BRW_ALIGN_16);
>>           break;
>>
> 
> This is fixing *one* problem but the MOV emitted immediately before that
> hunk will still corrupt data beyond the 32B bounds of the align16 float
> destination region.
> 

I might be missing something obvious but I don't see the data corruption
you mention. When we emit VEC4_OPCODE_FROM_DOUBLE opcode in
emit_conversion_from_double() we do the following:

* Create a temporary destination register of dvec4 size. Retype it to F.

* Use it as destination for VEC4_OPCODE_FROM_DOUBLE opcode. Set the
size_written to indicate we write 2 * REG_SIZE, which is what we
allocated at first step.
   * Internally VEC4_OPCODE_FROM_DOUBLE opcode will spread the
     converted data to whole dvec4 with align1 and we shrink it
     to 32B bounds of the align16 float destination region before
     coming back to align16 mode.
* After emitting VEC4_OPCODE_FROM_DOUBLE opcode, we emit a MOV to copy
the converted value to the original destination register, which keeps
its writemask to avoid corrupting any unmodified data.

The SIMD lowering pass should keep a similar approach when splitting
this opcode (if not, it is a bug) to avoid any data corruption.

Sam

>> However, the EU pipeline stall issue you mention would be still there.
>>
>> Nevertheless, I am OK with your suggestion, no strong opinion against
>> it. So, while waiting for your answer, I am going to implement your
>> suggestion and add it to v4 unless you prefer this one.
>>
>> Sam
>>
>>>> Sam
>>>>
>>>>>> +            else
>>>>>> +               mov = MOV(offset(inst->dst, lowered_width, n), src_reg(dst));
>>>>>>              mov->exec_size = lowered_width;
>>>>>>              mov->group = channel_offset;
>>>>>>              mov->size_written = size_written;
>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp b/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
>>>>>> index 7fa1afc9073..b570792badd 100644
>>>>>> --- a/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
>>>>>> @@ -1532,6 +1532,7 @@ generate_code(struct brw_codegen *p,
>>>>>>               is_ivb_df);
>>>>>>  
>>>>>>        assert(inst->group % 8 == 0 ||
>>>>>> +             (inst->exec_size == 4 && inst->group % 4 == 0) ||
>>>>>>               inst->dst.type == BRW_REGISTER_TYPE_DF ||
>>>>>>               inst->src[0].type == BRW_REGISTER_TYPE_DF ||
>>>>>>               inst->src[1].type == BRW_REGISTER_TYPE_DF ||
>>>>>> -- 
>>>>>> 2.11.0
>>>>>>
>>>>>> _______________________________________________
>>>>>> mesa-dev mailing list
>>>>>> mesa-dev at lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 862 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20170309/68fa85ec/attachment.sig>