[Mesa-dev] [PATCH v3 16/24] i965/vec4: fix SIMD-width lowering for VEC4_OPCODE_FROM_DOUBLE in IVB/BYT

Tue Mar 7 21:24:34 UTC 2017

Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:

> On 04/03/17 01:04, Francisco Jerez wrote:
>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>> 
>>> From: "Juan A. Suarez Romero" <jasuarez at igalia.com>
>>>
>>> When splitting VEC4_OPCODE_FROM_DOUBLE in Ivybridge/Baytrail, the second
>>> part should use a temporal register, and then move the values to the
>>> second half of the original destination, so we get all the results in the
>>> same register.
>>>
>>> v2:
>>> - Fix typos (Matt).
>>> ---
>>>  src/mesa/drivers/dri/i965/brw_vec4.cpp           | 17 +++++++++++++----
>>>  src/mesa/drivers/dri/i965/brw_vec4_generator.cpp |  1 +
>>>  2 files changed, 14 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> index 64b435f3ec4..adcde085305 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>> @@ -2191,9 +2191,15 @@ vec4_visitor::lower_simd_width()
>>>           linst->group = channel_offset;
>>>           linst->size_written = size_written;
>>>  
>>> +         /* When splitting VEC4_OPCODE_FROM_DOUBLE on Ivybridge, the second part
>>> +          * should use in a temporal register. Later we will move the values
>>> +          * to the second half of the original destination, so we get all the
>>> +          * results in the same register. We use d2f_pass to detect this case.
>>> +          */
>>> +         bool d2f_pass = (inst->opcode == VEC4_OPCODE_FROM_DOUBLE && n > 0);
>>>           /* Compute split dst region */
>>>           dst_reg dst;
>>> -         if (needs_temp) {
>>> +         if (needs_temp || d2f_pass) {
>>>              unsigned num_regs = DIV_ROUND_UP(size_written, REG_SIZE);
>>>              dst = retype(dst_reg(VGRF, alloc.allocate(num_regs)),
>>>                           inst->dst.type);
>>> @@ -2226,9 +2232,12 @@ vec4_visitor::lower_simd_width()
>>>           /* If we used a temporary to store the result of the split
>>>            * instruction, copy the result to the original destination
>>>            */
>>> -         if (needs_temp) {
>>> -            vec4_instruction *mov =
>>> -               MOV(offset(inst->dst, lowered_width, n), src_reg(dst));
>>> +         if (needs_temp || d2f_pass) {
>>> +            vec4_instruction *mov;
>>> +            if (d2f_pass)
>>> +               mov = MOV(horiz_offset(inst->dst, n * type_sz(inst->dst.type)), src_reg(dst));
>> 
>> I have no idea how this could possibly work...  horiz_offset() expects a
>> number of scalar components, not bytes.  Anyway I have a hunch this is
>> trying to workaround the bug I pointed out in PATCH 15...
>> 
>
> This worked by luck. We want an horizontal offset of 'lowered_width'
> components per split. As inst->dst.type is F, type_sz(inst->dst.type) =
> 4 and we set the maximum 'lowered_width' value to 4, it matches in the
> cases we tested... however, this is a bug.
>
> Actually it works too if we use offset() instead, so this hunk is not
> needed.
>

This patch as a whole makes no sense to me...  It's working around the
dubious behavior of VEC4_OPCODE_FROM_DOUBLE which doesn't behave like
any other arithmetic instruction of the back-end because it corrupts
data beyond its destination region, which means any reasoning done about
it within the optimizer is suspect of leading to data corruption in the
same way that you got the SIMD lowering pass to inadvertently cause
corruption.  The problem IMO is that you're pretending that something
that doesn't behave like a back-end instruction is an instruction.  I
suggest you split the datatype conversion from the gathering/scattering
of 32-bit fields as separate instructions -- Turns out that you already
have the latter implemented as VEC4_OPCODE_PICK/SET_LOW_32BIT.

Incidentally doing things that way would allow better performance
because the two operations would become visible to the scheduler and an
EU pipeline stall could potentially be avoided (though performance is by
no means my primary motivation for NAKing this patch, but rather
preserving our sanity in the long run...).

> Sam
>
>>> +            else
>>> +               mov = MOV(offset(inst->dst, lowered_width, n), src_reg(dst));
>>>              mov->exec_size = lowered_width;
>>>              mov->group = channel_offset;
>>>              mov->size_written = size_written;
>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp b/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
>>> index 7fa1afc9073..b570792badd 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp
>>> @@ -1532,6 +1532,7 @@ generate_code(struct brw_codegen *p,
>>>               is_ivb_df);
>>>  
>>>        assert(inst->group % 8 == 0 ||
>>> +             (inst->exec_size == 4 && inst->group % 4 == 0) ||
>>>               inst->dst.type == BRW_REGISTER_TYPE_DF ||
>>>               inst->src[0].type == BRW_REGISTER_TYPE_DF ||
>>>               inst->src[1].type == BRW_REGISTER_TYPE_DF ||
>>> -- 
>>> 2.11.0
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20170307/da695e83/attachment.sig>