[Mesa-dev] [PATCH 7/9] i965/vec4: avoid dependency control around Align1 instructions

Thu Nov 19 11:07:48 PST 2015

On Thu, Nov 19, 2015 at 1:54 PM, Matt Turner <mattst88 at gmail.com> wrote:
> On Thu, Nov 19, 2015 at 7:31 AM, Connor Abbott <cwabbott0 at gmail.com> wrote:
>> On Thu, Nov 19, 2015 at 6:40 AM, Matt Turner <mattst88 at gmail.com> wrote:
>>> On Thu, Nov 19, 2015 at 2:05 AM, Iago Toral Quiroga <itoral at igalia.com> wrote:
>>>> From: Connor Abbott <connor.w.abbott at intel.com>
>>>>
>>>> It appears that not only math instructions, but also MOV_BYTES or
>>>> any instruction that uses Align1 mode cannot be in the middle
>>>> of a dependency control sequence or the GPU will hang (at least on my
>>>> BDW). This fixes GPU hangs in some fp64 tests.
>>>
>>> I'm pretty surprised by this assessment. Doubtful even.
>>>
>>>> Reviewed-by: Iago Toral Quiroga <itoral at igalia.com>
>>>> ---
>>>>  src/mesa/drivers/dri/i965/brw_vec4.cpp | 17 ++++++++++++-----
>>>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>> index 3bcd5cb..bc0a33b 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>>>> @@ -838,6 +838,17 @@ vec4_visitor::is_dep_ctrl_unsafe(const vec4_instruction *inst)
>>>>     }
>>>>
>>>>     /*
>>>> +    * Instructions that use Align1 mode cause the GPU to hang when inserted
>>>> +    * between a NoDDClr and NoDDChk in Align16 mode. Discovered empirically.
>>>> +    */
>>>> +
>>>> +   if (inst->opcode == VEC4_OPCODE_PACK_BYTES ||
>>>> +       inst->opcode == VEC4_OPCODE_MOV_BYTES ||
>>>
>>> PACK_BYTES sets depctrl itself in the generator, and at the time I
>>> added it I made a test that did
>>>
>>>   vec4 foo = vec4(packUnorm4x8(...),
>>>                   packUnorm4x8(...),
>>>                   packUnorm4x8(...),
>>>                   packUnorm4x8(...))
>>>
>>> and confirmed that it set depctrl properly on the whole sequence.
>>> There could of course be bugs somewhere, but the "hardware doesn't
>>> work if you mix align1 and align16 depctrl" seems unlikely.
>>>
>>> Do you know of a test that this affects?
>>
>> This only affects FP64 tests, since there we use an align1 mov to do
>> double-to-float and float-to-double. However, I tried commenting out
>> emit_nir_code() and just doing essentially:
>>
>> emit(MOV(...))->force_writemask_all = true;
>> emit(VEC4_OPCODE_PACK_BYTES, ...);
>> emit(MOV(...))->force_writemask_all = true;
>>
>> and on my BDW it hung. In case it's not clear: this isn't about
>> setting depctrl on the instruction itself, it just can't be inside of
>> a depctrl sequence (which we were already disallowing for math
>> instructions anyways).
>
> Very weird. I'll take a look. So I understand, are the MOV
> instructions writing different channels of the same register? And
> VEC4_OPCODE_PACK_BYTES is writing to a different or the same register
> as the MOVs? (I saw your fixup reply)

Actually, I had both MOVs writing the same destination, so the second
overwrote the first. The PACK_BYTES/MOV_BYTES/F2D/D2F instructions (I
think I tested all of them, not sure) were operating on completely
different registers, and the same was true in the FP64 test that
actually hung the GPU. I'll use d2f as the example since it's simpler
and I remember what the operands are (it's just an align1 mov with a
dest stride of 2). The test code was something like:

mov g50, g51 { no_dd_clear }
d2f g52, g54
mov g50, g53 { no_dd_check }

and changing the d2f to a normal align16 mov or commenting it out
prevented the hang. It would be interesting to see if a math
instruction instead of d2f also hangs.
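
To make the failing pattern concrete, here is a minimal standalone
sketch of a check that flags an Align1 instruction sitting inside an
open NoDDClr/NoDDChk sequence. It is not Mesa code: the Inst struct,
the AccessMode enum and has_align1_inside_dep_ctrl() are hypothetical
names invented for illustration, and the model assumes a sequence opens
at any instruction carrying NoDDClr and closes at one carrying only
NoDDChk.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified instruction model (not Mesa's vec4_instruction).
enum class AccessMode { Align1, Align16 };

struct Inst {
    AccessMode mode;
    bool no_dd_clear;  // NoDDClr hint set on this instruction
    bool no_dd_check;  // NoDDChk hint set on this instruction
};

// Returns true if any Align1 instruction is encountered while a
// dependency-control sequence is open. Assumption: a sequence opens at an
// instruction with NoDDClr and closes at one with NoDDChk but no NoDDClr.
bool has_align1_inside_dep_ctrl(const std::vector<Inst> &insts)
{
    bool open = false;
    for (const Inst &inst : insts) {
        if (open && inst.mode == AccessMode::Align1)
            return true;
        if (inst.no_dd_clear)
            open = true;
        else if (inst.no_dd_check)
            open = false;
    }
    return false;
}

// Usage: the three-instruction test from this mail.
void check_example()
{
    const std::vector<Inst> seq = {
        {AccessMode::Align16, true,  false},  // mov g50, g51 { no_dd_clear }
        {AccessMode::Align1,  false, false},  // d2f g52, g54
        {AccessMode::Align16, false, true},   // mov g50, g53 { no_dd_check }
    };
    assert(has_align1_inside_dep_ctrl(seq));
}
```

Swapping the middle instruction for an Align16 one makes the check
return false, which matches the observation that a normal align16 mov
in that slot did not hang.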

>
> By the way, the math code is too heavy handed as far as I know. The
> BDW+ docs say that the MATH instruction itself cannot take dependency
> control hints (and empirically earlier platforms seem to have problems
> with this as well, see
> tests/shaders/dependency-hints/exp2.shader_test) -- nothing about a
> math instruction being in the middle of a NoDDC* block. The person who
> implemented the math did the minimal amount of work to fix the
> problem.
>
> The PRM also says:
>
> """
> Instructions other than send, may use this control as long as
> operations that have different pipeline latencies are not mixed. The
> operations that have longer latencies are:
>
> Opcodes pln, lrp, dp*.
> Operations involving double precision computation.
> Integer DW multiplication where both source operands are DWs.
> """
>
> I would say that mixing a double-precision operation and something
> else might cause problems, but that seems like we have a different
> problem thus far.

Yeah, these are all just mov's so I would expect that section to
apply. It still seems like we're not taking it into account, though...
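
For reference, a rough sketch of what taking that PRM rule into account
could look like. Everything here is hypothetical (the Op and Type enums,
the Inst struct and both functions are made up for illustration and are
much coarser than the real driver types). It also assumes that any
instruction with a double-precision operand counts as an "operation
involving double precision computation", which the PRM text quoted
above does not spell out.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical opcode and type tags, far coarser than the driver's enums.
enum class Op { Mov, Add, Mul, Pln, Lrp, Dp4 };
enum class Type { F, DF, D, W };

struct Inst {
    Op op;
    Type dst;
    Type src0;
    Type src1;  // for single-source ops, set equal to src0
};

// Longer-latency operations per the quoted PRM list: pln, lrp, dp*
// (only dp4 is modelled here), anything touching doubles (assumption,
// see above), and integer multiplies where both sources are DWs.
bool is_long_latency(const Inst &i)
{
    if (i.op == Op::Pln || i.op == Op::Lrp || i.op == Op::Dp4)
        return true;
    if (i.dst == Type::DF || i.src0 == Type::DF || i.src1 == Type::DF)
        return true;
    return i.op == Op::Mul && i.src0 == Type::D && i.src1 == Type::D;
}

// Returns true if the sequence mixes long- and short-latency operations,
// i.e. the case where the PRM says dependency control must not be used.
bool mixes_latencies(const std::vector<Inst> &seq)
{
    for (std::size_t k = 1; k < seq.size(); ++k) {
        if (is_long_latency(seq[k]) != is_long_latency(seq[0]))
            return true;
    }
    return false;
}
```

Under that assumption, a float mov followed by a mov with a DF source
would be reported as mixed, while a run of plain float movs would not.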