[Mesa-dev] The i965 vec4 backend, exec_masks, and 64-bit types

Sun Nov 8 11:34:56 PST 2015

On Tue, Nov 3, 2015 at 8:04 PM, Francisco Jerez <currojerez at riseup.net> wrote:
> Francisco Jerez <currojerez at riseup.net> writes:
>
>> Connor Abbott <cwabbott0 at gmail.com> writes:
>>
>>> Hi all,
>>>
>>> While working on FP64 for i965, there's an issue that I thought of
>>> with the vec4 backend that I'm not sure how to resolve. From what I
>>> understand, the execmask works the same way in Align16 mode as Align1
>>> mode, except that you only use the first 8 channels in practice for
>>> SIMD4x2, and the first four channels are always the same as well as
>>> the last 4 channels. But this doesn't work for 64-bit things, since
>>> there we only operate on 4 components at the same time, so it's more
>>> like SIMD2x2. For example, imagine that only the second vertex is
>>> currently enabled at the moment. Then the execmask looks like
>>> 00001111, and if we do something like:
>>>
>>> mul(4)          g24<1>DF     g12<4,4,1>DF g13<4,4,1>DF { align16 };
>>>
>>> then all 4 channels will be disabled, which is not what we want.
>>>
>> AFAIUI this shouldn't be a problem.  In align16 mode each component of
>> an instruction with double-precision execution type maps to *two* bits
>> of the execmask instead of one (one for each 32-bit half), which is
>> compensated by each logical thread having two components instead of
>> four, so in your example [assuming 00001111 is little-endian notation
>> and you actually do 'mul(8)' ;)] the x and y components of the first
>> logical thread will be disabled while the x and y components of the
>> second logical thread will be enabled.

That certainly makes sense... I just couldn't find a doc reference to
confirm or deny it.
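
To make sure I'm reading that explanation right, here is a minimal C sketch of the mapping as described: each DF component takes two adjacent execmask bits, and each logical thread owns one nibble. The helper name is made up for illustration, and treating a component as enabled only when both of its bits are set is my assumption, not something the docs or the simulator confirm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model, not Mesa code: in Align16 mode with a DF execution
 * type, component `comp` (0 = x, 1 = y) of logical thread `thread` (0 or 1)
 * is assumed to be covered by execmask bits (4 * thread + 2 * comp) and the
 * bit right above it, one bit per 32-bit half.
 */
static bool
df_component_enabled(uint8_t execmask, unsigned thread, unsigned comp)
{
   assert(thread < 2 && comp < 2);

   const unsigned first_bit = 4 * thread + 2 * comp;
   const uint8_t pair = (uint8_t)(0x3u << first_bit);

   /* Assumption: both halves must be enabled for the double to be written. */
   return (execmask & pair) == pair;
}
```

With the mask from the example ("00001111" written with bit 0 first, i.e. 0xf0), this reports x and y of the first logical thread as disabled and x and y of the second as enabled, which matches the description above.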

>>
>
> I've had a look into the simulator's behaviour, and in fact HSW+ seem to
> sort of support actual SIMD4x2 on DF types, so when you do stuff like
>
> | mul(8)   g24.xyzw:df    g12<4>.xyzw:df  g12<4>.xyzw:df { align16 };
>
> it will actually write 8 double floats to g24-25 (using a nibble from
> the execmask for each vec4), which contradicts the hardware spec:
>
> | IVB+
> |
> | In Align16 mode, all regioning parameters must use the syntax of a pair
> | of packed floats, including channel selects and channel enables.
> |
> | // Example:
> | mov (8) r10.0.xyzw:df r11.0.xyzw:df
> | // The above instruction moves four double floats. The .x picks the
> | // low 32 bits and the .y picks the high 32 bits of the double float.
>
> (I believe the quotation above may only apply to IVB even though it's
>  marked IVB+).

Thanks for looking into this. Indeed, at least on BDW the exec_size
does need to be divided by 2 (I have a patch on my branch that does
this, and it fixed a number of piglit tests). That's why the example I
wrote had an exec_size of 4.

>
> Now the really weird thing I've noticed: A DF Align16 instruction with
> writemask XY will actually write components XZ of each vec4, and
> writemask ZW actually writes components YW (!).  Other writemasks seem
> to behave normally (including all scalar ones).  I haven't found any
> mention of this in the docs, but a quick test on real hardware confirms
> the simulator's behaviour.

Ugh, really... that sucks :/
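
Just to pin down the behaviour being reported, a small C sketch of the remapping follows. It only encodes the two cases mentioned (XY lands on XZ, ZW lands on YW) and passes everything else through, since the other writemasks are said to behave normally. The enum and function names are invented for this sketch and are not taken from the backend.

```c
#include <assert.h>

/* Illustrative writemask bits, one per vec4 component (names are made up). */
enum {
   WM_X = 0x1,
   WM_Y = 0x2,
   WM_Z = 0x4,
   WM_W = 0x8,
   WM_XY = WM_X | WM_Y,
   WM_ZW = WM_Z | WM_W,
   WM_XZ = WM_X | WM_Z,
   WM_YW = WM_Y | WM_W,
};

/*
 * Model of the components a DF Align16 instruction was observed to write
 * for a requested writemask. Only the two odd cases from the report are
 * special-cased; every other mask is assumed to be honoured as written.
 */
static unsigned
observed_df_writemask(unsigned requested)
{
   assert(requested <= 0xf);

   if (requested == WM_XY)
      return WM_XZ;
   if (requested == WM_ZW)
      return WM_YW;

   return requested;
}
```

If this holds up, any code that emits XY or ZW writemasks for doubles would need to work around those two cases, while scalar and XYZW writemasks could be left alone.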

>
> Swizzles OTOH still shuffle individual 32-bit fields and are extended
> cyclically into the ZW components of the instruction (how useful).
>
> I wonder if we would be better off scalarizing all FP64 code...

Yeah, maybe we could get away with putting each component into a
separate register, and always using XYZW writemasks... but we'd still
need to pack two things into a single dvec2 for e.g. SSBOs, so it
wouldn't work there. We don't support them today, although I'm still
not 100% sure we can always get rid of all the packing operations...
and relying on the optimizations to get rid of them seems kinda
fragile. We could make dvec2() work using normal 32-bit MOVs,
although at that point it might be easier not to scalarize and instead
have double operations output to temporaries and then use a 32-bit MOV
to apply the right writemask.
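
For that last idea, the 32-bit MOV would need a writemask expressed in 32-bit channels rather than in doubles. Here is a rough C sketch of the conversion for a dvec2, assuming the packed layout from the spec quote above, where the low and high halves of the first double sit in .x and .y and those of the second double in .z and .w. The function name is hypothetical and none of this has been tested on hardware.

```c
#include <assert.h>

/*
 * Hypothetical helper: expand a writemask over the two doubles of a dvec2
 * (bit 0 = first double, bit 1 = second double) into a writemask over the
 * four 32-bit channels that hold them.
 *
 * Assumed layout: double i occupies 32-bit channels 2*i and 2*i + 1.
 */
static unsigned
dvec2_writemask_to_32bit(unsigned df_mask)
{
   assert(df_mask <= 0x3);

   unsigned mask32 = 0;

   for (unsigned i = 0; i < 2; i++) {
      if (df_mask & (1u << i))
         mask32 |= 0x3u << (2 * i);   /* both halves of double i */
   }

   return mask32;
}
```

So writing only the second double of the dvec2 would turn into a 32-bit MOV from the temporary with a ZW writemask (0xc), which avoids putting a partial writemask on the DF instruction itself.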

>
>>> I think the first thing to do is to write a piglit test that tests
>>> this case, since currently all the arb_gpu_shader_fp64 tests only use
>>> uniforms. We need a test that uses non-uniform control flow that
>>> triggers the case described above. Once we do that, and if we
>>> determine there's actually a problem, then we need to figure out how
>>> to solve it. The ideas I had were:
>>>
>>
>> I guess a piglit test would be nice, but you're unlikely to have to do
>> much about it. ;)
>>
> I think I take my word back, this isn't going to be fun. :P

hehe :)

>
>>> 1. make every FP64 thing use WE_all. This isn't actually too bad at
>>> the moment, since our notion of interference already assumes
>>> (more-or-less) that everything is WE_all, but it prevents us from
>>> improving it in the future with FP64 things. Unfortunately, it also
>>> means that we can't use writemasks since setting WE_all makes the EU
>>> ignore the writemask, so we'll have to do some trickery to get things
>>> with only 1 channel enabled to work correctly.
>>>
>>> 2. Use the NibCtrl field, and split each FP64 operation into 2.
>>> Unfortunately, this field only appeared on gen8, and the PRM only says
>>> it works for SIMD4 operations, whereas we need it to work for SIMD2
>>> operations, although there's a chance it'll actually work for SIMD2 as
>>> well. This lets us potentially do better register allocation, but it
>>> might not work and even if it does it won't work for gen7.
>>>
>> NibCtrl is Gen7+ actually.  I believe that indeed has a good chance of
>> working for Align16 2-wide DF instructions but I don't know for sure
>> offhand.
>>
>>> #1 sounds like the better solution for now, but who knows... maybe the
>>> HW people magically made it work already, and I'm not aware or they
>>> didn't document it.
>>>
>>> Connor