[Mesa-dev] The i965 vec4 backend, exec_masks, and 64-bit types

Tue Nov 3 17:04:41 PST 2015

Francisco Jerez <currojerez at riseup.net> writes:

> Connor Abbott <cwabbott0 at gmail.com> writes:
>
>> Hi all,
>>
>> While working on FP64 for i965, there's an issue that I thought of
>> with the vec4 backend that I'm not sure how to resolve. From what I
>> understand, the execmask works the same way in Align16 mode as Align1
>> mode, except that you only use the first 8 channels in practice for
>> SIMD4x2, and the first four channels are always the same as well as
>> the last 4 channels. But this doesn't work for 64-bit things, since
>> there we only operate on 4 components at the same time, so it's more
>> like SIMD2x2. For example, imagine that only the second vertex is
>> currently enabled at the moment. Then the execmask looks like
>> 00001111, and if we do something like:
>>
>> mul(4)          g24<1>DF     g12<4,4,1>DF g13<4,4,1>DF { align16 };
>>
>> then all 4 channels will be disabled, which is not what we want.
>>
> AFAIUI this shouldn't be a problem.  In align16 mode each component of
> an instruction with double-precision execution type maps to *two* bits
> of the execmask instead of one (one for each 32-bit half), which is
> compensated by each logical thread having two components instead of
> four, so in your example [assuming 00001111 is little-endian notation
> and you actually do 'mul(8)' ;)] the x and y components of the first
> logical thread will be disabled while the x and y components of the
> second logical thread will be enabled.
>
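
As a rough illustration of the mapping described in the quoted paragraph
(not a statement about what any particular hardware generation does),
here is a small Python sketch. The function name and the string encoding
of the execmask (bit 0 written first, as in the "little-endian notation"
above) are assumptions made for this example. So is the rule that a DF
component counts as enabled only when both of its 32-bit halves are
enabled.

```python
def df_component_enables(execmask):
    """Model of the two-bits-per-DF-component mapping for an 8-wide
    Align16 instruction.

    execmask: 8-character string of '0'/'1', bit 0 first (assumed
    encoding).  Returns {(logical_thread, component): enabled}.
    """
    if len(execmask) != 8 or set(execmask) - {"0", "1"}:
        raise ValueError("expected 8 characters of '0'/'1'")
    bits = [c == "1" for c in execmask]
    enables = {}
    for thread in range(2):                  # two logical threads
        for index, comp in enumerate("xy"):  # two DF components each
            low = thread * 4 + index * 2     # bit of the low 32-bit half
            # Assumption: both halves must be enabled for the double.
            enables[(thread, comp)] = bits[low] and bits[low + 1]
    return enables


print(df_component_enables("00001111"))
```

With the execmask from the example, x and y of the first logical thread
come out disabled and x and y of the second come out enabled.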

I've had a look into the simulator's behaviour, and in fact HSW+ seems to
sort of support actual SIMD4x2 on DF types, so when you do stuff like

| mul(8)   g24.xyzw:df    g12<4>.xyzw:df  g12<4>.xyzw:df { align16 };

it will actually write 8 double floats to g24-25 (using a nibble from
the execmask for each vec4), which contradicts the hardware spec:

| IVB+
| 
| In Align16 mode, all regioning parameters must use the syntax of a pair
| of packed floats, including channel selects and channel enables.
| 
| // Example:
| mov (8) r10.0.xyzw:df r11.0.xyzw:df
| // The above instruction moves four double floats. The .x picks the
| // low 32 bits and the .y picks the high 32 bits of the double float.

(I believe the quotation above may only apply to IVB even though it's
 marked IVB+).

Now the really weird thing I've noticed: A DF Align16 instruction with
writemask XY will actually write components XZ of each vec4, and
writemask ZW actually writes components YW (!).  Other writemasks seem
to behave normally (including all scalar ones).  I haven't found any
mention of this in the docs, but a quick test on real hardware confirms
the simulator's behaviour.

Swizzles OTOH still shuffle individual 32-bit fields and are extended
cyclically into the ZW components of the instruction (how useful).
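
To make the writemask observation above concrete, here is a minimal
Python sketch of it. OBSERVED_REMAP and effective_df_writemask are names
made up for this example. The table encodes only the two anomalies
reported above and passes every other writemask through unchanged, which
is an assumption based on "seem to behave normally" rather than on
exhaustive testing.

```python
# Anomalies reported above for DF Align16 instructions: the requested
# writemask on the left ends up writing the components on the right.
OBSERVED_REMAP = {"xy": "xz", "zw": "yw"}


def effective_df_writemask(writemask):
    """Return the components of each vec4 that would actually be
    written, under the behaviour observed above (hypothetical helper)."""
    # Normalise to lowercase, drop duplicates, sort in x, y, z, w order.
    normalised = "".join(sorted(set(writemask.lower()), key="xyzw".index))
    # Assumption: masks other than XY and ZW are written as requested.
    return OBSERVED_REMAP.get(normalised, normalised)


for mask in ("x", "xy", "zw", "xyzw"):
    print(mask, "->", effective_df_writemask(mask))
```

A backend that wanted to write the XY pair would presumably have to pick
a different encoding, since the plain XY mask lands on XZ in this model.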

I wonder if we would be better off scalarizing all FP64 code...

>> I think the first thing to do is to write a piglit test that tests
>> this case, since currently all the arb_gpu_shader_fp64 tests only use
>> uniforms. We need a test that uses non-uniform control flow that
>> triggers the case described above. Once we do that, and if we
>> determine there's actually a problem, then we need to figure out how
>> to solve it. The ideas I had were:
>>
>
> I guess a piglit test would be nice, but you're unlikely to have to do
> much about it. ;)
>
I think I take that back, this isn't going to be fun. :P

>> 1. make every FP64 thing use WE_all. This isn't actually too bad at
>> the moment, since our notion of interference already assumes
>> (more-or-less) that everything is WE_all, but it prevents us from
>> improving it in the future with FP64 things. Unfortunately, it also
>> means that we can't use writemasks since setting WE_all makes the EU
>> ignore the writemask, so we'll have to do some trickery to get things
>> with only 1 channel enabled to work correctly.
>>
>> 2. Use the NibCtrl field, and split each FP64 operation into 2.
>> Unfortunately, this field only appeared on gen8, and the PRM only says
>> it works for SIMD4 operations, whereas we need it to work for SIMD2
>> operations, although there's a chance it'll actually work for SIMD2 as
>> well. This lets us potentially do better register allocation, but it
>> might not work and even if it does it won't work for gen7.
>>
> NibCtrl is Gen7+ actually.  I believe that indeed has a good chance of
> working for Align16 2-wide DF instructions but I don't know for sure
> offhand.
>
>> #1 sounds like the better solution for now, but who knows... maybe the
>> HW people magically made it work already, and I'm not aware or they
>> didn't document it.
>>
>> Connor