[Mesa-dev] The i965 vec4 backend, exec_masks, and 64-bit types

Wed Nov 18 06:08:24 PST 2015

Connor Abbott <cwabbott0 at gmail.com> writes:

> On Tue, Nov 3, 2015 at 8:04 PM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Francisco Jerez <currojerez at riseup.net> writes:
>>
>>> Connor Abbott <cwabbott0 at gmail.com> writes:
>>>
>>>> Hi all,
>>>>
>>>> While working on FP64 for i965, there's an issue that I thought of
>>>> with the vec4 backend that I'm not sure how to resolve. From what I
>>>> understand, the execmask works the same way in Align16 mode as Align1
>>>> mode, except that you only use the first 8 channels in practice for
>>>> SIMD4x2, and the first four channels are always the same as well as
>>>> the last 4 channels. But this doesn't work for 64-bit things, since
>>>> there we only operate on 4 components at the same time, so it's more
>>>> like SIMD2x2. For example, imagine that only the second vertex is
>>>> currently enabled at the moment. Then the execmask looks like
>>>> 00001111, and if we do something like:
>>>>
>>>> mul(4)          g24<1>DF     g12<4,4,1>DF g13<4,4,1>DF { align16 };
>>>>
>>>> then all 4 channels will be disabled, which is not what we want.
>>>>
>>> AFAIUI this shouldn't be a problem.  In align16 mode each component of
>>> an instruction with double-precision execution type maps to *two* bits
>>> of the execmask instead of one (one for each 32-bit half), which is
>>> compensated by each logical thread having two components instead of
>>> four, so in your example [assuming 00001111 is little-endian notation
>>> and you actually do 'mul(8)' ;)] the x and y components of the first
>>> logical thread will be disabled while the x and y components of the
>>> second logical thread will be enabled.
>
> That certainly makes sense... I just couldn't find a doc reference to
> confirm or deny it.
>
>>>
>>
>> I've had a look into the simulator's behaviour, and in fact HSW+ seem to
>> sort of support actual SIMD4x2 on DF types, so when you do stuff like
>>
>> | mul(8)   g24.xyzw:df    g12<4>.xyzw:df  g12<4>.xyzw:df { align16 };
>>
>> it will actually write 8 double floats to g24-25 (using a nibble from
>> the execmask for each vec4), which contradicts the hardware spec:
>>
>> | IVB+
>> |
>> | In Align16 mode, all regioning parameters must use the syntax of a pair
>> | of packed floats, including channel selects and channel enables.
>> |
>> | // Example:
>> | mov (8) r10.0.xyzw:df r11.0.xyzw:df
>> | // The above instruction moves four double floats. The .x picks the
>> | // low 32 bits and the .y picks the high 32 bits of the double float.
>>
>> (I believe the quotation above may only apply to IVB even though it's
>>  marked IVB+).
>
> Thanks for looking into this. Indeed, at least on BDW the exec_size
> does need to be divided by 2 (I have a patch on my branch that does
> this, and it fixed a number of piglit tests). That's why the example I
> wrote had an exec_size of 4.
>

Uhm...  The thing is that on HSW+ you get 4 actual FP64 channels per
vertex, so if you set the execution size to 4 only the channels of the
first vertex will be executed and you'll definitely run into the problem
you described in your original e-mail.  IOW the execution size needs to
be 8 on HSW+ for the channel enables to be applied correctly unless you
use NoMask and apply the channel enables later on using moves, or split
the instruction in half and use NibCtrl to select the right channel
enable signals as you suggested earlier.
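
To make the execution-size argument concrete, here is a toy Python model
of the behaviour described above.  It is only a sketch, not taken from
the docs or the simulator: it assumes one execmask bit per FP64 channel,
four channels per vertex, and bit 0 being the first channel of the first
vertex.  The function name is made up for illustration.

```python
CHANNELS_PER_VERTEX = 4  # assumption: 4 FP64 channels per vertex on HSW+

def enabled_channels(execmask, exec_size):
    """Return the (vertex, component) pairs that would execute.

    Toy model: channel i executes iff i < exec_size and bit i of the
    execmask is set.
    """
    comps = "xyzw"
    return [(ch // CHANNELS_PER_VERTEX, comps[ch % CHANNELS_PER_VERTEX])
            for ch in range(exec_size)
            if (execmask >> ch) & 1]

# Only the second vertex is enabled: channels 4-7 set, channels 0-3 clear.
SECOND_VERTEX_ONLY = 0xF0

print(enabled_channels(SECOND_VERTEX_ONLY, 4))  # nothing executes
print(enabled_channels(SECOND_VERTEX_ONLY, 8))  # x, y, z, w of vertex 1
```

Under these assumptions an execution size of 4 never reaches the channel
enables of the second vertex, while an execution size of 8 does.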

(Sorry for the late reply BTW, I was on vacation last week).

>>
>> Now the really weird thing I've noticed: A DF Align16 instruction with
>> writemask XY will actually write components XZ of each vec4, and
>> writemask ZW actually writes components YW (!).  Other writemasks seem
>> to behave normally (including all scalar ones).  I haven't found any
>> mention of this in the docs, but a quick test on real hardware confirms
>> the simulator's behaviour.
>
> Ugh, really... that sucks :/
>
>>
>> Swizzles OTOH still shuffle individual 32-bit fields and are extended
>> cyclically into the ZW components of the instruction (how useful).
>>
>> I wonder if we would be better off scalarizing all FP64 code...
>
> Yeah, maybe we could get away with putting each component into a
> separate register, and always using XYZW writemasks... but we'd still
> need to pack two things into a single dvec2 for e.g. SSBO's, so it
> wouldn't work there. We don't support them today, although I'm still
> not 100% sure we can always get rid of all the packing operations...
> and relying on the optimizations to get rid of them seems kinda
> fragile. We could make dvec2() work using normal 32-bit MOV's,
> although at that point it might be easier not to scalarize and instead
> have double operations output to temporaries and then use a 32-bit MOV
> to apply the right writemask.
>
Relying on optimizations to get rid of packing sounds reasonable to me.
The packing could be done using an Align1 move like:

 mov (8) r0.0<1>:d r1.0<4,2,1>:d
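
For reference, a small Python sketch of which source dwords that move
reads.  It assumes the usual Align1 region interpretation
<VertStride,Width,HorzStride> in units of elements; the helper name is
hypothetical.

```python
def region_offsets(exec_size, vstride, width, hstride):
    """Element offsets read by an Align1 region <vstride,width,hstride>."""
    return [(i // width) * vstride + (i % width) * hstride
            for i in range(exec_size)]

# Source r1.0<4,2,1>:d with an execution size of 8.
src = region_offsets(8, vstride=4, width=2, hstride=1)
print(src)  # [0, 1, 4, 5, 8, 9, 12, 13]
```

So the move picks the first two dwords (one 64-bit value) out of every
four dwords of the source and writes them contiguously to the
destination.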

Another alternative would be to emit actual dvec4 instructions (with any
unnecessary components masked out), and lower XY arithmetic like:

 mul (8) r0.0.xy:df r2.0<4>.xyzw:df r4.0<4>.xyzw:df { align16 }

into the same instruction with the Y and Z components permuted.  The
permutation of the sources and destination can be done destructively
with something like (e.g. for the r2 source):

 mov (8) r2.4.xyzw:d r2.4<4>.zwxy:d

or non-destructively like:

 mov (8) r6.0.xw:df r2.0<4>.xyzw:df
 mov (8) r6.4.xyzw:d r2.4<4>.zwxy:d

A simple peephole optimization could recognise that the double
application of the permutation instruction is equivalent to the
identity.
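
A toy Python model of this lowering, at the level of logical dvec4
components rather than registers.  It only encodes the writemask quirk
as reported above (XY writes XZ, ZW writes YW); all names are made up
for illustration, and it makes no claim about what the moves above do at
the dword level.

```python
# Reported quirk for DF Align16 instructions: writemask XY actually
# writes XZ, and ZW actually writes YW.  Other writemasks are modelled
# as behaving normally.
QUIRK = {"xy": "xz", "zw": "yw"}

def swap_yz(v):
    """Permute the Y and Z components of a 4-component value."""
    x, y, z, w = v
    return (x, z, y, w)

def hw_mul(dst, a, b, writemask):
    """Toy model of a DF Align16 mul with the writemask quirk applied."""
    written = QUIRK.get(writemask, writemask)
    return tuple(a[i] * b[i] if c in written else dst[i]
                 for i, c in enumerate("xyzw"))

def lowered_mul_xy(dst, a, b):
    """dst.xy = a.xy * b.xy, with sources and destination permuted."""
    tmp = hw_mul(swap_yz(dst), swap_yz(a), swap_yz(b), "xy")
    return swap_yz(tmp)

a = (2.0, 3.0, 5.0, 7.0)
b = (10.0, 10.0, 10.0, 10.0)
dst = (-1.0, -1.0, -1.0, -1.0)

print(hw_mul(dst, a, b, "xy"))    # (20.0, -1.0, 50.0, -1.0): X and Z written
print(lowered_mul_xy(dst, a, b))  # (20.0, 30.0, -1.0, -1.0): X and Y written
print(swap_yz(swap_yz(a)) == a)   # True: two permutations cancel out
```

The last line is the property the peephole pass would rely on.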

>>
>>>> I think the first thing to do is to write a piglit test that tests
>>>> this case, since currently all the arb_gpu_shader_fp64 tests only use
>>>> uniforms. We need a test that uses non-uniform control flow that
>>>> triggers the case described above. Once we do that, and if we
>>>> determine there's actually a problem, then we need to figure out how
>>>> to solve it. The ideas I had were:
>>>>
>>>
>>> I guess a piglit test would be nice, but you're unlikely to have to do
>>> much about it. ;)
>>>
>> I think I take my word back, this isn't going to be fun. :P
>
> hehe :)
>
>>
>>>> 1. make every FP64 thing use WE_all. This isn't actually too bad at
>>>> the moment, since our notion of interference already assumes
>>>> (more-or-less) that everything is WE_all, but it prevents us from
>>>> improving it in the future with FP64 things. Unfortunately, it also
>>>> means that we can't use writemasks since setting WE_all makes the EU
>>>> ignore the writemask, so we'll have to do some trickery to get things
>>>> with only 1 channel enabled to work correctly.
>>>>
>>>> 2. Use the NibCtrl field, and split each FP64 operation into 2.
>>>> Unfortunately, this field only appeared on gen8, and the PRM only says
>>>> it works for SIMD4 operations, whereas we need it to work for SIMD2
>>>> operations, although there's a chance it'll actually work for SIMD2 as
>>>> well. This lets us potentially do better register allocation, but it
>>>> might not work and even if it does it won't work for gen7.
>>>>
>>> NibCtrl is Gen7+ actually.  I believe that indeed has a good chance of
>>> working for Align16 2-wide DF instructions but I don't know for sure
>>> offhand.
>>>
>>>> #1 sounds like the better solution for now, but who knows... maybe the
>>>> HW people magically made it work already, and I'm not aware or they
>>>> didn't document it.
>>>>
>>>> Connor