[Mesa-dev] [PATCH] i965/fs: Don't disable SIMD16 when using the pixel interpolator

Tue Jul 7 08:56:03 PDT 2015

Matt Turner <mattst88 at gmail.com> writes:

> On Sun, Jul 5, 2015 at 4:45 PM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Hi Matt,
>>
>> Matt Turner <mattst88 at gmail.com> writes:
>>
>>> On Fri, Jul 3, 2015 at 3:46 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>> Heh, I happened to come across this comment yesterday while looking for
>>>> the remaining no16 calls and wondered why on earth it couldn't do the
>>>> same that the normal interpolation code does.  After this patch and a
>>>> series coming up that will remove all SIMD8 fallbacks from the texturing
>>>> code, the only case left still applicable to Gen7 hardware and later
>>>> will be "SIMD16 explicit accumulator operands unsupported".  Anyone?
>>>
>>> I can explain the problem:
>>>
>>> Prior to Gen7, there were two accumulator registers usable for most
>>> datatypes (acc0, acc1). On Gen7, they removed integer-support from
>>> acc1, which was necessary to implement SIMD16 integer multiplication
>>> using the normal MUL/MACH sequence.
>>
>> IIRC they got rid of the acc1 register on IVB altogether, but managed to
>> emulate it for floating point types by taking advantage of the extra
>> precision not normally used for floating point arithmetic (the fake acc1
>> basically uses the same storage in the EU that holds the 32 MSBs of each
>> component of acc0), which explains the apparent asymmetry between integer
>> and floating point data types.
>
> I've never read anything that told me that -- what have you seen?

Heh, I'll try to dig up my reference and send it to you in private.

>
>>> I implemented 32-bit integer multiplication without using the
>>> accumulator in:
>>>
>>> commit f7df169ba13d22338e9276839a7e9629ca0a6b4f
>>> Author: Matt Turner <mattst88 at gmail.com>
>>> Date:   Wed May 13 18:34:03 2015 -0700
>>>
>>>     i965/fs: Implement integer multiply without mul/mach.
>>>
>>> The remaining cases of "SIMD16 explicit accumulator operands
>>> unsupported" are ADDC, SUBB, and 32x32 -> high 32-bit multiplication.
>>> The remaining multiplication case can probably be reimplemented
>>> without the accumulator, like I did for the low 32-bit result.
>>>
>> Hmm, I have the suspicion that high 32-bit multiplication is the one
>> legit use-case of the accumulator we have left, any algorithm breaking
>> it up into individual 32/16-bit MULs would end up doing more
>> multiplications than the two MUL/MACH instructions we do now, because we
>> wouldn't be able to take advantage of the full precision implemented in
>> the hardware if we truncate the 48-bit intermediate results to fit in a
>> 32-bit register.
>
> That's probably true. It's just that Sandybridge and earlier don't
> expose the functionality (but could do 64-bit integer multiplication
> just fine), Ivybridge has the quarter-control/accumulator bug, Haswell
> works fine if you split the multiplication sequence into SIMD8, and
> Broadwell lets you do 32x32 -> 64-bit multiplication without the
> accumulator.
>
> So you have only two platforms where you have to use the
> accumulator, and one of them is broken (but I guess can be trivially
> fixed by some force-writemask-all hackery).
>

I guess there's also VLV, CHV and BXT.  AFAIK the latter two have some
level of support for 64-bit multiplication (with the annoying alignment
restriction on the operands) but it might be easier for them to use the
accumulator path like earlier hardware.

> The best SIMD16 code for [iu]mulExtended() where both lsb and msb
> results are used is probably 2 sets of mul/mach/mov (with some kind of
> work around for Ivybridge), but that's kind of hard to recognize.
>
It's probably also the best SIMD16 code (on chips without reasonable
support for 64-bit multiply that is) for computing the high 32 bits of
the result, regardless of whether the optimizer is able to recognise that
the low 32 bits of the computation also come out as a side product, and
whether or not the low 32 bits are used by the shader.
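
For reference, a rough scalar sketch in C of what computing the high 32
bits looks like when it's broken up into 16-bit pieces with no wide
intermediate.  This only models the arithmetic, it is not i965 code, and
the function names are made up for illustration.  It takes four 16x16
multiplies plus the carry handling, which is the kind of overhead the
MUL/MACH pair avoids:

```c
#include <assert.h>
#include <stdint.h>

/* Reference: high 32 bits of the product using a 64-bit intermediate. */
uint32_t umul_hi32_wide(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Same result using only 16x16 -> 32-bit multiplies (hypothetical helper). */
uint32_t umul_hi32_split16(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xffff, a_hi = a >> 16;
    uint32_t b_lo = b & 0xffff, b_hi = b >> 16;

    /* Four partial products, each of which fits in 32 bits. */
    uint32_t ll = a_lo * b_lo;   /* weight 2^0  */
    uint32_t lh = a_lo * b_hi;   /* weight 2^16 */
    uint32_t hl = a_hi * b_lo;   /* weight 2^16 */
    uint32_t hh = a_hi * b_hi;   /* weight 2^32 */

    /* Bits 16..31 of the product; bits 16 and up of 'mid' carry into bit 32. */
    uint32_t mid = (ll >> 16) + (lh & 0xffff) + (hl & 0xffff);

    return hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
}

/* Small self-check comparing both versions on a few inputs. */
void check_umul_hi32(void)
{
    assert(umul_hi32_split16(0xffffffffu, 0xffffffffu) ==
           umul_hi32_wide(0xffffffffu, 0xffffffffu));
    assert(umul_hi32_split16(0x12345678u, 0x9abcdef0u) ==
           umul_hi32_wide(0x12345678u, 0x9abcdef0u));
}
```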

A potential solution could be to have the visitor emit full 64-bit MULs
speculatively for any 32-bit integer multiplication (high or low),
together with a MOV to chop off the unnecessary bits.  A later
optimization pass (run after CSE to give the optimizer the opportunity
to merge the 64-bit MULs from the high and low 32-bit computations)
would demote 64-bit MULs for which only the lowest 32 bits of the result
are used to 32-bit MULs.  Later on the SIMD width lowering pass would
split 16-wide 64-bit MULs in half, and a final pass would lower them
into the MUL/MACH sequence on platforms that don't support full 64-bit
MULs natively.

Not sure if it's worth doing at this point.  I can have a look into
implementing the lowering pass for 64-bit MULs so we can start taking
advantage of the SIMD width lowering pass and get rid of the no16() call
right away, but the additional optimization pass to demote 64-bit MULs
(and speculative emission of 64-bit MULs from the visitor) can probably
wait until we have some use-case?
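
To make the splitting step concrete, here is a minimal scalar model in C
of a 16-wide 64-bit multiply processed as two 8-wide halves.  It's only a
sketch under loud assumptions: the names are hypothetical, nothing here
is actual Mesa code, and the 64-bit array is just a stand-in for an
8-channel accumulator (the real register is not laid out like that):

```c
#include <assert.h>
#include <stdint.h>

enum { SIMD8_WIDTH = 8, SIMD16_WIDTH = 16 };

/* Hypothetical stand-in for one SIMD8 MUL/MACH sequence: the full
 * product lands in an 8-channel "accumulator" and both halves are
 * read back from it. */
void mul64_simd8(const uint32_t *src0, const uint32_t *src1,
                 uint32_t *lo, uint32_t *hi)
{
    uint64_t acc[SIMD8_WIDTH];   /* assumption: models acc0 only */

    for (int i = 0; i < SIMD8_WIDTH; i++)
        acc[i] = (uint64_t)src0[i] * src1[i];

    for (int i = 0; i < SIMD8_WIDTH; i++) {
        lo[i] = (uint32_t)acc[i];
        hi[i] = (uint32_t)(acc[i] >> 32);
    }
}

/* What a width-lowering step would amount to: the 16-wide operation is
 * issued as two 8-wide halves, so only one accumulator is ever needed. */
void mul64_simd16(const uint32_t *src0, const uint32_t *src1,
                  uint32_t *lo, uint32_t *hi)
{
    for (int half = 0; half < SIMD16_WIDTH / SIMD8_WIDTH; half++) {
        int off = half * SIMD8_WIDTH;
        mul64_simd8(src0 + off, src1 + off, lo + off, hi + off);
    }
}
```

Demoting to a 32-bit MUL would then just mean dropping the hi[] output
when nothing reads it.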

>> How about we use the SIMD width lowering pass to split the computation
>> in half?  It should be quite straightforward but will probably require
>> adding a new virtual opcode so that the SIMD width lowering pass doesn't
>> have to deal with (seriously fucked-up) accumulators directly.
>
> Seems fine to me.
>
>>> The ADDC and SUBB instructions implicitly write a bit to the
>>> accumulator if their operations overflowed. The 1Q/2Q quarter control
>>> is supposed to select which register is implicitly written -- except
>>> that there is no acc1 for integer types. Haswell and newer ignore the
>>> quarter control and always write acc0, but IVB (and presumably BYT)
>>> attempt to write to the nonexistent acc1.
>>>
>>> You could split the SIMD16 operations into 2x SIMD8s and set
>>> force_writemask_all on the second, followed by a 2Q MOV from the
>>> accumulator. Maybe we'd rather use the .o (overflow) conditional mod
>>> on a result ADD to implement this.
>>>
>> Yeah.  I did in fact try to implement uaddCarry last Friday without
>> using the accumulator by doing something like:
>>
>> | CMP.o tmp, src0, -src1
>> | MOV dst, -tmp
>>
>> ...which of course didn't work because of the extra argument precision
>> post-source modifiers and also because the .o condmod doesn't work at
>> all on CMP, but...
>
> Ah, you were trying to use the fact that CMP returns 0/-1. That's a
> cool idea. It's too bad that the CMP instruction doesn't do .o
>
> I'd been thinking of doing "ADD.o tmp, src0, src1" and then something
> that sets/selects 0/1 based on the flag register. Maybe even a move
> from the flag register would be best.

Hm, what do you mean by a move?
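
For reference, the scalar semantics any accumulator-free version has to
reproduce can be sketched in C as below.  This is only a model of the
arithmetic with made-up function names, not a proposed instruction
sequence.  The second function swaps in a plain unsigned comparison
against ~src1 (producing a CMP-style 0/-1 that is then negated), which
is an assumption on my part and not the CMP.o attempt quoted above:

```c
#include <assert.h>
#include <stdint.h>

/* uaddCarry: returns 1 if a + b overflows 32 bits, and stores the
 * wrapped sum. */
uint32_t uadd_carry(uint32_t a, uint32_t b, uint32_t *sum)
{
    *sum = a + b;          /* unsigned addition wraps modulo 2^32 */
    return *sum < a;       /* the sum wrapped iff it is below an operand */
}

/* Carry only, via a 0/-1 comparison result followed by a negation. */
uint32_t uadd_carry_cmp(uint32_t a, uint32_t b)
{
    /* a + b overflows iff a > UINT32_MAX - b, and UINT32_MAX - b == ~b. */
    uint32_t mask = (a > ~b) ? 0xffffffffu : 0u;
    return 0u - mask;      /* 0 -> 0, 0xffffffff -> 1 */
}

/* usubBorrow: returns 1 if a - b needs a borrow, and stores the
 * wrapped difference. */
uint32_t usub_borrow(uint32_t a, uint32_t b, uint32_t *diff)
{
    *diff = a - b;
    return a < b;
}

/* Small self-check that both carry versions agree. */
void check_carry(void)
{
    uint32_t s;
    assert(uadd_carry(0xffffffffu, 1u, &s) == uadd_carry_cmp(0xffffffffu, 1u));
    assert(uadd_carry(1u, 2u, &s) == uadd_carry_cmp(1u, 2u));
}
```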