[Mesa-dev] [PATCH] i965/fs: Don't disable SIMD16 when using the pixel interpolator

Mon Jul 6 16:01:19 PDT 2015

On Sun, Jul 5, 2015 at 4:45 PM, Francisco Jerez <currojerez at riseup.net> wrote:
> Hi Matt,
>
> Matt Turner <mattst88 at gmail.com> writes:
>
>> On Fri, Jul 3, 2015 at 3:46 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>> Heh, I happened to come across this comment yesterday while looking for
>>> the remaining no16 calls and wondered why on earth it couldn't do the
>>> same as the normal interpolation code does.  After this patch and a
>>> series coming up that will remove all SIMD8 fallbacks from the texturing
>>> code, the only case left still applicable to Gen7 hardware and later
>>> will be "SIMD16 explicit accumulator operands unsupported".  Anyone?
>>
>> I can explain the problem:
>>
>> Prior to Gen7, there were two accumulator registers usable for most
>> datatypes (acc0, acc1). On Gen7, they removed integer-support from
>> acc1, which was necessary to implement SIMD16 integer multiplication
>> using the normal MUL/MACH sequence.
>
> IIRC they got rid of the acc1 register on IVB altogether, but managed to
> emulate it for floating point types by taking advantage of the extra
> precision not normally used for floating point arithmetic (the fake acc1
> basically uses the same storage in the EU that holds the 32 MSBs of each
> component of acc0), which explains the apparent asymmetry between integer
> and floating point data types.

I've never read anything that told me that -- what have you seen?

>> I implemented 32-bit integer multiplication without using the
>> accumulator in:
>>
>> commit f7df169ba13d22338e9276839a7e9629ca0a6b4f
>> Author: Matt Turner <mattst88 at gmail.com>
>> Date:   Wed May 13 18:34:03 2015 -0700
>>
>>     i965/fs: Implement integer multiply without mul/mach.
>>
>> The remaining cases of "SIMD16 explicit accumulator operands
>> unsupported" are ADDC, SUBB, and 32x32 -> high 32-bit multiplication.
>> The remaining multiplication case can probably be reimplemented
>> without the accumulator, like I did for the low 32-bit result.
>>
> Hmm, I have the suspicion that high 32-bit multiplication is the one
> legit use-case of the accumulator we have left, any algorithm breaking
> it up into individual 32/16-bit MULs would end up doing more
> multiplications than the two MUL/MACH instructions we do now, because we
> wouldn't be able to take advantage of the full precision implemented in
> the hardware if we truncate the 48-bit intermediate results to fit in a
> 32-bit register.

That's probably true. It's just that Sandybridge and earlier don't
expose the functionality (but could do 64-bit integer multiplication
just fine), Ivybridge has the quarter-control/accumulator bug, Haswell
works fine if you split the multiplication sequence into SIMD8, and
Broadwell lets you do 32x32 -> 64-bit multiplication without the
accumulator.

So you have only two platforms where you have to use the
accumulator, and one of them is broken (but I guess can be trivially
fixed by some force-writemask-all hackery).

The best SIMD16 code for [iu]mulExtended() where both lsb and msb
results are used is probably 2 sets of mul/mach/mov (with some kind of
workaround for Ivybridge), but that's kind of hard to recognize.
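
To make the cost argument concrete, here is a rough C model of what a
[iu]mulExtended()-style unsigned multiply looks like when it is built only
from 16x16 -> 32-bit products, i.e. without any wide intermediate such as
the accumulator. The function name is made up for illustration and this is
plain scalar arithmetic, not EU code; it only shows that four partial
products plus carry handling are needed to get both halves.

```c
#include <stdint.h>

/* Hypothetical helper: 32x32 -> 64-bit unsigned multiply using only
 * 16x16 -> 32-bit partial products.  Returns both result halves. */
void umul_extended_16bit(uint32_t a, uint32_t b, uint32_t *msb, uint32_t *lsb)
{
    uint32_t a_lo = a & 0xffff, a_hi = a >> 16;
    uint32_t b_lo = b & 0xffff, b_hi = b >> 16;

    /* Four partial products, each fits in 32 bits. */
    uint32_t ll = a_lo * b_lo;   /* contributes to bits  0..31 */
    uint32_t lh = a_lo * b_hi;   /* contributes to bits 16..47 */
    uint32_t hl = a_hi * b_lo;   /* contributes to bits 16..47 */
    uint32_t hh = a_hi * b_hi;   /* contributes to bits 32..63 */

    /* Sum of everything landing in bits 16..31; at most 3 * 0xffff,
     * so it cannot overflow.  Its upper half is the carry into bit 32. */
    uint32_t mid = (ll >> 16) + (lh & 0xffff) + (hl & 0xffff);

    *lsb = (mid << 16) | (ll & 0xffff);
    *msb = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
}
```

Compared with a two-instruction MUL/MACH pair, that is four multiplies
and a handful of adds and shifts per channel just to recover the bits a
wider intermediate would have kept.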

> How about we use the SIMD width lowering pass to split the computation
> in half?  It should be quite straightforward but will probably require
> adding a new virtual opcode so that the SIMD width lowering pass doesn't
> have to deal with (seriously fucked-up) accumulators directly.

Seems fine to me.
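
For the record, the shape of that split is roughly the following, as a
scalar C model. The names (mulh_simd8, mulh_simd16) are hypothetical and
this is not the actual lowering pass or any real opcode; mulh_simd8 just
stands in for "a virtual high-multiply op that is known to work at
SIMD8", and the wrapper runs it once per half of the SIMD16 channels.

```c
#include <stdint.h>

enum { SIMD8_WIDTH = 8, SIMD16_WIDTH = 16 };

/* Hypothetical stand-in for a virtual "high 32 bits of 32x32 multiply"
 * operation that is only valid for 8 channels at a time. */
static void mulh_simd8(const uint32_t *src0, const uint32_t *src1,
                       uint32_t *dst)
{
    for (int lane = 0; lane < SIMD8_WIDTH; lane++)
        dst[lane] = (uint32_t)(((uint64_t)src0[lane] * src1[lane]) >> 32);
}

/* The SIMD16 version is lowered to two SIMD8 halves, each working on
 * its own group of eight channels. */
void mulh_simd16(const uint32_t *src0, const uint32_t *src1, uint32_t *dst)
{
    for (int half = 0; half < SIMD16_WIDTH / SIMD8_WIDTH; half++) {
        int offset = half * SIMD8_WIDTH;
        mulh_simd8(src0 + offset, src1 + offset, dst + offset);
    }
}
```

The point of hiding the accumulator behind one virtual op is that the
splitting code only ever sees ordinary sources and destinations.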

>> The ADDC and SUBB instructions implicitly write a bit to the
>> accumulator if their operations overflowed. The 1Q/2Q quarter control
>> is supposed to select which register is implicitly written -- except
>> that there is no acc1 for integer types. Haswell and newer ignore the
>> quarter control and always write acc0, but IVB (and presumably BYT)
>> attempt to write to the nonexistent acc1.
>>
>> You could split the SIMD16 operations into 2x SIMD8s and set
>> force_writemask_all on the second, followed by a 2Q MOV from the
>> accumulator. Maybe we'd rather use the .o (overflow) conditional mod
>> on a result ADD to implement this.
>>
> Yeah.  I did in fact try to implement uaddCarry last Friday without
> using the accumulator by doing something like:
>
> | CMP.o tmp, src0, -src1
> | MOV dst, -tmp
>
> ...which of course didn't work because of the extra argument precision
> post-source modifiers and also because the .o condmod doesn't work at
> all on CMP, but...

Ah, you were trying to use the fact that CMP returns 0/-1. That's a
cool idea. It's too bad that the CMP instruction doesn't do .o.

I'd been thinking of doing "ADD.o tmp, src0, src1" and then something
that sets/selects 0/1 based on the flag register. Maybe even a move
from the flag register would be best.
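
In scalar C terms, the accumulator-free versions of uaddCarry and
usubBorrow come out something like the sketch below. Function names are
made up, and the C comparisons only model the arithmetic: they stand in
for whatever flag the hardware would produce and say nothing about how
the .o conditional modifier actually behaves. The first function keeps
the "0/-1 mask, then negate" trick; the second selects 0/1 directly.

```c
#include <stdint.h>

/* Hypothetical model of uaddCarry: returns the carry (0 or 1) and
 * stores the wrapped 32-bit sum. */
uint32_t uadd_carry(uint32_t src0, uint32_t src1, uint32_t *sum)
{
    *sum = src0 + src1;                    /* wraps modulo 2^32 */
    /* CMP-style result: all ones if the add wrapped, else zero. */
    uint32_t mask = (*sum < src0) ? 0xffffffffu : 0u;
    return 0u - mask;                      /* negate: 0/-1 becomes 0/1 */
}

/* Hypothetical model of usubBorrow: returns the borrow (0 or 1) and
 * stores the wrapped 32-bit difference. */
uint32_t usub_borrow(uint32_t src0, uint32_t src1, uint32_t *diff)
{
    *diff = src0 - src1;                   /* wraps modulo 2^32 */
    return (src0 < src1) ? 1u : 0u;        /* select 0/1 from the "flag" */
}
```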