[Mesa-dev] [PATCH] i965/fs: Don't disable SIMD16 when using the pixel interpolator

Sun Jul 5 16:45:59 PDT 2015

Hi Matt,

Matt Turner <mattst88 at gmail.com> writes:

> On Fri, Jul 3, 2015 at 3:46 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Heh, I happened to come across this comment yesterday while looking for
>> the remaining no16 calls and wondered why on earth it couldn't do the
>> same that the normal interpolation code does.  After this patch and a
>> series coming up that will remove all SIMD8 fallbacks from the texturing
>> code, the only case left still applicable to Gen7 hardware and later
>> will be "SIMD16 explicit accumulator operands unsupported".  Anyone?
>
> I can explain the problem:
>
> Prior to Gen7, there were two accumulator registers usable for most
> datatypes (acc0, acc1). On Gen7, they removed integer-support from
> acc1, which was necessary to implement SIMD16 integer multiplication
> using the normal MUL/MACH sequence.

IIRC they got rid of the acc1 register on IVB altogether, but managed to
emulate it for floating point types by taking advantage of the extra
precision not normally used for floating point arithmetic (the fake acc1
basically uses the same storage in the EU that holds the 32 MSBs of each
component of acc0), which explains the apparent asymmetry between integer
and floating point data types.

> I implemented 32-bit integer multiplication without using the
> accumulator in:
>
> commit f7df169ba13d22338e9276839a7e9629ca0a6b4f
> Author: Matt Turner <mattst88 at gmail.com>
> Date:   Wed May 13 18:34:03 2015 -0700
>
>     i965/fs: Implement integer multiply without mul/mach.
>
> The remaining cases of "SIMD16 explicit accumulator operands
> unsupported" are ADDC, SUBB, and 32x32 -> high 32-bit multiplication.
> The remaining multiplication case can probably be reimplemented
> without the accumulator, like I did for the low 32-bit result.
>
Hmm, I suspect that high 32-bit multiplication is the one legit use-case
of the accumulator we have left: any algorithm breaking it up into
individual 32/16-bit MULs would end up doing more multiplications than
the two MUL/MACH instructions we do now, because we wouldn't be able to
take advantage of the full precision implemented in the hardware if we
truncate the 48-bit intermediate results to fit in a 32-bit register.
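
To make the multiplication count concrete, here is a minimal C sketch of
the high 32 bits of an unsigned 32x32 product, built only from multiplies
whose results fit in 32 bits. It is the textbook schoolbook decomposition,
not anything modelled on the EU or taken from the i965 backend, and the
function names are made up for illustration. It needs four multiplies plus
the carry bookkeeping, against the two MUL/MACH instructions mentioned
above.

```c
#include <assert.h>
#include <stdint.h>

/* Reference result: high 32 bits of the full 64-bit product. */
static uint32_t umul_high_ref(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Hypothetical helper: same result using only 16x16 -> 32-bit multiplies
 * and 32-bit adds, i.e. no intermediate wider than a 32-bit register. */
static uint32_t umul_high_16(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xffff, a_hi = a >> 16;
    uint32_t b_lo = b & 0xffff, b_hi = b >> 16;

    /* Four partial products, each at most 0xfffe0001. */
    uint32_t ll = a_lo * b_lo;
    uint32_t lh = a_lo * b_hi;
    uint32_t hl = a_hi * b_lo;
    uint32_t hh = a_hi * b_hi;

    /* Fold the carries out of the low half into the middle terms.
     * Neither sum can overflow 32 bits given the bound above. */
    uint32_t mid  = lh + (ll >> 16);
    uint32_t mid2 = hl + (mid & 0xffff);

    return hh + (mid >> 16) + (mid2 >> 16);
}

/* Small self-check against the 64-bit reference. */
static void umul_high_selfcheck(void)
{
    assert(umul_high_16(0x12345678u, 0x9abcdef0u) ==
           umul_high_ref(0x12345678u, 0x9abcdef0u));
}
```

Calling umul_high_selfcheck() from a test harness compares the two
versions on one input; a wider sweep over random inputs would be the
obvious next step.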

How about we use the SIMD width lowering pass to split the computation
in half?  It should be quite straightforward but will probably require
adding a new virtual opcode so that the SIMD width lowering pass doesn't
have to deal with (seriously fucked-up) accumulators directly.

> The ADDC and SUBB instructions implicitly write a bit to the
> accumulator if their operations overflowed. The 1Q/2Q quarter control
> is supposed to select which register is implicitly written -- except
> that there is no acc1 for integer types. Haswell and newer ignore the
> quarter control and always write acc0, but IVB (and presumably BYT)
> attempt to write to the nonexistent acc1.
>
> You could split the SIMD16 operations into 2x SIMD8s and set
> force_writemask_all on the second, followed by a 2Q MOV from the
> accumulator. Maybe we'd rather use the .o (overflow) conditional mod
> on the resulting ADD to implement this.
>
Yeah.  I did in fact try to implement uaddCarry last Friday without
using the accumulator by doing something like:

| CMP.o tmp, src0, -src1
| MOV dst, -tmp

...which of course didn't work, both because of the extra precision the
arguments have after source modifiers are applied and because the .o
condmod doesn't work at all on CMP, but...

> Ideally, we'd recognize and merge the addition and carry operations into a
> single ADDC instruction, but it's pretty unimportant. It's all pretty
> academic -- I've never seen an application use either operation (or
> [iu]mulExtended either).

...if we did the following instead:

| ADD tmp, src0, src1
| CMP.l tmp, tmp, src0
| MOV dst, -tmp

the ADD could be easily CSE'ed with the original ADD instruction (and
the source modifier of the last MOV can also be easily propagated into
some other instruction), so even though it seems like one instruction
more than what we emit now it might be a net win (aside from it working
on SIMD16).  usubBorrow is even easier:

| CMP.l tmp, src0, src1
| MOV dst, -tmp
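
The arithmetic behind both sequences can be checked in plain C. The
sketch below is only a scalar model with hypothetical function names; it
assumes that CMP writes all-ones to its destination when the comparison
is true (which is what the negating MOV in the sequences above implies)
and that the comparison is unsigned.

```c
#include <assert.h>
#include <stdint.h>

/* Models: ADD tmp, src0, src1 ; CMP.l tmp, tmp, src0 ; MOV dst, -tmp */
static uint32_t uadd_carry(uint32_t src0, uint32_t src1)
{
    uint32_t sum  = src0 + src1;              /* wraps modulo 2^32 */
    /* Assumption: CMP yields ~0 on true, 0 on false. */
    uint32_t flag = (sum < src0) ? ~(uint32_t)0 : 0;
    return (uint32_t)0 - flag;                /* negate: ~0 becomes 1 */
}

/* Models: CMP.l tmp, src0, src1 ; MOV dst, -tmp */
static uint32_t usub_borrow(uint32_t src0, uint32_t src1)
{
    uint32_t flag = (src0 < src1) ? ~(uint32_t)0 : 0;
    return (uint32_t)0 - flag;
}

/* Self-check against 64-bit reference arithmetic. */
static void carry_borrow_selfcheck(void)
{
    uint32_t a = 0xfffffff0u, b = 0x20u;
    assert(uadd_carry(a, b) == (uint32_t)(((uint64_t)a + b) >> 32));
    assert(usub_borrow(b, a) == 1u);
}
```

The carry test works because an unsigned sum wraps exactly when it ends
up smaller than either operand, so comparing against src0 alone is
enough.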

I was planning to run it through shader-db tomorrow but if you say
you've never seen them used I guess I shouldn't get my hopes too high? :P