[Mesa-dev] [PATCH 00/10] i965: Silly vec4 un/packing optimizations

Fri Oct 24 10:21:44 PDT 2014

On Fri, Oct 24, 2014 at 1:45 AM, Francisco Jerez <currojerez at riseup.net> wrote:
> Matt Turner <mattst88 at gmail.com> writes:
>
>> When I implemented these built-ins a couple of years ago, I thought there
>> must be a neat way to optimize them. I tried a couple of things with the
>> different vector immediates i965 provides, but the V/UV types are too
>> small to represent the appropriate shift values, and shift instructions
>> can't shift by a floating-point source (if using the vector float imm).
>>
>> Curro pointed out that I could actually load the integer shift values
>> with a VF immediate just by doing a type-converting move. How simple.
>>
>> So anyway, these optimizations are of pretty negligible value, except
>> for maybe demonstrating that VF works. I've had them sitting on a branch
>> for months, so time for them to live somewhere else. At least we can
>> disassemble VF immediates now.
>>
>> I hope to have some more uses of VF immediates soon too.
>
> Hi Matt,
>
> a different approach I had in mind was to write an optimization pass
> that would vectorize immediate moves by using VF where the original
> arguments can be represented exactly as an 8-bit float.  That would
> probably help in many more cases than hand-optimizing a couple of
> built-in operations -- That said, I guess it doesn't hurt to do this for
> the time being until we have such an optimization pass.

A pass to emit VF immediates rather than 4x immediate moves seems like
a good plan. Eric tried it (see the vf-immediates branch of his tree)
but he ran into some problems -- specifically what do you do for
constant folding when the result isn't also representable in VF. I
think his approach of emitting VF immediates while generating the
backend IR might lead to more problems than, say, having an
optimization that runs after everything else and changes some
immediate moves to VF moves.
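
To make the "representable in VF" condition concrete, here is a rough
sketch of the check such a pass would need. It assumes the VF format is
a restricted 8-bit float with 1 sign bit, 3 exponent bits (bias 3) and
4 mantissa bits, with the four components packed low byte first; that
is my understanding of the hardware, not something stated in this
thread. The names float_to_vf and pack_vf4 are made up for
illustration and are not Mesa functions.

```python
import struct


def float_to_vf(value):
    """Return the 8-bit VF encoding of value, or None if it is not exact.

    Assumed layout: bit 7 sign, bits 6:4 exponent (bias 3), bits 3:0 mantissa.
    """
    try:
        packed = struct.pack("<f", value)
    except OverflowError:
        return None  # too large even for a 32-bit float
    # Reject values that are not exactly representable as a 32-bit float
    # (this also rejects NaN, which never compares equal to itself).
    if struct.unpack("<f", packed)[0] != value:
        return None

    bits = struct.unpack("<I", packed)[0]
    sign = bits >> 31

    # +0.0 and -0.0 are special-cased as 0x00 and 0x80.
    if value == 0.0:
        return sign << 7

    exponent = ((bits >> 23) & 0xFF) - 127 + 3  # rebias from 127 to 3
    mantissa = bits & 0x7FFFFF

    # Only the top 4 mantissa bits may be set, and the exponent needs 3 bits.
    if mantissa & 0x7FFFF or not 0 <= exponent <= 7:
        return None

    vf = (sign << 7) | (exponent << 4) | (mantissa >> 19)

    # 0.125 would collide with the encoding of zero, so it is rejected.
    if vf & 0x7F == 0:
        return None
    return vf


def pack_vf4(values):
    """Pack four floats into one 32-bit VF immediate, or return None."""
    encoded = [float_to_vf(v) for v in values]
    if len(encoded) != 4 or any(e is None for e in encoded):
        return None
    return encoded[0] | (encoded[1] << 8) | (encoded[2] << 16) | (encoded[3] << 24)


if __name__ == "__main__":
    # The shift counts used by the unpack built-ins all fit.
    print(hex(pack_vf4([0.0, 8.0, 16.0, 24.0])))
    # Constant folding can leave the representable set: 24.0 + 31.0 = 55.0.
    print(float_to_vf(24.0), float_to_vf(31.0), float_to_vf(24.0 + 31.0))
```

Under those assumptions the shift counts 0, 8, 16 and 24 all encode
exactly, while the sum of two representable values (24.0 + 31.0) does
not, which is the constant-folding problem described above.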

Unfortunately a pass to do just this wouldn't be anywhere near
sufficient to optimize these built-ins. You'd also need to have a pass
to combine multiple scalar operations into vector operations, which
I've implemented in the GLSL compiler but only for operations on
different components of the same variable.

For these built-ins, the vectorization pass would have to combine
multiple scalar operations operating on /different/ registers, which
sounds like a really hard problem. I've noticed a place this would
help in at least one real vertex shader -- it did this (with some
intervening instructions):

dp3(8)          g14<1>.xF       g9<4,4,1>.xyzzF g9<4,4,1>.xyzzF
dp3(8)          g17<1>.xF       g10<4,4,1>.xyzzF g10<4,4,1>.xyzzF
dp3(8)          g20<1>.xF       g11<4,4,1>.xyzzF g11<4,4,1>.xyzzF
math sqrt(8)    g16<1>F         g14<4,4,1>.xF   null
math sqrt(8)    g25<1>F         g20<4,4,1>.xF   null
math sqrt(8)    g19<1>F         g17<4,4,1>.xF   null

We could have written those (independent!) dp3 results into the .xyz
channels of one register and then done a single sqrt instruction. It would
have cut two instructions, but it also would have added a bunch of
extra dependencies between otherwise independent instructions and
would have lengthened some live ranges.
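
As a toy model of that trade-off (plain Python, not compiler code; the
function names and the "emitted" instruction lists are invented for
illustration), the two forms compute the same lengths but differ in
instruction count:

```python
import math


def dp3(a, b):
    """Three-component dot product, like the dp3 instruction."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def lengths_scalar(v0, v1, v2):
    """Current code: each dp3 result gets its own register and its own sqrt."""
    emitted = []
    results = []
    for v in (v0, v1, v2):
        d = dp3(v, v)
        emitted.append("dp3")
        results.append(math.sqrt(d))
        emitted.append("math sqrt")
    return results, emitted


def lengths_packed(v0, v1, v2):
    """Vectorized form: dp3 results land in .x/.y/.z of one register."""
    emitted = []
    packed = [0.0, 0.0, 0.0]
    for channel, v in enumerate((v0, v1, v2)):
        packed[channel] = dp3(v, v)
        emitted.append("dp3")
    # One sqrt over the .xyz channels replaces three scalar sqrts, but it
    # cannot issue until all three dp3 results are available.
    results = [math.sqrt(c) for c in packed]
    emitted.append("math sqrt")
    return results, emitted
```

The packed version emits four instructions instead of six, at the cost
of making the single sqrt depend on all three dp3's.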

So what I'm saying is -- yeah, having a pass that optimized what we
have now into what we have after this series in a generic way would be
great! Unfortunately it would also be an immense amount of work that
might not end up being anything more than an open-ended research
project. If only I could find my magic wand...