[Mesa-dev] [PATCH 00/19] i965/fs: load_payload on Gen7+.

Wed Jun 11 11:59:43 PDT 2014

On Tue, Jun 10, 2014 at 3:34 PM, Matt Turner <mattst88 at gmail.com> wrote:
> On Wed, May 28, 2014 at 4:44 PM, Connor Abbott <cwabbott0 at gmail.com> wrote:
>> On Tue, May 27, 2014 at 9:47 PM, Matt Turner <mattst88 at gmail.com> wrote:
>>> Here's a respin of my load_payload series from mid-April with some
>>> feedback from Ken addressed and some bugs fixed.
>>>
>>> This series is available in my tree (with a few unrelated patches
>>> before it)
>>>
>>>    git://people.freedesktop.org/~mattst88/mesa tex-sources
>>>
>>> This is a prep series for implementing SSA in the i965 fragment
>>> shader backend.
>>
>> I wonder how SSA in the i965 backend will interact with my SSA in GLSL
>> IR series?
>
> I'm not really sure. Of course it'd be nice to not have a bunch of
> similar code in src/glsl and the i965 driver.
>
> I've got a bunch of patches on the cfg branch of my tree
>
>      git://people.freedesktop.org/~mattst88/mesa cfg
>
> that start implementing SSA in the fs backend. It's got code to create
> the dominance tree and dominance frontier sets, to insert Phi nodes,
> and to rename variables. I've also adapted my local value numbering
> pass to operate globally, but of course it requires global code motion
> to perform fix ups.

Nice! The way it works in GLSL IR is much worse, unfortunately,
because we don't have the cfg or the dominance tree (see my
discussions with Eric on that...) so we have to basically fake it in a
really tricky way. Inspired by those difficulties, and by the fact
that many SSA passes (like value numbering) require a flat IR without
expression trees, I took up Ken's suggestion and started to design
another IR that would have all the properties we want. Right now, it's
basically just a header file, but if the experiment works out then
maybe we can consider going GLSL IR -> new IR -> backend instead... it
would require rewriting the code to lower the IR to FS instructions,
but it should be a *lot* easier this time. Anyways, I'd be glad to
discuss all this in person next week.

>
> I haven't really started on translating out of SSA, because it's been
> hard to get interested when performing register allocation from SSA
> seems so appealing. That's obviously a sizeable amount of work.

Yeah... I just got the motivation to read "Register allocation for
programs in SSA-form," and it seems really interesting. Coalescing and
phi node elimination seem difficult though, especially if you don't
have a swap instruction (apparently register allocation is only
NP-complete if you don't have a swap instruction... who'd-a thunk
that!)

>
> Predication is another problem I need to handle. Gated SSA was
> suggested to me as the best solution, with some other nice properties
> that allow things like divergence analysis.

I looked at that, and it doesn't seem very interesting - it doesn't
really do much for us thanks to the extra information that the
structured control flow gives to us. The only interesting thing there
are the alpha statements, which allow one to represent arrays in SSA
form. This isn't a new idea - the same thing was suggested in the
original to-SSA paper by Cytron et. al. where it was called the
"Update" operation - and I'm not sure how useful that is. The problem
is that you may wind up with extra copies of array variables after
conversion out of SSA, in which case you'd have to add a loop to copy
each element of the array, which is a lot more expensive than
introducing copies of non-array variables.

The solution I was thinking of goes something like this: flatten
if-statements in SSA form, adding a predicate to each instruction in
the if (either the condition for the then statements or its inverse
for the else statements) and converting phi nodes to conditional
selects. This should probably be done right before conversion out of
SSA (or register allocation) so other passes don't have to deal with
the predicates. Then, while converting out of SSA or doing coalescing
during SSA-based register allocation, add the following condition to
the test if two SSA variables interfere (the one proposed by  Budimlic
et. al. and extended by Boissinot, Darte, and Rastello):

If A is defined in the same basic block as B, A dominates B, and A is
dead at the end of the basic block (i.e. it's not part of the live-out
set), then normally we would go backwards from the end of the basic
block to find a use of A that would make A live during the definition
of B (and therefore, A and B would interfere). But if the instruction
where A is used has a complementary predicate to the instruction where
B is defined, then A and B do not interfere and we can ignore the use
and move on. Here, two predicates are "complementary" if it is
impossible for both to be true at once. Note that special care needs
to be taken with conditional selects: if we have C = csel(A, B, pred),
then the use of A is predicated by pred and the use of B is predicated
by !pred.

Finally, for each conditional select, try and coalesce the sources of
each conditional select together to turn it into a move (and if that
succeeds, then try and coalesce the sources with the destination).
This can be done at the same time that the program is being converted
to CSSA form - essentially, "phi congruence classes" become
"phi-and-conditional-select congruence classes" (since, after all, the
conditional selects used to be phi nodes).

This allows us to handle programs like:

if (pred) {
   A = ...
} else {
   A = ...
}

which, after SSA-ification and if-flattening, becomes:

(pred) A_1 = ...
(!pred) A_2 = ...
A_3 = csel(A_1, A_2, pred)

The only use of A_1 has predicate "pred", while the definition of A_2
has predicate "!pred," so A_1 and A_2 do not interfere and they can be
coalesced together, eliminating the csel.

Another example:

A = ...
... = A
if (pred) {
   A = ...
} else {
   ... = A
}

which becomes:

A_1 = ...
... = A_1
(pred) A_2 = ...
(!pred) ... = A_1
A_3 = csel(A_2, A_1, pred)

Again, the only uses of A_1 after A_2 is defined have predicate
"!pred," so A_1 and A_2 don't interfere.

Determining if two variables are complementary may be complicated: two
SSA variables pred1 and pred2 can be a series of ands:

pred1 = A and B and C and ....
pred2 = D and E and F and ....

and if any one of the possible pairs are opposite (A = !D, A = !E, B =
!D, etc.), then pred1 and pred2 will be complimentary. Typically, A,
B, C, etc. will be the conditions of former if-statements. The first
step is to figure out which variables are opposite, i.e. A = !B. Cases
like A = (foo > bar) and B = (bar >= foo) also need to considered
here. So the algorithm, similar to value numbering, goes like:

for each comparison A (<, >=, ==, !=) B:
   let a_index be the index of A, b_index be the index of B, and cond
be the condition
   if there exists an instruction with (!cond, a_index, b_index) in
the hash table, then it and the current instruction are opposite
   insert the instruction into the hash table, with (cond, a_index, b_index)

Next, to find the complementary variables:

Start by marking all the opposite pairs of variables as complementary.
for each and-statement C = A and B, in the order that they are defined
(i.e. in reverse post-order):
   for each variable D that is complementary to A:
      mark C and D as complementary
   for each variable E that is complementary to B:
      mark C and E as complementary

>
>> I've recently re-started work on this after my school work
>> started winding down, and as of now here are work items remaining I
>> can think of:
>>
>> 1) Add unit tests for the conversion to SSA
>> 2) Make the conversion out of SSA not suck as much, esp. with respect
>> to writemasked operations (this is pretty difficult, as apparently
>> it's still an unsolved problem that to my knowledge no one from
>> academia has tried to tackle...)
>> 3) Add some trivial SSA-based optimizations (dead code elimination,
>> copy propagation)
>> 4) Add more complicated optimizations like CSE, constant propagation,
>> GVN-GCM... I believe some of these, especially value numbering, depend
>> on a flat IR to work, so this might be a lot harder to do.
>>
>> I propose that I just do #1, post a new patch series with Paul's
>> comments on the original one addressed (that would probably take less
>> than a week), and then you can use the opt_to_ssa pass to avoid having
>> to duplicate that logic that in the backend since it's pretty
>> complicated. So it would look like:
>>
>> GLSL IR -> opt_to_ssa -> SSA-ified GLSL IR -> i965 fragment shader backend
>
> That sounds good. I'd even settle for getting non-SSA GLSL IR in the
> backend. I expect SSA GLSL IR will allow lots of our optimizations to
> be more effective, and I've got a shader from a benchmark that I could
> cut a pile of instructions out of if only it didn't unnecessarily
> reuse a variable. SSA would totally fix that.
>
>> We would have to modify the pass that scalarizes the GLSL IR to work
>> with phi nodes and ir_quadop_vector expressions, but that wouldn't be
>> too difficult. That way we would have an immediate user of the SSA
>> stuff while being able to make it useful for everyone else in the long
>> term.
>
> Sure. Except I've written the to-SSA code already, and am lacking the
> from-SSA equivalent. :)