[Mesa-dev] [PATCH 6/9] nir: Make dead_write_vars pass global

Mon Aug 27 21:38:58 UTC 2018

Jason Ekstrand <jason at jlekstrand.net> writes:

> On Sat, Aug 25, 2018 at 9:40 AM Jason Ekstrand <jason at jlekstrand.net> wrote:
>
>> Sorry I haven't given you any in-line review yet.  I've been letting this
>> pass simmer in my brain a bit and thinking about complexity and data
>> structures.  Here's a few questions for which I do not expect you to have
>> actual answers:
>>
>>  1) How many unique derefs should we expect to see in a shader?  Dozens?
>> Hundreds?  Thousands?  My guess is that, after vars_to_ssa has picked off
>> all the easy ones, it should be on the order of a couple hundred at most
>> but I don't actually know what a reasonable upper limit to assume is.

My assumption here was the number was small (dozens), and would be
potentially increase as we move other intrinsics to use derefs
(e.g. SSBO).

>>  2) Is every deref going to be eventually compared to every other deref or
>> are we going to have separate little groups of derefs?

My assumption here is that we don't expect to compare all the derefs,
but only little groups.  This affected my choice.

I can verify both assumptions with shader-db, but I expect scenario
change as we add support for more intrinsics to use derefs.

>> I'm asking these questions because I have some concern about how expensive
>> this pass (and copy prop) is going to be.  If you have a shader with a huge
>> number of loops each of which has a very self-contained set of derefs, the
>> current pass as-is probably has an efficiency that's around O(n) in the
>> number of instructions.  However, in the case where most derefs are in play
>> all the time, then I supsect this pass as-is looks more like O(n^3) in the
>> number of derefs or worse.

I did post this with a suboptimal deref_map that had the sufficient
interface so the rest of the code could work.  With the expected
implementation (a tree, similar to lower_vars_to_ssa) we have a better
behavior.  Assuming n derefs with at most m size (in general m is small
enough so we don't care):

- "may alias": O(m), walk tree and deref_path in parallel.

- "intersection": O(n * m), by walking both trees in parallel and marking the
  common bits.

- "union": O(n * m), by walking both trees in parallel and adding the
  missing bits to the target.

- "subtraction": O(m)
- "subtraction map": O(n * m), walking both trees in parallel

So with some simplification, these are the complexities we are looking
at.

- local pass: O(I * n * m) number of instructions times the number of
  derefs, to build all the maps.  Which can be simplified to O(n^2 * m)
  ~ O(n^2).

- data flow iteration: for each block (with strech, is proportional to
  n), perform map operations, so we end up with O(n^2 * m).

- final pass: for each block, for each "unused write", check if the OUT
  map contains it. O(n^2 * m).

In retrospect I should've been more clear on what expect from deref_map
now (and later).  And of course I might be overlooking something :-)

>>
>> If the total number of unique derefs is reasonable AND the graph of deref
>> comparisons isn't too sparse, then I wonder if it wouldn't be better to
>> have some intermediate data structure like this:
>>
>> struct deref_compare_map {
>>    void *mem_ctx;
>>
>>    struct hash_table *id_map;
>>
>>    nir_deref_path *paths;
>>    BITSET_WORD **alias;
>>    BITSET_WORD **contains;
>>    BITSET_WORD **contained_in;
>> };

I did explore something _similar_ for this pass: collecting the relevant
derefs, keep them in an array, use their indices in the bitsets
(combined with the components); plus bootstrap a "matrix" to compare
them.  Because I had the "little groups" assumption in mind, I was even
toying with lazy initialize the matrix, so only really compare if we
need to.

(...)

>> The idea would be to make it a 5-step pass:
>>
>>  1) find all derefs
>>  2) Bake the deref compare map
>>  3) Local elimination and building kill etc. sets.
>>  4) Data-flow
>>  5) global elimination
>>
>> Doing that would give us some better run-time guarantees:
>>
>>  1) Set operations on bitsets are O(n) (where n is actually
>> DIV_ROUND_UP(n, 32) thanks to bitsets)
>>  2) We call nir_deref_path_init exactly num_derefs times for the entire
>> pass
>>  3) We call nir_compare_deref_paths exactly (num_derefs * (num_derefs -
>> 1)) / 2 times for the entire pass.

(...)

>> As it is, deref_map_add_map is O(n^2) comparisons and we do that operation
>> inside our fixed-point loop which has an unknown number of iterations.
>> This data structure could easily be put in nir_derefs.c and shared between
>> this pass and copy-prop.

Yep, as-is deref_map_add_map is not good, but the plan was replace it
with something better.  Seems to me it can be implemented in O(n).

(...)

> Also, if we're going to share it between passes, it might be good to have
> nir_deref_compare_map_init take a shader and do the walk to find all derefs
> and the bake in a single step and we can avoid exposing quite as many
> details to the user.

That is a good point and might justify the work to pre bake the compare
map.  I'll think more about it.  Also, it wouldn't hurt if we could get
some space in the derefs themselves to store their index in the compare
map.  Maybe something like the pass_flag would be enough.

(Reordered this paragraph for the sake of the reply...)

>> Just a thought.  Of course, you've spent far more time playing with this
>> than I have so I may be completely off-base.  Before you go off and
>> re-write it, let's have a little e-mail discussion to decide if we think
>> the rewrite is worth it.  :-)

Given what is planned for deref_map (tree-based implementation), are you
still worried about algorithm complexity?  I'm leaning towards having
some implementation, then make new deref types, so we can get better
data.

Thanks,
Caio