[Mesa-dev] [PATCH 0/3] [RFC] mesa/st: glsl_to_tgsi: improved temp-reg lifetime estimation

Mon Jun 12 18:52:51 UTC 2017

On 12.06.2017 14:34, Gert Wollny wrote:
> Am Montag, den 12.06.2017, 12:28 +0200 schrieb Nicolai Hähnle:
>> On 10.06.2017 01:15, Gert Wollny wrote:
>>>
>> Plenty of style issues aside, can you explain where and why you get
>> tighter lifetimes?
> 
> In the original code if a temporary is used within a loop it gets the
> whole life time of the loop assigned.
> 
> With this patch I track in more detail where a temporary is accesses
> and base the lifetime on this: For instance, if a variable is first
> unconditionally written and later read for the last time in the same
> scope (loop, if, or switch branch), then the lifetime can be restricted
> to that first-written - last-read range.
> 
> The code gets more complex because it tries to resolve this also for
> nested scopes, and one also has to take care about whether a variable
> is written only conditionally within a loop, or conditionally read
> before it is written (in the source code sense, but not in the program
> flow sense).
> 
> Shaders that profit from this better lifetime estimation are the ones
> that have many short living values within long loops, think
> 
> for () {
>    float4 t[1] = f(in2);
>    float4 t[2] = g(in1);
>    float4 t[3] = op(t[1], t[2]);
>     ...
>    sum += t[200];
> }
>   
> Here the old code would keep all of the t[i] alive for the whole loop,
> and in fact with the GpuTest Piano benchmark I have seen a shader with
>> 2000 temporaries where many were used in a loop but only required for
> two or three lines so that my code could merge them to less then 100
> temporaries, and this made it possible for the tgsi to bytecode layer
> in r600g to actually translate the shader.

Okay. I think you should seriously re-think your algorithm in a way that 
makes it a more natural evolution from the algorithm that's already there.

Basically, the current algorithm tracks (first_write, last_read), so 
think about what you need to track in order to obtain a single-pass 
algorithm that computes lifetime (first, last) for every temporary. I 
think the following should do it for a first cut:

    struct st_calculate_lifetime_state {
       /* First and last instruction that has this temporary */
       unsigned first;
       unsigned last;

       /* First instruction of the outer-most in-active loop that
        * contains this temporary. (A loop is active if we're
        * currently processing instructions in it.)
       unsigned loop_first;

       /* Position of a read without preceding dominating write. */
       unsigned undef_read[4];

       /* First write in the program that is dominating our
        * current position, per channel.
        */
       unsigned first_dominating_write[4];
    };

In addition, you need to keep a stack of active scopes (loops and ifs), 
but you really only need to remember the start of the scope (and for 
loops, probably the position of the first BREAK).

Here's a sketch of the "state machine" that you need to run while 
traversing the program, assuming no BRK and CONT:

Init: last = 0, everything else = ~0

These are updates on individual variables on use:

On any use (source or dest):
- if first > cur_pos, set first = loop_first = cur_pos
- if loop_first < first, set first = loop_first
- update last

On use as source:
- if first_dominating_write > cur_pos and undef_read > cur_pos, set 
undef_read = cur_pos

On use as dest:
- if first_dominating_write > cur_pos, set first_dominating_write = cur_pos

These are updates of all temporaries on scope change:

On ENDLOOP, for all temps:
- if loop_first > start of loop, set loop_first = start of loop
- first < start of loop, update last to end of loop
- if undef_read between start and end of loop: set first = MIN(first, 
start of loop) and last = end of loop
- if first_dominating_write < end of loop, set undef_read = ~0

On ELSE and ENDIF, for all temps:
- if first_dominating_write > start of scope, set first_dominating_write 
= ~0

I'm not sure right now whether BREAK / CONT need special treatment at 
all. I think what you need is:

On ENDLOOP:
- if first_dominating_write is between the first BREAK in the loop and 
the end of the loop, set first_dominating_write = ~0

And for CONT, you probably don't really need anything, because CONTs 
cannot make you skip code forever.

What this state machine doesn't yet cover is

IF ..
   MOV TEMP[0], ...
ELSE
   MOV TEMP[0], ...
ENDIF

Still, I'd start with it and see whether you need to cover that case.

And even that case can probably be dealt with in a fairly efficient and 
pragmatic way. The idea is to keep track of the nesting level of IFs, 
plus an

    uint32 dominating_write_in_true_block[4];

per temp. Then:

On ELSE:
- if first_dominating_write < cur_pos, set the bit corresponding to the 
current nesting level in dominating_write_in_true_block

On ENDIF:
- don't reset first_dominating_write if the bit corresponding to the 
current nesting level in dominating_write_in_true_block is set
- unconditionally clear that bit

It won't cover cases with nesting level > 32, but do we really care?

I hope I didn't miss anything, because after all this is admittedly 
subtle stuff. Still, I think this kind of state-machine approach should 
work and allow you to avoid *lots* of allocations and pointer-chasing.

Cheers,
Nicolai

> 
> Best,
> Gert
> 
> PS: Regarding style, I am fully aware that I have to iterate over this
> code a few more times. I tried to adhere to the way the existing code
> represents itself to me, but I'm happy to listen to any advise I can
> get.
> 
> In any case, I though it might be best to send this patch out now for
> discussion. Now, with the unit tests in place I will rework it and
> focus also more on style questions. One thing that comes up immediately
> up is that I will try to reduce the use of dynamically allocated
> memory, since 60% of the run time of my code is with memory allocation
> and de-allocation.
> 
> 

-- 
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.