[Mesa-dev] [PATCH 0/3] [RFC] mesa/st: glsl_to_tgsi: improved temp-reg lifetime estimation
Nicolai Hähnle
nhaehnle at gmail.com
Mon Jun 12 18:52:51 UTC 2017
On 12.06.2017 14:34, Gert Wollny wrote:
> Am Montag, den 12.06.2017, 12:28 +0200 schrieb Nicolai Hähnle:
>> On 10.06.2017 01:15, Gert Wollny wrote:
>>>
>> Plenty of style issues aside, can you explain where and why you get
>> tighter lifetimes?
>
> In the original code if a temporary is used within a loop it gets the
> whole life time of the loop assigned.
>
> With this patch I track in more detail where a temporary is accesses
> and base the lifetime on this: For instance, if a variable is first
> unconditionally written and later read for the last time in the same
> scope (loop, if, or switch branch), then the lifetime can be restricted
> to that first-written - last-read range.
>
> The code gets more complex because it tries to resolve this also for
> nested scopes, and one also has to take care about whether a variable
> is written only conditionally within a loop, or conditionally read
> before it is written (in the source code sense, but not in the program
> flow sense).
>
> Shaders that profit from this better lifetime estimation are the ones
> that have many short living values within long loops, think
>
> for () {
> float4 t[1] = f(in2);
> float4 t[2] = g(in1);
> float4 t[3] = op(t[1], t[2]);
> ...
> sum += t[200];
> }
>
> Here the old code would keep all of the t[i] alive for the whole loop,
> and in fact with the GpuTest Piano benchmark I have seen a shader with
>> 2000 temporaries where many were used in a loop but only required for
> two or three lines so that my code could merge them to less then 100
> temporaries, and this made it possible for the tgsi to bytecode layer
> in r600g to actually translate the shader.
Okay. I think you should seriously re-think your algorithm in a way that
makes it a more natural evolution from the algorithm that's already there.
Basically, the current algorithm tracks (first_write, last_read), so
think about what you need to track in order to obtain a single-pass
algorithm that computes lifetime (first, last) for every temporary. I
think the following should do it for a first cut:
struct st_calculate_lifetime_state {
/* First and last instruction that has this temporary */
unsigned first;
unsigned last;
/* First instruction of the outer-most in-active loop that
* contains this temporary. (A loop is active if we're
* currently processing instructions in it.)
unsigned loop_first;
/* Position of a read without preceding dominating write. */
unsigned undef_read[4];
/* First write in the program that is dominating our
* current position, per channel.
*/
unsigned first_dominating_write[4];
};
In addition, you need to keep a stack of active scopes (loops and ifs),
but you really only need to remember the start of the scope (and for
loops, probably the position of the first BREAK).
Here's a sketch of the "state machine" that you need to run while
traversing the program, assuming no BRK and CONT:
Init: last = 0, everything else = ~0
These are updates on individual variables on use:
On any use (source or dest):
- if first > cur_pos, set first = loop_first = cur_pos
- if loop_first < first, set first = loop_first
- update last
On use as source:
- if first_dominating_write > cur_pos and undef_read > cur_pos, set
undef_read = cur_pos
On use as dest:
- if first_dominating_write > cur_pos, set first_dominating_write = cur_pos
These are updates of all temporaries on scope change:
On ENDLOOP, for all temps:
- if loop_first > start of loop, set loop_first = start of loop
- first < start of loop, update last to end of loop
- if undef_read between start and end of loop: set first = MIN(first,
start of loop) and last = end of loop
- if first_dominating_write < end of loop, set undef_read = ~0
On ELSE and ENDIF, for all temps:
- if first_dominating_write > start of scope, set first_dominating_write
= ~0
I'm not sure right now whether BREAK / CONT need special treatment at
all. I think what you need is:
On ENDLOOP:
- if first_dominating_write is between the first BREAK in the loop and
the end of the loop, set first_dominating_write = ~0
And for CONT, you probably don't really need anything, because CONTs
cannot make you skip code forever.
What this state machine doesn't yet cover is
IF ..
MOV TEMP[0], ...
ELSE
MOV TEMP[0], ...
ENDIF
Still, I'd start with it and see whether you need to cover that case.
And even that case can probably be dealt with in a fairly efficient and
pragmatic way. The idea is to keep track of the nesting level of IFs,
plus an
uint32 dominating_write_in_true_block[4];
per temp. Then:
On ELSE:
- if first_dominating_write < cur_pos, set the bit corresponding to the
current nesting level in dominating_write_in_true_block
On ENDIF:
- don't reset first_dominating_write if the bit corresponding to the
current nesting level in dominating_write_in_true_block is set
- unconditionally clear that bit
It won't cover cases with nesting level > 32, but do we really care?
I hope I didn't miss anything, because after all this is admittedly
subtle stuff. Still, I think this kind of state-machine approach should
work and allow you to avoid *lots* of allocations and pointer-chasing.
Cheers,
Nicolai
>
> Best,
> Gert
>
> PS: Regarding style, I am fully aware that I have to iterate over this
> code a few more times. I tried to adhere to the way the existing code
> represents itself to me, but I'm happy to listen to any advise I can
> get.
>
> In any case, I though it might be best to send this patch out now for
> discussion. Now, with the unit tests in place I will rework it and
> focus also more on style questions. One thing that comes up immediately
> up is that I will try to reduce the use of dynamically allocated
> memory, since 60% of the run time of my code is with memory allocation
> and de-allocation.
>
>
--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
More information about the mesa-dev
mailing list