[Mesa-dev] r600g: status of my work on the shader optimization

Wed Feb 13 18:04:57 PST 2013

Hi,

Last month I finally found the time to work on the rewrite of my
previous shader optimization branch. It is now mostly done in terms of
the correctness of the produced code and feature support (at least on
evergreen), though it's still a work in progress in terms of the
efficiency of the generated shader code and of the backend itself.

I spent some time last year studying the LLVM infrastructure and the
R600 LLVM backend and trying to improve it, but in the end I came to the
conclusion that for me it might be easier to implement everything I
wanted in a custom backend. This allows for a simpler and more efficient
implementation - e.g. I don't have to deal with CFGs because we in fact
have structured code, so simpler and more efficient algorithms can be
used.
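
To illustrate the point about structured code, here is a minimal sketch
(in Python, not taken from the branch) of liveness analysis done by
plain recursion over a tree of nested regions instead of over a CFG. The
node shapes ("op", "seq", "if", "loop") are hypothetical, and loops are
modelled without break/continue to keep it short.

```python
# Hypothetical region tree, invented for this sketch:
#   ("op", dst, [srcs])          one instruction
#   ("seq", [nodes])             nodes executed in order
#   ("if", cond, then, else)     structured branch
#   ("loop", body)               structured loop, no break/continue

def live_in(node, live_out):
    """Return the set of registers live on entry to `node`."""
    kind = node[0]
    if kind == "op":
        _, dst, srcs = node
        return (live_out - {dst}) | set(srcs)
    if kind == "seq":
        live = set(live_out)
        for child in reversed(node[1]):  # walk backwards through the region
            live = live_in(child, live)
        return live
    if kind == "if":
        _, cond, then_node, else_node = node
        return {cond} | live_in(then_node, live_out) | live_in(else_node, live_out)
    if kind == "loop":
        live = set(live_out)
        while True:  # iterate to a fixed point; the set only grows
            new = live | live_in(node[1], live)
            if new == live:
                return live
            live = new
    raise ValueError("unknown node kind: %r" % (kind,))


prog = ("seq", [
    ("op", "a", ["x"]),
    ("if", "c", ("op", "b", ["a"]), ("op", "b", ["y"])),
    ("op", "out", ["b"]),
])
print(sorted(live_in(prog, set())))
```

The only place that needs iteration is the loop node; everything else is
a single backward walk over the tree.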

Currently the branch has no regressions with piglit's quick-driver.tests
on evergreen (unlike the previous branch, it doesn't rely on a fallback
to unoptimized code for shaders with relative addressing and other
cases), and so far I don't see any rendering issues with the apps that
I used for testing - Lightsmark 2008, Unigine Heaven 3.0 and some
others. There are also some performance improvements with GPU-bound
apps.

I tried to keep in mind the differences between chip classes, so I hope
it will only require minor fixes to work on non-evergreen chips, but I
doubt that it will work out of the box - support for some
non-evergreen hw-specific features is still missing, e.g. I'm sure that 
indirect addressing currently won't work on R6xx, though basic tests 
might work in theory. Fixing this shouldn't require a lot of work though.

The branch can be found in my freedesktop repo:

http://cgit.freedesktop.org/~vadimg/mesa/log/?h=r600-sb

Regarding the differences from the previous branch - there are some
additional optimizations, e.g. global value numbering with some basic
support for constant folding (not all instructions are currently
handled, but it's easy to extend), global code motion that can hoist
invariant code out of loops, etc. Some optimizations that were
implemented in the previous branch are not implemented in the new branch
(yet), e.g. propagation of modifiers (I'm not even sure if it has any
noticeable effect on performance).
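
As a rough illustration of value numbering with constant folding, here
is a minimal Python sketch. It is not the code from the branch: it only
handles a straight-line instruction list (the branch does this globally),
it assumes SSA form, and the tuple-based IR and the names FOLDERS,
COMMUTATIVE and value_number are made up for the example.

```python
# Hypothetical mini-IR: (dst, op, src0, src1). Sources are register names
# (str) or literal constants (float). SSA form is assumed, i.e. every
# register is written exactly once.
FOLDERS = {"ADD": lambda a, b: a + b, "MUL": lambda a, b: a * b}
COMMUTATIVE = {"ADD", "MUL"}

def value_number(instrs):
    consts = {}  # register -> constant it is known to hold
    alias = {}   # register -> earlier register holding the same value
    table = {}   # (op, src0, src1) -> register that first computed it
    out = []

    def resolve(x):
        # Replace a register by its representative, then by its constant.
        if isinstance(x, str):
            x = alias.get(x, x)
            return consts.get(x, x)
        return x

    for dst, op, a, b in instrs:
        a, b = resolve(a), resolve(b)
        both_const = not isinstance(a, str) and not isinstance(b, str)
        if both_const and op in FOLDERS:
            consts[dst] = FOLDERS[op](a, b)      # constant folding
            out.append((dst, "MOV", consts[dst], None))
            continue
        if op in COMMUTATIVE:
            a, b = sorted((a, b), key=repr)      # canonical operand order
        key = (op, a, b)
        if key in table:
            alias[dst] = table[key]              # redundant computation
            out.append((dst, "MOV", table[key], None))
        else:
            table[key] = dst
            out.append((dst, op, a, b))
    return out


shader = [
    ("t0", "ADD", 2.0, 3.0),
    ("t1", "MUL", "r0", "t0"),
    ("t2", "MUL", "t0", "r0"),   # same value as t1
    ("t3", "ADD", "t1", "t2"),
]
for ins in value_number(shader):
    print(ins)
```

The MOVs left behind in this sketch would normally be cleaned up by
later copy propagation and dead code elimination. Supporting more
instructions amounts to adding entries to the FOLDERS table.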

Unlike the previous branch, there is support for indirect addressing on
registers. Currently it uses my previously posted patch (which was not
very welcome) to obtain the information about addressable register
ranges, but that patch is not required and can be dropped, I just used
it for testing. Without that information the opportunities for
optimization are limited though, so perhaps it makes sense not to try
to optimize shaders with indirect gpr addressing at all and to rely on
the old backend until we have a proper solution for passing that
information to the drivers.

There is also initial support for ALU predication, but it's not complete
and currently unused. I'm not sure that predication support will have
a significant enough effect on performance to justify more complex and
expensive algorithms in the register allocator and scheduler. I'll
probably look into it later, but I consider it a low priority. In the
case of predicated source code (from the LLVM backend) the predication
is eliminated using speculative execution and conditional moves, the
same as with the simple if-conversion pass that is also implemented.
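
For readers unfamiliar with if-conversion, the idea can be sketched
roughly as follows (Python, not the code from the branch). Both sides of
the branch are executed speculatively into temporaries, then a
conditional move picks the result. The IR, the if_convert function and
the "SELECT" opcode are hypothetical stand-ins; on real hardware the
last step would be one of the conditional move instructions.

```python
# Hypothetical IR: (dst, op, (srcs...)). Only valid for side-effect-free
# ALU instructions, since both branches get executed unconditionally.

def rename_branch(ops, tag):
    """Rewrite one branch so it writes to private temporaries."""
    renamed = {}  # original dst -> temporary holding its value
    out = []
    for dst, op, srcs in ops:
        srcs = tuple(renamed.get(s, s) for s in srcs)
        tmp = "%s.%s" % (dst, tag)
        renamed[dst] = tmp
        out.append((tmp, op, srcs))
    return out, renamed

def if_convert(cond, then_ops, else_ops):
    then_code, then_map = rename_branch(then_ops, "then")
    else_code, else_map = rename_branch(else_ops, "else")
    out = then_code + else_code  # speculative execution of both sides
    # A register written on only one side keeps its old value on the other.
    for dst in dict.fromkeys(list(then_map) + list(else_map)):
        out.append((dst, "SELECT",
                    (cond, then_map.get(dst, dst), else_map.get(dst, dst))))
    return out


flat = if_convert(
    "c",
    then_ops=[("x", "ADD", ("a", "b"))],
    else_ops=[("x", "SUB", ("a", "b")), ("y", "MUL", ("x", "a"))],
)
for ins in flat:
    print(ins)
```

Whether this pays off depends on the size of the branches, since the
work of both sides is always done.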

The branch currently uses as its source the bytecode built by the old
backend (which may also come from the LLVM backend) plus some additional
information (about inputs etc.); the final bytecode is built by the new
builder in the branch. Building two versions of the bytecode doesn't
look very efficient, but currently it simplifies debugging. I'm planning
to implement translation from TGSI directly to my representation, which
should simplify the translator and make it possible to get rid of
unnecessary intermediate passes.

Some old and new environment variables can be used to control the 
behavior of this backend:

R600_SB - 0 - disable new backend completely, 1 - enable (default)
R600_SB_USE_NEW_BYTECODE - 0 - disable use of the produced bytecode 
(useful if you only want to look at the dump of the optimized shader 
without passing it to hw), 1 - enable (default)
R600_DUMP_SHADERS - will also dump the disassembly of the optimized 
shader after the original bytecode (if the backend is not disabled with 
R600_SB=0).
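
As a usage sketch, a small Python helper could launch a test program
with these variables set. The helper names sb_env and run_with_sb are
made up for this example, and the value "1" for R600_DUMP_SHADERS is an
assumption, since only the variable name is given above.

```python
import os
import subprocess

def sb_env(enable=True, use_new_bytecode=True, dump_shaders=False):
    """Build environment overrides for the variables listed above."""
    env = {
        "R600_SB": "1" if enable else "0",
        "R600_SB_USE_NEW_BYTECODE": "1" if use_new_bytecode else "0",
    }
    if dump_shaders:
        env["R600_DUMP_SHADERS"] = "1"  # assumed value, see note above
    return env

def run_with_sb(cmd, **settings):
    """Run `cmd` (a list of arguments) with the overrides applied."""
    return subprocess.run(cmd, env={**os.environ, **sb_env(**settings)})


# Only look at the optimized dump, without passing the result to hw:
print(sb_env(use_new_bytecode=False, dump_shaders=True))
```

For example, run_with_sb(["glxgears"], enable=False) would run the
program with the new backend disabled, for comparison.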

The produced shader code is not ideal - e.g. you may notice unnecessary
MOVs inserted before DOT4 instructions. It's a known issue and I'm
going to look into it, though this may require reworking the
regalloc/scheduler. I had to sacrifice some features to make it work
correctly with Heaven first, so that I can now try to improve it while
being able to test for regressions.

There are probably also some issues with the cleanliness of the code - I
had to rework some parts a few times while fixing all the problems, so
there may be unused code and other remnants of the previous versions.
Anyway, I still consider it a work in progress and some things are
going to be reworked.

I'm not sure what the future of this branch will be, taking into
account that we also have an actively developed LLVM backend that is
required for OpenCL anyway. Your opinions are welcome.

Vadim