[Mesa-dev] r600g: status of my work on the shader optimization

Sat Feb 16 10:23:49 PST 2013

On 02/16/2013 11:10 AM, Mathias wrote:
>
> Hi,
>
> On Friday, February 15, 2013 15:00:24 Vadim Girlin wrote:
>> "LLVM backend is the future" is a pretty abstract argument. I prefer to
>> operate with real facts. After a year of LLVM backend development what
>> are the real benefits for the users? What are the real use cases where
>> the users might prefer LLVM backend? To me this situation looks like the
>> use of LLVM requires a lot more time and development efforts than the
>> custom solution, despite the initial expectations. Maybe you are right
>> and the LLVM backend will become the best alternative for users sometime
>> in the future, but I only have some today's results:
>
> I am curious how this compares for shaders like those used in
>
> git clone
> git://anarchy.freedesktop.org/~frohlich/PrecomputedAtmosphericScattering.git
>
> which is one of the bigger programs I know of. That one is taken from a paper
> from INRIA regarding atmospheric scattering. The git archive is just some
> stripped-down variants of that to make it at least display with our radeon type
> drivers. You need float textures enabled in the configure step. The only
> variants that have a chance to run on the oss drivers are Main.noprecompute and
> Main.nogeometry once you have compiled these dirty proof-of-concept files.
>
> That said, the shaders there are far from optimally programmed IMO. But they
> render fast on the binary blobs. And I think they are a good example of what you
> will find in the wild. People out there expect to find a decent compiler even in
> an OpenGL driver. I mean a compiler that does not just translate a tight
> pre-optimized shader into the appropriate backend language, but a compiler that
> knows about all the tricks that are required to optimize more complex and less
> well written programs. Writing good optimizers - not only the backend ones - is
> a hard and lengthy business which is easily underestimated.
>

(@Mathias, sorry, I accidentally sent the unfinished draft of this reply
as a private mail; see the added results with your program at the end)

The problem is that LLVM knows nothing about optimization for the r600
architecture. Yes, there are tricks, but these tricks are often
hardware-specific. Many tricks that are useful for other architectures
are useless for r600, or even make the result worse. Many tricks that
are useful for r600 are not implemented in LLVM. You have to implement
them as a custom pass anyway, working with a not very convenient code
representation. The alternative is to simply implement what you need in
any way you like, without dealing with LLVM.

Also, even the most common tricks often don't work as expected without 
additional work.

Let me show a small example. Here is a short test shader:

uniform int a;
uniform float b;
uniform float c;
void main()
{
	vec4 t = vec4(1.0);
	float q;

	for (int k = 0; k < a; ++k) {
		q = q + 0.1;
		t.x += q * sin(b * 3.0 * c + 10.0);
		t.y *= sin(b * 3.0) * cos(b * 4.0);
	}
	gl_FragColor = t;
}

It's far from optimal, because most of the expressions inside the loop
do not depend on the loop variable, so we can expect some smart trick
from LLVM, e.g. that it will move all loop-invariant expressions out of
the loop.

Here is the code produced by the LLVM backend:

> --------------------------------------------------------------
> bytecode 80 dw -- 3 gprs -- 4 stack entries -------
> shader 11 -- E
> 0000 4000000A A0140000  ALU 6 @20 KC0[CB0:0-16]
>  0020 000000F9 00000C90     1      MOV                      R0.x,  1.0
>  0022 000000F9 20000C90            MOV                      R0.y,  1.0
>  0024 000000F8 40000C90            MOV                      R0.z,  0
>  0026 000000F8 60000C90            MOV                      R0.w,  0
>  0028 801FA081 40400110            MUL_IEEE                 R2.z,  KC0[1].x, [0x40800000 4.000000]
>  0030 40800000                                               4.000000 (1082130432)
> 0002 00000008 81800000  LOOP_START_DX10 @16
> 0004 40000010 A4040000  ALU_PUSH_BEFORE 2 @32 KC0[CB0:0-16]
>  0032 81800082 00201D90     2      SETGT_INT                R1.x,  KC0[2].x, R0.w
>  0034 801F00FE 00002104     3 M    PRED_SETE_INT            __.x,  PV.x, 0
> 0006 00000006 82800001  JUMP @12 POP:1
> 0008 00000007 82400000  LOOP_BREAK @14
> 0010 00000006 83800001  POP @12 POP:1
> 0012 40000012 A0400000  ALU 17 @36 KC0[CB0:0-16]
>  0036 001FA081 00200110     4      MUL_IEEE                 R1.x,  KC0[1].x, [0x40400000 3.000000]
>  0038 000004FD 20200C90            MOV                      R1.y,  [0x3E22F983 0.159155]
>  0040 011FA800 40000010            ADD                      R0.z,  R0.z, [0x3DCCCCCD 0.100000]
>  0042 801F4C00 60001A10            ADD_INT                  R0.w,  R0.w, 1
>  0044 40400000                                               3.000000 (1077936128)
>  0045 3E22F983                                               0.159155 (1042479491)
>  0046 3DCCCCCD                                               0.100000 (1036831949)
>  0048 001000FE 204300FD     5      MULADD_IEEE              R2.y,  PV.x, KC0[0].x, [0x41200000 10.000000]
>  0050 001FC4FE 40200090            MUL                      R1.z,  PV.y, PV.x
>  0052 810044FE 60200090            MUL                      R1.w,  PV.y, R2.z
>  0054 41200000                                               10.000000 (1092616192)
>  0056 009FC401 00200090     6      MUL                      R1.x,  R1.y, PV.y
>  0058 800008FE 20204690            SIN                      R1.y,  PV.z
>  0060 80000C01 40204710     7      COS                      R1.z,  R1.w
>  0062 001FE401 20200110     8      MUL_IEEE                 R1.y,  R1.y, PS
>  0064 80000001 00204690            SIN                      R1.x,  R1.x
>  0066 009FC000 00000110     9      MUL_IEEE                 R0.x,  R0.x, PV.y
>  0068 801FE800 20030400            MULADD_IEEE              R0.y,  R0.z, PS, R0.y
> 0014 00000002 81400000  LOOP_END @4
> 0016 00000023 A0100000  ALU 5 @70
>  0070 800000F9 00200C90    10      MOV                      R1.x,  1.0
>  0072 00000400 80200C90    11      MOV_sat                  R1.x,  R0.y
>  0074 00000000 A0200C90            MOV_sat                  R1.y,  R0.x
>  0076 800000FE C0200C90            MOV_sat                  R1.z,  PV.x
>  0078 800008FE 60200C90    12      MOV                      R1.w,  PV.z
> 0018 C0008000 95200688  EXPORT_DONE        PIXEL 0     R1.xyzw      ES:3 EOP
> --------------------------------------

I'm not sure if you are familiar with the r600 ISA, but I can explain:
LLVM moved the computation of the expression "b * 4.0" out of the loop;
everything else is left inside the loop. It's obviously possible to
also move "sin(b * 3.0) * cos(b * 4.0)" and "sin(b * 3.0 * c + 10.0)",
but LLVM missed this opportunity.
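
To make the missed opportunity concrete, here is a minimal Python sketch
of the same loop, written once as in the test shader and once with the
invariant expressions hoisted by hand. It is only an illustration, not
code from either backend: the function names are made up, only t.x and
t.y are modelled, and q is assumed to start at 0.0 (the shader leaves it
uninitialized, but both dumps initialize the corresponding register to 0).

```python
import math

def shader_naive(a, b, c):
    """Direct transcription of the test shader's loop (q assumed to start at 0.0)."""
    tx, ty, q = 1.0, 1.0, 0.0
    for _ in range(a):
        q = q + 0.1
        tx += q * math.sin(b * 3.0 * c + 10.0)
        ty *= math.sin(b * 3.0) * math.cos(b * 4.0)
    return tx, ty

def shader_hoisted(a, b, c):
    """Same loop with the loop-invariant expressions computed once up front."""
    s = math.sin(b * 3.0 * c + 10.0)           # depends only on the uniforms b and c
    m = math.sin(b * 3.0) * math.cos(b * 4.0)  # depends only on the uniform b
    tx, ty, q = 1.0, 1.0, 0.0
    for _ in range(a):
        q = q + 0.1
        tx += q * s   # one multiply-add per iteration
        ty *= m       # one multiply per iteration
    return tx, ty
```

Both functions return the same values for the same inputs; the second
one simply leaves only an add, a multiply-add and a multiply inside the
loop, with no SIN or COS.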

Here is what my branch does with the code above:

> ===== SHADER_START ================================== PS/JUNIPER/EVERGREEN =====
> ===== 72 dw ===== 2 gprs ===== 2 stack =========================================
> 0000  4000000a a0400000 ALU 17 @20 KC0[CB0:0-16]
>  0020  001fa081 0fa00110     1      MUL_IEEE              T1.x,  KC0[1].x, [0x40400000 3].x
>  0022  809fa081 2f800110            MUL_IEEE              T0.y,  KC0[1].x, [0x40800000 4].y
>  0024  40400000
>  0025  40800000
>  0026  0010007d 0f8300fd     2      MULADD_IEEE           T0.x,  T1.x, KC0[0].x, [0x41200000 10].x
>  0028  808f84fd 2f800090            MUL                   T0.y,  [0x3e22f983 0.159155].y, T0.y
>  0030  41200000
>  0031  3e22f983
>  0032  000f80fd 0f800090     3      MUL                   T0.x,  [0x3e22f983 0.159155].x, T0.x
>  0034  000fa0fd 4f840090            MUL                   T0.z,  [0x3e22f983 0.159155].x, T1.x  BS:1   VEC_021
>  0036  8000047c 2f804710            COS                   T0.y,  T0.y
>  0038  3e22f983
>  0040  000000f9 00000c90     4      MOV                   R0.x,  1.0
>  0042  8000087c 4f804690            SIN                   T0.z,  T0.z
>  0044  008f887c 00200110     5      MUL_IEEE              R1.x,  T0.z, T0.y
>  0046  000000f9 20000c90            MOV                   R0.y,  1.0
>  0048  000000f8 40000c90            MOV                   R0.z,  0
>  0050  000000f8 60000c90            MOV                   R0.w,  0
>  0052  8000007c 20204690            SIN                   R1.y,  T0.x
> 0002  00000008 81800000 LOOP_START_DX10 @16
> 0004  4000001b a4040000 ALU_PUSH_BEFORE 2 @54 KC0[CB0:0-16]
>  0054  81800082 4f801d90     6      SETGT_INT             T0.z,  KC0[2].x, R0.w
>  0056  801f087c 00002104     7 M    PRED_SETE_INT         __.x,  T0.z, 0
> 0006  00000006 82800001 JUMP @12 POP:1
> 0008  00000007 82400000 LOOP_BREAK @14
> 0010  00000000 83800001 POP @0 POP:1
> 0012  0000001d a0100000 ALU 5 @58
>  0058  801fa800 40000010     8      ADD                   R0.z,  R0.z, [0x3dcccccd 0.1].x
>  0060  3dcccccd
>  0062  00802800 00030000     9      MULADD_IEEE           R0.x,  R0.z, R1.y, R0.x
>  0064  00002400 20000110            MUL_IEEE              R0.y,  R0.y, R1.x
>  0066  801f4c00 60001a10            ADD_INT               R0.w,  R0.w, 1
> 0014  00000002 81400000 LOOP_END @4
> 0016  00000022 a0040000 ALU 2 @68
>  0068  00000000 80000c90    10      MOV_sat               R0.x,  R0.x
>  0070  80000400 a0000c90            MOV_sat               R0.y,  R0.y
> 0018  c0000000 95200b48 EXPORT_DONE        PIXEL 0    R0.xy11  EOP
> ===== SHADER_END ===============================================================

Most computations are now done before the loop, including the expensive
SIN and COS operations. The main loop body now consists of 2 simple VLIW
instructions (additions and multiplications) instead of the 6 generated
by LLVM.

As you can see, LLVM is not a magic tool that does all the tricks without
additional effort.

Why wasn't LLVM able to do the same? I don't know. Of course, I could
spend some time investigating this, reading the LLVM code to figure out
the reason, and probably fix it somehow, maybe by adding a custom
implementation. But I prefer to implement some simple algorithm that
works as I expect from the beginning and forget about it, instead of
investigating every case like the example above. If there is some
special trick in LLVM that is missing in my branch, it's easier for me
to implement the same trick using a simpler algorithm than to try to
make it work the way I want with LLVM. If you have to spend your time
on this anyway, then what is the benefit of LLVM?

There is another problem in the LLVM output above: the sequence of MOV
instructions after the loop.

E.g. if you have the following code:

MOV R1.x, 5
MOV R1.y, 5

LLVM "optimizes" it to:

MOV R1.x, 5
MOV R1.y, R1.x

The problem is that the original variant can be executed on r600 as a
single VLIW instruction in a single cycle, while the second variant
introduces a data dependency between the instructions, so they now have
to be executed sequentially, requiring two cycles. Why does LLVM do
that? Just because it's good for some other architecture. LLVM doesn't
know that in the original code the instructions can be executed in
parallel. So now you have to spend your time finding a way to disable
this "optimization" that only makes the code worse for r600. To me it
looks like a waste of time.
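
The effect of the extra dependency can be shown with a toy model of VLIW
packing. The Python sketch below is an assumption-laden illustration, not
the real r600 scheduler: count_bundles is a made-up helper, instructions
are just (destination, sources) pairs, and a bundle is assumed to hold up
to 5 operations. Slot and channel restrictions are ignored.

```python
def count_bundles(instrs, width=5):
    """Greedily pack (dst, srcs) instructions, in order, into VLIW bundles.

    An instruction cannot join the current bundle if it reads a register
    written earlier in that same bundle, or if the bundle is already full.
    """
    bundles = 0
    written = set()  # registers written by the bundle being filled
    used = 0         # slots used in the bundle being filled
    for dst, srcs in instrs:
        if used == 0 or used == width or any(s in written for s in srcs):
            bundles += 1            # start a new bundle
            written, used = set(), 0
        written.add(dst)
        used += 1
    return bundles

# Original variant: two independent moves of a literal.
independent = [("R1.x", []), ("R1.y", [])]
# "Optimized" variant: the second move reads the result of the first.
dependent = [("R1.x", []), ("R1.y", ["R1.x"])]
```

In this model count_bundles(independent) is 1 and
count_bundles(dependent) is 2, which matches the one-cycle versus
two-cycle difference described above.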

I'm not saying that LLVM is completely useless. I'm sure it's very
useful for other (more conventional) architectures, or when you also
need some additional standalone tools and can reuse a lot of existing
code. It's just not very helpful in this particular case: compilation
of GL shaders for the r600 architecture.

Regarding your program, using the LIBGL_SHOW_FPS variable to compare FPS 
with Main.noprecompute gives me the following results on my HD5750:

R600_LLVM=0 R600_SB=0 : FPS = 144.1
R600_LLVM=1 R600_SB=0 : FPS = 288.1
R600_LLVM=1 R600_SB=1 : FPS = 518.4
R600_LLVM=0 R600_SB=1 : FPS = 527.2
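
For easier comparison, the numbers above can be turned into speedups
relative to the default backend without either optimizer. This is just
arithmetic on the four FPS values listed here (single measurements, so
small differences should not be over-interpreted); the dictionary layout
and the speedup helper are made up for illustration.

```python
# FPS values copied from the table above, keyed by (R600_LLVM, R600_SB).
fps = {
    (0, 0): 144.1,
    (1, 0): 288.1,
    (1, 1): 518.4,
    (0, 1): 527.2,
}

baseline = fps[(0, 0)]

def speedup(llvm, sb):
    """FPS of a configuration divided by the R600_LLVM=0 R600_SB=0 baseline."""
    return fps[(llvm, sb)] / baseline

for (llvm, sb) in fps:
    print(f"R600_LLVM={llvm} R600_SB={sb} : {speedup(llvm, sb):.2f}x")
```

That works out to roughly 2.0x for LLVM alone, about 3.6x for LLVM with
SB, and about 3.7x for SB alone.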

Vadim

> In the long term I would vote for making your knowledge about these machines
> available in llvm to get the best of both worlds.
>
> my 2 cents...
>
> Mathias