[Mesa-dev] r600g: status of my work on the shader optimization
Vadim Girlin
vadimgirlin at gmail.com
Sat Feb 16 10:23:49 PST 2013
On 02/16/2013 11:10 AM, Mathias Fröhlich wrote:
>
> Hi,
>
> On Friday, February 15, 2013 15:00:24 Vadim Girlin wrote:
>> "LLVM backend is the future" is a pretty abstract argument. I prefer to
>> operate with real facts. After a year of LLVM backend development what
>> are the real benefits for the users? What are the real use cases where
>> the users might prefer LLVM backend? To me this situation looks like the
>> use of LLVM requires a lot more time and development effort than the
>> custom solution, despite the initial expectations. Maybe you are right
>> and the LLVM backend will become the best alternative for users sometime
>> in the future, but for now I only have today's results:
>
> I am curious how this compares for shaders like used in
>
> git clone
> git://anarchy.freedesktop.org/~frohlich/PrecomputedAtmosphericScattering.git
>
> which is one of the bigger programs I know of. That one is taken from a paper
> from INRIA regarding atmospheric scattering. The git archive is just some
> stripped-down variants of that to make it at least display with our radeon type
> drivers. You need float textures enabled in the configure step. The only
> variants that have a chance to run on the oss drivers are Main.noprecompute and
> Main.nogeometry once you have compiled these dirty proof-of-concept files.
>
> That said, the shaders there are far from optimally programmed IMO. But they
> render fast on the binary blobs. And I think they are a good example of what you
> will find in the wild. People out there expect to find a decent compiler even in
> an OpenGL driver. I mean a compiler that does not just translate a tight
> pre-optimized shader into the appropriate backend language, but a compiler that
> knows about all the tricks that are required to optimize more complex and less
> well written programs. Writing good optimizers - not only the backend ones - is
> a hard and lengthy business which is easily underestimated.
>
(@Mathias, sorry, I accidentally sent the unfinished draft of this reply
as a private mail; see the added results with your program at the end.)
The problem is that LLVM knows nothing about optimization for the r600
architecture. Yes, there are tricks, but these tricks are often
hardware-specific. Many tricks that are useful for other architectures
are useless for r600, or even make the result worse. Many tricks that
are useful for r600 are not implemented in LLVM. You have to implement
them as custom passes anyway, working with a code representation that is
not very convenient. The other way is to simply implement what you need
in any way you like, without dealing with LLVM at all.
Also, even the most common tricks often don't work as expected without
additional work.
Let me show a small example. Here is a short test shader:
    uniform int a;
    uniform float b;
    uniform float c;

    void main()
    {
        vec4 t = vec4(1.0);
        float q;
        for (int k = 0; k < a; ++k) {
            q = q + 0.1;
            t.x += q * sin(b * 3.0 * c + 10.0);
            t.y *= sin(b * 3.0) * cos(b * 4.0);
        }
        gl_FragColor = t;
    }
It's far from optimal, because most of the expressions inside the loop
do not depend on the loop variable, so we can expect some smart trick
from LLVM, e.g. that it will move all loop-invariant expressions out of
the loop.
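
To make the expectation concrete, here is a small Python model of the
loop (not shader code, and not what either backend literally does). The
function names are made up for illustration, q is assumed to start at
0.0 because the shader leaves it uninitialized, and the final clamp on
gl_FragColor is not modeled. The second variant hoists the two
loop-invariant expressions by hand:

```python
import math

def shade_naive(a, b, c):
    """Direct transcription of the loop: sin/cos are recomputed every iteration."""
    tx, ty = 1.0, 1.0
    q = 0.0  # assumption: q is uninitialized in the shader, 0.0 is used here
    for _ in range(a):
        q = q + 0.1
        tx += q * math.sin(b * 3.0 * c + 10.0)
        ty *= math.sin(b * 3.0) * math.cos(b * 4.0)
    return tx, ty

def shade_hoisted(a, b, c):
    """Same computation with the loop-invariant expressions moved out of the loop."""
    s1 = math.sin(b * 3.0 * c + 10.0)            # depends only on uniforms b and c
    s2 = math.sin(b * 3.0) * math.cos(b * 4.0)   # depends only on uniform b
    tx, ty = 1.0, 1.0
    q = 0.0  # same assumption as above
    for _ in range(a):
        q = q + 0.1
        tx += q * s1   # only an add and a multiply-add remain in the loop
        ty *= s2       # only a multiply remains in the loop
    return tx, ty
```

Both functions return the same values for any a, b and c; the hoisted
one just evaluates each sin/cos once instead of once per iteration.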
Here is the code produced by the LLVM backend:
> --------------------------------------------------------------
> bytecode 80 dw -- 3 gprs -- 4 stack entries -------
> shader 11 -- E
> 0000 4000000A A0140000 ALU 6 @20 KC0[CB0:0-16]
> 0020 000000F9 00000C90 1 MOV R0.x, 1.0
> 0022 000000F9 20000C90 MOV R0.y, 1.0
> 0024 000000F8 40000C90 MOV R0.z, 0
> 0026 000000F8 60000C90 MOV R0.w, 0
> 0028 801FA081 40400110 MUL_IEEE R2.z, KC0[1].x, [0x40800000 4.000000]
> 0030 40800000 4.000000 (1082130432)
> 0002 00000008 81800000 LOOP_START_DX10 @16
> 0004 40000010 A4040000 ALU_PUSH_BEFORE 2 @32 KC0[CB0:0-16]
> 0032 81800082 00201D90 2 SETGT_INT R1.x, KC0[2].x, R0.w
> 0034 801F00FE 00002104 3 M PRED_SETE_INT __.x, PV.x, 0
> 0006 00000006 82800001 JUMP @12 POP:1
> 0008 00000007 82400000 LOOP_BREAK @14
> 0010 00000006 83800001 POP @12 POP:1
> 0012 40000012 A0400000 ALU 17 @36 KC0[CB0:0-16]
> 0036 001FA081 00200110 4 MUL_IEEE R1.x, KC0[1].x, [0x40400000 3.000000]
> 0038 000004FD 20200C90 MOV R1.y, [0x3E22F983 0.159155]
> 0040 011FA800 40000010 ADD R0.z, R0.z, [0x3DCCCCCD 0.100000]
> 0042 801F4C00 60001A10 ADD_INT R0.w, R0.w, 1
> 0044 40400000 3.000000 (1077936128)
> 0045 3E22F983 0.159155 (1042479491)
> 0046 3DCCCCCD 0.100000 (1036831949)
> 0048 001000FE 204300FD 5 MULADD_IEEE R2.y, PV.x, KC0[0].x, [0x41200000 10.000000]
> 0050 001FC4FE 40200090 MUL R1.z, PV.y, PV.x
> 0052 810044FE 60200090 MUL R1.w, PV.y, R2.z
> 0054 41200000 10.000000 (1092616192)
> 0056 009FC401 00200090 6 MUL R1.x, R1.y, PV.y
> 0058 800008FE 20204690 SIN R1.y, PV.z
> 0060 80000C01 40204710 7 COS R1.z, R1.w
> 0062 001FE401 20200110 8 MUL_IEEE R1.y, R1.y, PS
> 0064 80000001 00204690 SIN R1.x, R1.x
> 0066 009FC000 00000110 9 MUL_IEEE R0.x, R0.x, PV.y
> 0068 801FE800 20030400 MULADD_IEEE R0.y, R0.z, PS, R0.y
> 0014 00000002 81400000 LOOP_END @4
> 0016 00000023 A0100000 ALU 5 @70
> 0070 800000F9 00200C90 10 MOV R1.x, 1.0
> 0072 00000400 80200C90 11 MOV_sat R1.x, R0.y
> 0074 00000000 A0200C90 MOV_sat R1.y, R0.x
> 0076 800000FE C0200C90 MOV_sat R1.z, PV.x
> 0078 800008FE 60200C90 12 MOV R1.w, PV.z
> 0018 C0008000 95200688 EXPORT_DONE PIXEL 0 R1.xyzw ES:3 EOP
> --------------------------------------
I'm not sure if you are familiar with the r600 ISA, but I can explain:
LLVM moved only the computation of the expression "b * 4.0" out of the
loop; everything else is left inside. It's obviously possible to move
"sin(b * 3.0) * cos(b * 4.0)" and "sin(b * 3.0 * c + 10.0)" as well, but
LLVM missed this opportunity.
Here is what my branch does with the code above:
> ===== SHADER_START ================================== PS/JUNIPER/EVERGREEN =====
> ===== 72 dw ===== 2 gprs ===== 2 stack =========================================
> 0000 4000000a a0400000 ALU 17 @20 KC0[CB0:0-16]
> 0020 001fa081 0fa00110 1 MUL_IEEE T1.x, KC0[1].x, [0x40400000 3].x
> 0022 809fa081 2f800110 MUL_IEEE T0.y, KC0[1].x, [0x40800000 4].y
> 0024 40400000
> 0025 40800000
> 0026 0010007d 0f8300fd 2 MULADD_IEEE T0.x, T1.x, KC0[0].x, [0x41200000 10].x
> 0028 808f84fd 2f800090 MUL T0.y, [0x3e22f983 0.159155].y, T0.y
> 0030 41200000
> 0031 3e22f983
> 0032 000f80fd 0f800090 3 MUL T0.x, [0x3e22f983 0.159155].x, T0.x
> 0034 000fa0fd 4f840090 MUL T0.z, [0x3e22f983 0.159155].x, T1.x BS:1 VEC_021
> 0036 8000047c 2f804710 COS T0.y, T0.y
> 0038 3e22f983
> 0040 000000f9 00000c90 4 MOV R0.x, 1.0
> 0042 8000087c 4f804690 SIN T0.z, T0.z
> 0044 008f887c 00200110 5 MUL_IEEE R1.x, T0.z, T0.y
> 0046 000000f9 20000c90 MOV R0.y, 1.0
> 0048 000000f8 40000c90 MOV R0.z, 0
> 0050 000000f8 60000c90 MOV R0.w, 0
> 0052 8000007c 20204690 SIN R1.y, T0.x
> 0002 00000008 81800000 LOOP_START_DX10 @16
> 0004 4000001b a4040000 ALU_PUSH_BEFORE 2 @54 KC0[CB0:0-16]
> 0054 81800082 4f801d90 6 SETGT_INT T0.z, KC0[2].x, R0.w
> 0056 801f087c 00002104 7 M PRED_SETE_INT __.x, T0.z, 0
> 0006 00000006 82800001 JUMP @12 POP:1
> 0008 00000007 82400000 LOOP_BREAK @14
> 0010 00000000 83800001 POP @0 POP:1
> 0012 0000001d a0100000 ALU 5 @58
> 0058 801fa800 40000010 8 ADD R0.z, R0.z, [0x3dcccccd 0.1].x
> 0060 3dcccccd
> 0062 00802800 00030000 9 MULADD_IEEE R0.x, R0.z, R1.y, R0.x
> 0064 00002400 20000110 MUL_IEEE R0.y, R0.y, R1.x
> 0066 801f4c00 60001a10 ADD_INT R0.w, R0.w, 1
> 0014 00000002 81400000 LOOP_END @4
> 0016 00000022 a0040000 ALU 2 @68
> 0068 00000000 80000c90 10 MOV_sat R0.x, R0.x
> 0070 80000400 a0000c90 MOV_sat R0.y, R0.y
> 0018 c0000000 95200b48 EXPORT_DONE PIXEL 0 R0.xy11 EOP
> ===== SHADER_END ===============================================================
Most computations are now done before the loop, including the expensive
SIN and COS operations. The main loop body now consists of 2 simple VLIW
instructions (additions and multiplications) instead of the 6 generated
by LLVM.
As you can see, LLVM is not a magic tool that does all the tricks
without additional effort.
Why wasn't LLVM able to do the same? I don't know. Of course, I can
spend some time to investigate this, read the LLVM code, figure out the
reason, and probably fix it somehow, maybe by adding a custom
implementation. But I prefer to implement some simple algorithm that
works as I expect from the beginning and forget about it, instead of
investigating every case like the example above. If there is some
special trick in LLVM that is missing in my branch, it's easier for me
to implement the same trick using a simpler algorithm than to try to
make it work the way I want with LLVM. If you have to spend your time on
this anyway, then what's the benefit of LLVM?
There is another problem in the example above - the sequence of MOV
instructions after the loop.
E.g. if you have the following code:
    MOV R1.x, 5
    MOV R1.y, 5
LLVM "optimizes" it to:
    MOV R1.x, 5
    MOV R1.y, R1.x
The problem is that the original variant can be executed on r600 as a
single VLIW instruction in a single cycle, while the second variant
introduces a data dependency between the instructions, so now they have
to be executed sequentially, requiring two cycles. Why does LLVM do
that? Just because it's good for some other architecture. LLVM doesn't
know that in the original code the instructions can be executed in
parallel. So now you have to spend your time finding a way to disable
this "optimization" that only makes the code worse for r600. To me it
looks like a waste of time.
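
The cost of that dependency can be shown with a toy packing model in
Python. This is only a sketch under simplifying assumptions: count_groups
is a hypothetical helper, instructions are packed greedily in program
order, a group holds up to 5 instructions, and the real r600 constraints
(per-channel slots, the trans slot, literal and bank-swizzle limits) are
ignored. An instruction has to start a new group if it reads or writes a
register written earlier in the current group:

```python
def count_groups(instrs, slots=5):
    """Count VLIW groups for a list of (dest, sources) in program order.

    Toy model: an instruction joins the current group unless the group is
    full or it touches a register already written inside that group.
    """
    groups = 0
    written = set()  # registers written by the current group
    used = 0         # slots occupied in the current group
    for dest, srcs in instrs:
        conflict = dest in written or any(s in written for s in srcs)
        if groups == 0 or used == slots or conflict:
            groups += 1          # start a new group
            written = set()
            used = 0
        written.add(dest)
        used += 1
    return groups

# MOV R1.x, 5 ; MOV R1.y, 5  (no dependency between the two)
independent = [("R1.x", []), ("R1.y", [])]
# MOV R1.x, 5 ; MOV R1.y, R1.x  (second MOV reads the result of the first)
chained = [("R1.x", []), ("R1.y", ["R1.x"])]

print(count_groups(independent), count_groups(chained))
```

In this model the independent pair fits into one group, while the
chained pair needs two.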
I'm not saying that LLVM is completely useless - I'm sure it's very
useful for other (more conventional) architectures, or when you also
need some additional standalone tools and can reuse a lot of existing
code. It's just not very helpful in this particular case - compilation
of GL shaders for the r600 architecture.
Regarding your program, using the LIBGL_SHOW_FPS variable to compare FPS
with Main.noprecompute gives me the following results on my HD5750:
R600_LLVM=0 R600_SB=0 : FPS = 144.1
R600_LLVM=1 R600_SB=0 : FPS = 288.1
R600_LLVM=1 R600_SB=1 : FPS = 518.4
R600_LLVM=0 R600_SB=1 : FPS = 527.2
Vadim
> In the long term I would vote for your knowledge about these machines
> available in llvm to get the best of both worlds.
>
> my 2 cents...
>
> Mathias