[Mesa-dev] [PATCH 00/25] RadeonSI: 1 variant per shader & shader cache in memory

Mon Feb 15 23:59:11 UTC 2016

Hi,

This patch series implements a new compilation mode that compiles shaders to hw bytecode only once, on the assumption that any state-dependent code will be attached at the beginning or end of the bytecode to implement emulated features such as vertex buffer addressing, two-side color selection and interpolation, colorbuffer format conversions, alpha-test, etc. The attachable bytecode will be called the "prolog" and "epilog" shader parts, while the TGSI shader will be called the "main" part.

At the end, it adds a simple TGSI->bytecode shader cache that lives in memory.

1) Design points and differences from my XDC talk

Support for the old-style shaders compiled on demand (called "monolithic", because there is only one monolithic piece of bytecode) is kept. It can be enabled by an environment variable, and it is enabled automatically if LLVM is < 3.8.

Shaders keep their shader key, but now the shader key is used to generate the prolog and epilog parts.

The main part is compiled first. At draw time, the prolog and epilog, if they are needed, are compiled and all pieces of bytecode are combined. Ideally, we would only be doing the combining at draw time, because everything should be compiled already.

Prologs and epilogs don't use the LLVM assembler as was planned initially. They share most of the code with monolithic shaders, meaning that each is compiled as an LLVM IR module.

The driver keeps a global per-screen list of all compiled prologs and epilogs, because they are all reusable.
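
As a rough illustration of the flow described above (main part compiled once, prolog and epilog looked up in a shared list and compiled only on a miss, then everything combined at draw time), here is a minimal Python sketch. All names (PartList, combine, fake_compile) are hypothetical and are not taken from the patches. Plain byte concatenation stands in for whatever the driver really does when combining parts.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical key: (part kind, state bits taken from the shader key).
PartKey = Tuple[str, Tuple[int, ...]]


class PartList:
    """Shared store of compiled prolog/epilog parts (illustrative only)."""

    def __init__(self, compile_part: Callable[[PartKey], bytes]) -> None:
        self._compile_part = compile_part
        self._parts: Dict[PartKey, bytes] = {}
        self.compile_count = 0

    def get(self, key: PartKey) -> bytes:
        # Reuse an existing part; compile only when the key is new.
        part = self._parts.get(key)
        if part is None:
            part = self._compile_part(key)
            self.compile_count += 1
            self._parts[key] = part
        return part


def combine(main: bytes, prolog: Optional[bytes], epilog: Optional[bytes]) -> bytes:
    # Assumption: combining is modeled as concatenation, nothing more.
    return (prolog or b"") + main + (epilog or b"")


def fake_compile(key: PartKey) -> bytes:
    # Placeholder for real compilation of a prolog or epilog.
    kind, bits = key
    return f"<{kind}:{','.join(map(str, bits))}>".encode()


parts = PartList(fake_compile)
main_ps = b"<main>"  # compiled once, at shader creation
key_prolog = ("ps_prolog", (1, 0))
key_epilog = ("ps_epilog", (0, 1))

# Two draws with the same state: the second one only does the combining.
first = combine(main_ps, parts.get(key_prolog), parts.get(key_epilog))
second = combine(main_ps, parts.get(key_prolog), parts.get(key_epilog))
print(first, parts.compile_count)
```

In this model the second draw triggers no compilation at all, which is the ideal case mentioned above.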

If prolog and epilog compilation turns out to be too slow, we can precompile some of them with llc at Mesa compile time. I don't think this will be needed though.

VS and TES main parts are always compiled as hardware VS at shader creation. Hardware LS and ES stages are always compiled as monolithic shaders on demand later due to the lack of games using those.

2) Shader parts

VS prolog:
- vertex buffer address calculations based on instance divisors

VS epilog (hw VS only: VS & TES):
- primitive ID export if PS needs it
- in the future: ignore ClipVertex and ClipDistance outputs if clipping is disabled

TCS epilog:
- pack tessellation factors based on the TES primitive type

PS prolog:
- two-side color selection and interpolation
- forcing per-sample interpolation
- polygon stippling
- in the future: support BC_OPTIMIZE better, use interp_mov for flatshaded colors

PS epilog:
- alpha-test, alpha-to-one, smoothing, clamping, gl_FragColor broadcast
- color format conversions

3) Performance implications

There is increased VGPR usage because pixel shaders that used to use 4-12 VGPRs now always use 16 or even 20. This is not enough to affect the wave count though.

There is slightly higher register usage because some SGPRs and VGPRs have to be passed from the prolog through the main part to the epilog, so the main part has fewer of them. This results in higher SGPR spilling, although that should be entirely fixable in the LLVM backend.

Relevant shader-db stats for the default scheduler:

Code Size: 11091656 -> 11219948 (1.16 %) bytes
Scratch: 1732608 -> 2246656 (29.67 %) (SGPR spilling)
Max Waves: 78063 -> 77352 (-0.91 %)

Relevant shader-db stats for the SI scheduler:

Code Size: 11433068 -> 11535452 (0.90 %) bytes
Scratch: 509952 -> 522240 (2.41 %) (SGPR spilling)
Max Waves: 79456 -> 78217 (-1.56 %)
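
The percentages above are plain relative changes, (after - before) / before. The short Python sketch below recomputes them from the raw numbers in this mail; the helper name percent_change is made up for the example.

```python
def percent_change(before: int, after: int) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Raw numbers copied from the shader-db stats in this mail.
stats = {
    "default": {
        "Code Size": (11091656, 11219948),
        "Scratch": (1732608, 2246656),
        "Max Waves": (78063, 77352),
    },
    "SI": {
        "Code Size": (11433068, 11535452),
        "Scratch": (509952, 522240),
        "Max Waves": (79456, 78217),
    },
}

for scheduler, rows in stats.items():
    for name, (before, after) in rows.items():
        print(f"{scheduler:8s} {name:10s} {percent_change(before, after):+.2f} %")
```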

Neither the code size nor the wave count changed much. It looks like compiling optimized monolithic shaders in another thread won't make much difference.

No benchmarks have been run.

4) RadeonSI shader cache in memory

The goal is to skip shader compilation for TGSI shaders that have already been compiled by the same process before. This is not a real shader cache of the kind that proprietary drivers implement. The binaries are not stored on the disk. The motivations are:
- Apps mix and match their vertex and pixel shaders to produce many combinations of linked GLSL shader programs. E.g. if one VS is matched with 20 pixel shaders, we don't want to compile that VS 20 times. This does appear to happen a lot with UE3.
- If apps unload and reload shaders, this effectively makes the reload free for the radeonsi driver. (not so much for st/mesa)
- Gallium likes to use the same blit & pass-through shaders in several places.

This only caches the main shader parts (VS as VS, TCS, TES as VS, PS). Monolithic shaders including LS & ES and also GS are not cached.
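
To make the idea concrete, here is a minimal Python sketch of an in-memory cache that maps a TGSI shader to its compiled main part. It is only a model: the class and function names are hypothetical, and using the serialized TGSI bytes directly as the dictionary key is an assumption made for the example, not a description of how the patches build their key.

```python
from typing import Callable, Dict


class InMemoryShaderCache:
    """Maps serialized TGSI to a compiled main part (illustrative only)."""

    def __init__(self, compile_main: Callable[[bytes], bytes]) -> None:
        self._compile_main = compile_main
        self._entries: Dict[bytes, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compile(self, tgsi: bytes) -> bytes:
        binary = self._entries.get(tgsi)
        if binary is None:
            # First time this TGSI is seen in the process: compile and store.
            self.misses += 1
            binary = self._compile_main(tgsi)
            self._entries[tgsi] = binary
        else:
            self.hits += 1
        return binary


def fake_compile_main(tgsi: bytes) -> bytes:
    # Placeholder for the real TGSI -> bytecode compilation.
    return b"BIN(" + tgsi + b")"


cache = InMemoryShaderCache(fake_compile_main)
vs_tgsi = b"VERT (placeholder) END"

# One VS linked with 20 different pixel shaders.
for i in range(20):
    ps_tgsi = b"FRAG variant %d END" % i
    cache.get_or_compile(vs_tgsi)
    cache.get_or_compile(ps_tgsi)

print(cache.misses, cache.hits)
```

In this model the VS is compiled once and hit 19 times, which is the "one VS matched with 20 pixel shaders" case from the list above.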

5) Performance of the shader cache

The test is a short apitrace of Borderlands 2.

Without the cache:
GLSL link time = 18361 ms
Driver compile time = 14510 ms

With the cache:
GLSL link time = 12576 ms
Driver compile time = 8552 ms

This leaves a lot to be desired, but it was expected. The TGSI compilation takes 41% less time, which suggests that roughly 41% of all TGSI shaders are duplicates (assuming duplicates cost about as much to compile as the other shaders). On average, linking GLSL shader programs (including the TGSI compilation) takes 31.5% less time.
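
The two percentages can be rechecked from the timings above with a few lines of Python; reduction_percent is a name made up for the example.

```python
def reduction_percent(before_ms: int, after_ms: int) -> float:
    """How much less time is taken after the change, in percent."""
    return (before_ms - after_ms) / before_ms * 100.0


# Timings copied from the Borderlands 2 apitrace results in this mail.
compile_saving = reduction_percent(14510, 8552)  # driver compile time
link_saving = reduction_percent(18361, 12576)    # GLSL link time

print(f"Driver compile time: {compile_saving:.1f} % less")
print(f"GLSL link time:      {link_saving:.1f} % less")
```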

The compile times are still unacceptable and caching shaders on the disk appears to be a necessity. A radeonsi-only cache on the disk should be relatively easy with the current cache in memory, but 33% of the compilation time is not spent in radeonsi.

6) Piglit regressions

Since shaders are now always compiled all the way to the bytecode by glLinkProgram, this series uncovered a few glsl_to_tgsi bugs that create invalid TGSI shaders and trigger assertion failures in the driver.

Please review.

Marek