[Mesa-dev] [PATCH 0/5] nvc0: better instruction pipelining for Maxwell GPUs
samuel.pitoiset at gmail.com
Thu Dec 22 23:15:55 UTC 2016
This series makes use of the scheduling control code in order to improve the
instruction pipelining on Maxwell GPUs.
Starting with the Kepler architecture, where a control instruction has to be
inserted every 7 instructions, Maxwell added additional control codes and the
control instruction now has to be every 3 instructions. Maxwell control codes
are really powerful and well documented. By the way, I would like to thank
Scott Gray, who did awesome reverse-engineering work, although I had to
figure out the missing parts myself.
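As a rough illustration of the "one control instruction every N instructions"
layout described above, here is a minimal sketch. It assumes that every slot
(control or regular instruction) is one 64-bit word and that the control word
comes first in its group; the helper names are hypothetical and are not taken
from the series.

```cpp
#include <cassert>
#include <cstdint>

// Instructions covered by one control ("sched") word, as stated in the text.
constexpr uint32_t KEPLER_GROUP  = 7;
constexpr uint32_t MAXWELL_GROUP = 3;

// Hypothetical helper: index (in 64-bit words) of the n-th real instruction,
// ASSUMING each group of `group` instructions is preceded by its control word.
inline uint32_t insnWordIndex(uint32_t n, uint32_t group)
{
   assert(group > 0);
   return (n / group) * (group + 1) + 1 + (n % group);
}

// Hypothetical helper: number of control words needed for `count`
// instructions, one per (possibly partial) group, ignoring any padding.
inline uint32_t schedWordCount(uint32_t count, uint32_t group)
{
   assert(group > 0);
   return (count + group - 1) / group;
}
```

Under these assumptions, one word in four is a control word on Maxwell,
versus one in eight on Kepler.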
On Maxwell, control codes are mainly used for setting stall counts and for
producing/consuming dependency barriers in order to avoid hazards. I'm not
going to explain in detail how they work because the documentation is quite
good and because I added explanations here and there in the source code. But
the main thing to understand is that the previous control code used by
default (i.e. st 0x0) means "wait for all dependencies and stall the pipeline
for 15 cycles, which is the maximum".
Which is quite bad...
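To make the fields mentioned above a bit more concrete, here is a minimal
sketch of how one control code could be packed. The bit layout is an
assumption based on publicly available reverse-engineering notes, not
something stated in this mail: stall count in bits 0..3, yield flag in bit 4,
write barrier in bits 5..7, read barrier in bits 8..10, wait mask in bits
11..16, and three 21-bit codes per 64-bit control word. The struct and
function names are hypothetical, and the sketch does not try to model the
hardware semantics of any particular value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the per-instruction scheduling fields.
struct SchedCode {
   uint8_t stall;      // stall count, 0..15
   bool    yield;      // yield flag bit
   uint8_t wrBarrier;  // write dependency barrier index (assumed: 7 = none)
   uint8_t rdBarrier;  // read dependency barrier index (assumed: 7 = none)
   uint8_t waitMask;   // one bit per barrier to wait on (6 barriers assumed)
};

// Pack one control code into its low 21 bits (ASSUMED layout, see lead-in).
// The reuse flags (assumed to live in bits 17..20) are left at zero.
inline uint32_t packSched(const SchedCode &s)
{
   assert(s.stall <= 15 && s.wrBarrier <= 7 && s.rdBarrier <= 7);
   assert(s.waitMask <= 0x3f);
   return  uint32_t(s.stall)
        | (uint32_t(s.yield)     << 4)
        | (uint32_t(s.wrBarrier) << 5)
        | (uint32_t(s.rdBarrier) << 8)
        | (uint32_t(s.waitMask)  << 11);
}

// Combine three 21-bit control codes into one 64-bit control word
// (ASSUMED: code i occupies bits 21*i .. 21*i+20).
inline uint64_t packSchedWord(uint32_t c0, uint32_t c1, uint32_t c2)
{
   const uint32_t mask = (1u << 21) - 1;
   assert(c0 <= mask && c1 <= mask && c2 <= mask);
   return uint64_t(c0) | (uint64_t(c1) << 21) | (uint64_t(c2) << 42);
}
```

With this assumed layout, a code with no stall, no barriers produced and
nothing to wait on packs to 0x7e0.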
Now, let's have a look at the (impressive) performance improvements. :-)
I measured on a GeForce GTX 750 Ti (GM107) reclocked to the highest perf level,
with and without the control codes (NV50_PROG_SCHED=0/1).
app: number of FPS without -> number of FPS with (+gain%)
FurMark: 13 -> 42 (+223%)
Pixmark Piano: 2 -> 7 (+250%)
Pixmark Volplosion: 6 -> 20 (+233%)
Julia F32: 61 -> 219 (+259%)
LightMarks: 352 -> 685 (+94%)
Heaven (low): 51 -> 102 (+100%)
Heaven (ultra): 14 -> 27 (+93%)
Valley (low): 30 -> 68 (+126%)
Valley (ultra): 18 -> 39 (+116%)
Talos (low): 32 -> 50 (+56%)
Talos (ultra): 7 -> 14 (+100%)
Shadow of Mordor (lowest): 13 -> 20 (+53%)
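
For reference, the gain column can be recomputed from the two FPS figures.
A minimal sketch, assuming the percentage is the relative increase truncated
to an integer (which matches most rows above); the helper name is
hypothetical:

```cpp
#include <cassert>

// Hypothetical helper: relative FPS gain in percent, truncated toward zero.
inline int gainPercent(int before, int after)
{
   assert(before > 0);
   return (after - before) * 100 / before;
}
```

For example, gainPercent(13, 42) gives 223 for the FurMark row.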
That's it! I think it's enough to understand the power of Maxwell control
codes. We may get additional numbers from Phoronix (wink, wink, Michael).
As I said in the main patch, the control codes can be disabled with
Now, let's have a look at how Nouveau performs compared to NVIDIA's blob.
app: number of FPS with Nouveau -> number of FPS with the blob (+gain%)
FurMark: 42 -> 59 (+40%)
Pixmark Piano: 7 -> 13 (+85%)
Pixmark Volplosion: 20 -> 42 (+110%)
Julia F32: 219 -> 351 (+60%)
LightMarks: 685 -> 1192 (+74%)
Heaven (low): 102 -> 144 (+41%)
Heaven (ultra): 27 -> 46 (+70%)
Valley (low): 68 -> 94 (+38%)
Valley (ultra): 39 -> 60 (+53%)
Talos (low): 50 -> 128 (+156%)
Talos (ultra): 14 -> 30 (+114%)
Shadow of Mordor (lowest): 20 -> 77 (+285%)
Nouveau is still far away from the blob, but now I think Maxwell is actually
in roughly the same shape as Kepler in terms of performance and features.
Speaking about this, I will enable OpenGL 4.3 on Maxwell in a separate patch,
The compile-time overhead added by this series is rather small. For a full
shader-db run with my private repository of shaders, it takes approximately
208s to compile 25k shaders before the series and approximately 211s after.
That is less than 2% of overhead, and it's comparable to a full shader-db run
on Kepler.
No regressions with both piglit and dEQP (tested multiple times) and all
benchmarks/games I have tried render fine and seem to be quite stable.
Due to a lack of time, some parts are still left to do and some others could
be improved. With the following ideas implemented I'm pretty sure we can
improve performance significantly.
* Add support for the yield flag. This seems to be a hint to the hardware for
improving how the work is balanced between the warps. I didn't figure out
how and where to use it without breaking a bunch of things. Need time and
* Add support for dual-issue. The rules are pretty different from Kepler's,
especially because of the dependency barriers. Note that the yield flag has
to be set, otherwise the hardware won't dual-issue and in fact it will wait
for all dependencies (i.e. st 0x0), which is really different from what you
are looking for.
* Reduce stall counts. A bunch of instructions have a read latency which is the
number of cycles before they can actually read the sources. This should be
fairly easy to implement but will require some reverse engineering to
completely understand the idea.
This is my last contribution for the Nouveau driver for a while because I have
been hired by Valve to work on radeonsi. Do not expect such perf improvements
with radeonsi because it already performs really well, unlike Nouveau. But
with time and patience we can do better. :-)
This series is also available from my fdo account:
Samuel Pitoiset (5):
nv50/ir: do not insert texture barriers on gm107
nv50/ir: improve instruction pipelining on gm107
nv50/ir: use sched control codes for gm107 builtins
nvc0: use sched control codes for gm107 blitter shader
nvc0: use sched control codes for gm107 MP counters code
src/gallium/drivers/nouveau/codegen/lib/gm107.asm | 40 +-
.../drivers/nouveau/codegen/lib/gm107.asm.h | 40 +-
.../drivers/nouveau/codegen/nv50_ir_emit_gm107.cpp | 771 ++++++++++++++++++++-
.../nouveau/codegen/nv50_ir_lowering_nvc0.cpp | 3 +-
.../nouveau/codegen/nv50_ir_target_gm107.cpp | 253 +++++++
.../drivers/nouveau/codegen/nv50_ir_target_gm107.h | 7 +
.../drivers/nouveau/nvc0/nvc0_query_hw_sm.c | 88 +--
src/gallium/drivers/nouveau/nvc0/nvc0_surface.c | 20 +-
8 files changed, 1127 insertions(+), 95 deletions(-)