[Mesa-dev] [PATCH 0/5] nvc0: better instruction pipelining for Maxwell GPUs

Thu Dec 22 23:15:55 UTC 2016

Hello,

This series makes use of the scheduling control code in order to improve the
instruction pipelining on Maxwell GPUs.

Kepler introduced scheduling control codes and requires one control
instruction for every 7 instructions. Maxwell adds new control codes and now
requires one control instruction for every 3 instructions. Maxwell control
codes are really powerful and well documented [1]. By the way, I would like to
thank Scott Gray, who did awesome reverse engineering work, although I had to
figure out the missing parts myself.

On Maxwell, control codes are mainly used for setting stall counts and for
producing/consuming dependency barriers in order to avoid hazards. I'm not
going to explain in detail how they work because the documentation is quite
good and because I added explanations here and there in the source code. The
main thing to understand is that the control code previously used by default
(i.e. st 0x0) means "wait for all dependencies and stall the pipeline for 15
cycles, which is the maximum".
Which is quite bad...
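
As a rough illustration of what a control code carries, here is a minimal
sketch of how one per-instruction code and one 64-bit control word (covering
3 instructions) could be packed. The field widths and bit positions follow the
layout described in [1] and should be treated as an assumption, not as the
code from this series; SchedCode, packSched and packControlWord are
hypothetical names used only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for one control code; layout assumed from [1].
struct SchedCode {
   uint8_t stall;     // 4 bits: stall count in cycles (0..15)
   bool    yield;     // 1 bit:  yield hint flag
   uint8_t wrBarrier; // 3 bits: barrier signaled on write, 7 = none
   uint8_t rdBarrier; // 3 bits: barrier signaled on read, 7 = none
   uint8_t waitMask;  // 6 bits: one bit per barrier to wait on
   uint8_t reuse;     // 4 bits: register reuse flags
};

// Pack the fields into a single 21-bit control code.
inline uint32_t packSched(const SchedCode &c)
{
   assert(c.stall <= 0xf && c.wrBarrier <= 0x7 && c.rdBarrier <= 0x7);
   assert(c.waitMask <= 0x3f && c.reuse <= 0xf);
   return   uint32_t(c.stall)
         | (uint32_t(c.yield)     << 4)
         | (uint32_t(c.wrBarrier) << 5)
         | (uint32_t(c.rdBarrier) << 8)
         | (uint32_t(c.waitMask)  << 11)
         | (uint32_t(c.reuse)     << 17);
}

// Combine the control codes of 3 consecutive instructions into the 64-bit
// control word that precedes them (3 x 21 bits, top bit unused).
inline uint64_t packControlWord(uint32_t s0, uint32_t s1, uint32_t s2)
{
   const uint64_t mask = 0x1fffff;
   return (s0 & mask) | ((s1 & mask) << 21) | ((s2 & mask) << 42);
}
```

Under this assumed layout, a code with a stall count of 1 and no barriers set
or awaited packs to 0x7e1, because the value 7 in both barrier fields means
"no barrier".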

Now, let's have a look at the (impressive) performance improvements. :-)
I measured on a GeForce GTX 750 Ti (GM107) reclocked to the highest perf level,
with and without the control codes (NV50_PROG_SCHED=0/1).

app: number of FPS without -> number of FPS with (+gain%)

FurMark:                   13  ->  42  (+223%)
Pixmark Piano:             2   ->  7   (+250%)
Pixmark Volplosion:        6   ->  20  (+233%)
Julia F32:                 61  ->  219 (+259%)
LightMarks:                352 ->  685 (+94%)
Heaven (low):              51  ->  102 (+100%)
Heaven (ultra):            14  ->  27  (+93%)
Valley (low):              30  ->  68  (+126%)
Valley (ultra):            18  ->  39  (+116%)
Talos (low):               32  ->  50  (+56%)
Talos (ultra):             7   ->  14  (+100%)
Shadow of Mordor (lowest): 13  ->  20  (+53%)
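
For reference, the gain column is the relative FPS increase over the run
without control codes. A minimal sketch of that arithmetic follows; the
function name gainPercent is hypothetical, and truncating the result toward
zero is an assumption about how the figures were rounded.

```cpp
#include <cassert>

// Relative FPS gain in percent, truncated toward zero (assumed rounding).
inline int gainPercent(int fpsWithout, int fpsWith)
{
   assert(fpsWithout > 0);
   return (fpsWith - fpsWithout) * 100 / fpsWithout;
}
```

For example, FurMark going from 13 to 42 FPS gives (42 - 13) * 100 / 13 = 223.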

That's it! I think it's enough to understand the power of Maxwell control
codes. We may get additional numbers from Phoronix (wink, wink, Michael).
As I said in the main patch, the control codes can be disabled with
'export NV50_PROG_SCHED=0'.

Now, let's have a look at how Nouveau (with control codes) performs compared
to NVIDIA's blob (FPS with Nouveau -> FPS with the blob).

FurMark:                   42  ->  59   (+40%)
Pixmark Piano:             7   ->  13   (+85%)
Pixmark Volplosion:        20  ->  42   (+110%)
Julia F32:                 219 ->  351  (+60%)
LightMarks:                685 ->  1192 (+74%)
Heaven (low):              102 ->  144  (+41%)
Heaven (ultra):            27  ->  46   (+70%)
Valley (low):              68  ->  94   (+38%)
Valley (ultra):            39  ->  60   (+53%)
Talos (low):               50  ->  128  (+156%)
Talos (ultra):             14  ->  30   (+114%)
Shadow of Mordor (lowest): 20  ->  77   (+285%)

Nouveau is still far away from the blob, but now I think Maxwell is actually 
in roughly the same shape as Kepler in terms of performance and features.
Speaking about this, I will enable OpenGL 4.3 on Maxwell in a separate patch,
later on.

The compile-time overhead added by this series is rather small. A full
shader-db run with my private repository of shaders takes approximately 208s
to compile 25k shaders before the series and approximately 211s after. That is
less than 2% of overhead, and it's comparable to a full shader-db run on
Kepler.

No regressions with either piglit or dEQP (tested multiple times), and all
benchmarks/games I have tried render fine and seem to be quite stable.

Due to a lack of time, some parts are still left to do and some others could
be improved. With the following ideas implemented I'm pretty sure we can
improve performance significantly.

* Add support for the yield flag. This seems to be a hint to the hardware for
  improving how the work is balanced between the warps. I didn't figure out
  how and where to use it without breaking a bunch of things. Need time and
  patience.

* Add support for dual-issue. The rules are pretty different from Kepler's,
  especially because of the dependency barriers. Note that the yield flag has
  to be set, otherwise the hardware won't dual-issue and will in fact wait for
  all dependencies (i.e. st 0x0), which is really different from what you are
  looking for.

* Reduce stall counts. A bunch of instructions have a read latency which is the
  number of cycles before they can actually read the sources. This should be
  fairly easy to implement but will require some reverse engineering to
  completely understand the idea.

This is my last contribution to the Nouveau driver for a while because I have
been hired by Valve to work on radeonsi. Do not expect such perf improvements
with radeonsi because it already performs really well, unlike Nouveau. But
with time and patience we can do better. :-)

This series is also available from my fdo account:
https://cgit.freedesktop.org/~hakzsam/mesa/log/?h=gm107_scheduler

Please review!
Thanks.

[1] https://github.com/NervanaSystems/maxas/wiki/Control-Codes

Samuel Pitoiset (5):
  nv50/ir: do not insert texture barriers on gm107
  nv50/ir: improve instruction pipelining on gm107
  nv50/ir: use sched control codes for gm107 builtins
  nvc0: use sched control codes for gm107 blitter shader
  nvc0: use sched control codes for gm107 MP counters code

 src/gallium/drivers/nouveau/codegen/lib/gm107.asm  |  40 +-
 .../drivers/nouveau/codegen/lib/gm107.asm.h        |  40 +-
 .../drivers/nouveau/codegen/nv50_ir_emit_gm107.cpp | 771 ++++++++++++++++++++-
 .../nouveau/codegen/nv50_ir_lowering_nvc0.cpp      |   3 +-
 .../nouveau/codegen/nv50_ir_target_gm107.cpp       | 253 +++++++
 .../drivers/nouveau/codegen/nv50_ir_target_gm107.h |   7 +
 .../drivers/nouveau/nvc0/nvc0_query_hw_sm.c        |  88 +--
 src/gallium/drivers/nouveau/nvc0/nvc0_surface.c    |  20 +-
 8 files changed, 1127 insertions(+), 95 deletions(-)

-- 
2.11.0