[Mesa-dev] [PATCH 0/5] nvc0: better instruction pipelining for Maxwell GPUs
samuel.pitoiset at gmail.com
Thu Jan 12 14:26:34 UTC 2017
Just pushed the series before the branchpoint. :-)
If someone wants to run benchmarks, make sure to use Linux 4.10 with
pstate 0f. The sched control codes are enabled by default but they can
be disabled by setting NV50_PROG_SCHED=0 (for comparison purposes and
On 12/23/2016 12:15 AM, Samuel Pitoiset wrote:
> This series makes use of the scheduling control code in order to improve the
> instruction pipelining on Maxwell GPUs.
> Starting with the Kepler architecture, where a control instruction has to be
> inserted every 7 instructions, Maxwell added additional control codes and the
> control instruction now has to be every 3 instructions. Maxwell control codes
> are really powerful and well documented [1]. By the way, I would like to thank
> Scott Gray, who did awesome reverse engineering work, although I had to
> figure out the missing parts myself.
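To make the difference in control-instruction density concrete, here is a small sketch that counts how many control instructions a shader needs under each scheme. It only assumes what is stated above (one control instruction per 7 instructions on Kepler, one per 3 on Maxwell); the function name and the sample instruction count are made up for illustration.

```python
def control_words(num_insns: int, group_size: int) -> int:
    """Control instructions needed when one is required per `group_size` instructions."""
    if num_insns < 0 or group_size <= 0:
        raise ValueError("num_insns must be >= 0 and group_size > 0")
    # Integer ceiling division: a partially filled group still needs a control word.
    return -(-num_insns // group_size)

KEPLER_GROUP = 7   # from the text: one control instruction every 7 instructions
MAXWELL_GROUP = 3  # from the text: one control instruction every 3 instructions

n = 42  # hypothetical shader length
print(control_words(n, KEPLER_GROUP))   # 6
print(control_words(n, MAXWELL_GROUP))  # 14
```

So for the same shader, Maxwell carries more than twice as many control instructions, which is why leaving them at a pessimistic default is so costly.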
> On Maxwell, control codes are mainly used for setting the number of stall
> counts and for producing/consuming dependency barriers in order to avoid
> hazards. I'm not going to explain in detail how they work because the
> documentation is quite good and because I added explanations here and there
> in the source code. But the main thing to understand is that the previous
> control code used by default (i.e. st 0x0) means "wait for all dependencies
> and stall the pipeline for 15 cycles, which is the maximum".
> Which is quite bad...
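As a rough illustration of what a per-instruction control code carries, the sketch below packs the fields into an integer. The bit layout (4-bit stall count, 1-bit yield, 3-bit write barrier, 3-bit read barrier, 6-bit wait mask, 21 bits per instruction, three codes per control word) is an assumption based on my reading of the maxas Control-Codes wiki and should be checked against it and against nv50_ir_emit_gm107.cpp. The function names and the "7 means no barrier" convention are likewise assumptions, and the sketch does not model the special meaning of an all-zero code described above.

```python
NO_BARRIER = 7  # assumption: barrier index 7 means "no barrier is set"

def pack_sched(stall, yield_bit=0, write_bar=NO_BARRIER,
               read_bar=NO_BARRIER, wait_mask=0):
    """Pack one instruction's control fields (assumed layout, low bits first)."""
    if not 0 <= stall <= 15:
        raise ValueError("stall count must fit in 4 bits (0..15)")
    if yield_bit not in (0, 1):
        raise ValueError("yield bit must be 0 or 1")
    if not (0 <= write_bar <= 7 and 0 <= read_bar <= 7):
        raise ValueError("barrier indices must fit in 3 bits (0..7)")
    if not 0 <= wait_mask <= 0x3F:
        raise ValueError("wait mask must fit in 6 bits")
    return (stall
            | (yield_bit << 4)    # yield hint bit (polarity not modeled here)
            | (write_bar << 5)    # barrier set when the result is written
            | (read_bar << 8)     # barrier set when the sources are read
            | (wait_mask << 11))  # one bit per barrier to wait on

def pack_control_word(codes):
    """Combine three per-instruction codes, 21 bits each (assumed)."""
    if len(codes) != 3:
        raise ValueError("expected exactly three control codes")
    return codes[0] | (codes[1] << 21) | (codes[2] << 42)

# Stall one cycle with no barriers, then wait on barrier 0 with a 15-cycle stall.
print(hex(pack_sched(1)))                      # 0x7e1
print(hex(pack_sched(15, wait_mask=0b000001)))  # 0xfef
```

The 4-bit stall field is consistent with the 15-cycle maximum mentioned above; everything else in the layout should be treated as a reading aid rather than a reference.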
> Now, let's have a look at the (impressive) performance improvements. :-)
> I measured on a GeForce GTX 750 Ti (GM107) reclocked to the highest perf level,
> with and without the control codes (NV50_PROG_SCHED=0/1).
> app: number of FPS without -> number of FPS with (+gain%)
> FurMark: 13 -> 42 (+223%)
> Pixmark Piano: 2 -> 7 (+250%)
> Pixmark Volplosion: 6 -> 20 (+233%)
> Julia F32: 61 -> 219 (+259%)
> LightMarks: 352 -> 685 (+94%)
> Heaven (low): 51 -> 102 (+100%)
> Heaven (ultra): 14 -> 27 (+93%)
> Valley (low): 30 -> 68 (+126%)
> Valley (ultra): 18 -> 39 (+116%)
> Talos (low): 32 -> 50 (+56%)
> Talos (ultra): 7 -> 14 (+100%)
> Shadow of Mordor (lowest): 13 -> 20 (+53%)
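For anyone who wants to check or extend the table, the gain column is the relative FPS increase. A minimal sketch follows (the function name is made up; the sample values are taken from the table above, which reports whole percentages):

```python
def gain_percent(fps_before: float, fps_after: float) -> float:
    """Relative FPS improvement in percent: (after - before) / before * 100."""
    if fps_before <= 0:
        raise ValueError("fps_before must be positive")
    return (fps_after - fps_before) / fps_before * 100.0

# Values from the table: (FPS without control codes, FPS with control codes)
samples = {
    "FurMark": (13, 42),
    "Heaven (low)": (51, 102),
    "Talos (low)": (32, 50),
}

for app, (before, after) in samples.items():
    print(f"{app}: {before} -> {after} (+{gain_percent(before, after):.1f}%)")
```

With such low absolute frame rates (2 to 14 FPS in several rows), a one-frame measurement difference moves the percentage a lot, so the small-number rows are best read as rough indications.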
> That's it! I think it's enough to understand the power of Maxwell control
> codes. We may get additional numbers from Phoronix (wink, wink, Michael).
> As I said in the main patch, the control codes can be disabled with
> 'export NV50_PROG_SCHED=0'.
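If you want to script the with/without comparison, something along these lines should work. It is only a sketch: NV50_PROG_SCHED and its 0/1 values come from this series, but the helper names and the benchmark command are placeholders you would replace with your own.

```python
import os
import subprocess

def sched_env(enabled: bool, base=None) -> dict:
    """Copy an environment and set NV50_PROG_SCHED to "1" or "0"."""
    env = dict(os.environ if base is None else base)
    env["NV50_PROG_SCHED"] = "1" if enabled else "0"
    return env

def run_benchmark(cmd, enabled: bool):
    """Run `cmd` (a list of arguments) with the control codes on or off."""
    return subprocess.run(cmd, env=sched_env(enabled), check=True)

# Hypothetical usage; replace the command with your actual benchmark:
#   run_benchmark(["./my_benchmark", "--fullscreen"], enabled=False)
#   run_benchmark(["./my_benchmark", "--fullscreen"], enabled=True)
```

Passing the variable through `env=` keeps the two runs independent of whatever is exported in the calling shell.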
> Now, let's have a look at how Nouveau performs compared to NVIDIA's blob.
> app: number of FPS with Nouveau -> number of FPS with the blob (+gain%)
> FurMark: 42 -> 59 (+40%)
> Pixmark Piano: 7 -> 13 (+85%)
> Pixmark Volplosion: 20 -> 42 (+110%)
> Julia F32: 219 -> 351 (+60%)
> LightMarks: 685 -> 1192 (+74%)
> Heaven (low): 102 -> 144 (+41%)
> Heaven (ultra): 27 -> 46 (+70%)
> Valley (low): 68 -> 94 (+38%)
> Valley (ultra): 39 -> 60 (+53%)
> Talos (low): 50 -> 128 (+156%)
> Talos (ultra): 14 -> 30 (+114%)
> Shadow of Mordor (lowest): 20 -> 77 (+285%)
> Nouveau is still far away from the blob, but now I think Maxwell is actually
> in roughly the same shape as Kepler in terms of performance and features.
> Speaking about this, I will enable OpenGL 4.3 on Maxwell in a separate patch,
> later on.
> The overhead at compile time added by this series is rather small. For a full
> shader-db run with my private repository of shaders, it takes approximately
> 208s for compiling 25k shaders before the series and approximately 211s after.
> Less than 2% of overhead and it's comparable to a full shader-db run on Kepler.
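The arithmetic behind the "less than 2%" figure, spelled out as a quick sketch (it takes "25k" as exactly 25,000 shaders, which is an approximation, and the function name is made up):

```python
def overhead_percent(before_s: float, after_s: float) -> float:
    """Extra compile time relative to the baseline, in percent."""
    return (after_s - before_s) / before_s * 100.0

shaders = 25_000                 # "25k" from the text, taken as an exact count
before_s, after_s = 208.0, 211.0  # full shader-db run, before and after the series

print(f"{overhead_percent(before_s, after_s):.2f}% overhead")         # 1.44% overhead
print(f"{(after_s - before_s) / shaders * 1000:.2f} ms per shader")   # 0.12 ms per shader
```

In other words, the scheduling pass costs on the order of a tenth of a millisecond per shader on this run.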
> No regressions with both piglit and dEQP (tested multiple times) and all
> benchmarks/games I have tried render fine and seem to be quite stable.
> Due to a lack of time, some parts are still left to do and some others could
> be improved. With the following ideas implemented I'm pretty sure we can
> improve performance significantly.
> * Add support for the yield flag. This seems to be a hint to the hardware for
> improving how the work is balanced between the warps. I didn't figure out
> how and where to use it without breaking a bunch of things. Need time and
> * Add support for dual-issue. The rules are quite different from Kepler's,
> especially because of the dependency barriers. Note that the yield flag has
> to be set, otherwise the hardware won't dual-issue and in fact it will wait
> for all dependencies (i.e. st 0x0), which is really different from what you
> are looking for.
> * Reduce stall counts. A bunch of instructions have a read latency which is the
> number of cycles before they can actually read the sources. This should be
> fairly easy to implement but will require some reverse engineering to
> completely understand the idea.
> This is my last contribution for the Nouveau driver for a while because I have
> been hired by Valve to work on radeonsi. Do not expect such perf improvements
> with radeonsi because it already performs really well, unlike Nouveau. But
> with time and patience we can do better. :-)
> This series is also available from my fdo account:
> Please, review!
> [1] https://github.com/NervanaSystems/maxas/wiki/Control-Codes
> Samuel Pitoiset (5):
> nv50/ir: do not insert texture barriers on gm107
> nv50/ir: improve instruction pipelining on gm107
> nv50/ir: use sched control codes for gm107 builtins
> nvc0: use sched control codes for gm107 blitter shader
> nvc0: use sched control codes for gm107 MP counters code
> src/gallium/drivers/nouveau/codegen/lib/gm107.asm | 40 +-
> .../drivers/nouveau/codegen/lib/gm107.asm.h | 40 +-
> .../drivers/nouveau/codegen/nv50_ir_emit_gm107.cpp | 771 ++++++++++++++++++++-
> .../nouveau/codegen/nv50_ir_lowering_nvc0.cpp | 3 +-
> .../nouveau/codegen/nv50_ir_target_gm107.cpp | 253 +++++++
> .../drivers/nouveau/codegen/nv50_ir_target_gm107.h | 7 +
> .../drivers/nouveau/nvc0/nvc0_query_hw_sm.c | 88 +--
> src/gallium/drivers/nouveau/nvc0/nvc0_surface.c | 20 +-
> 8 files changed, 1127 insertions(+), 95 deletions(-)