[Mesa-dev] [PATCH 0/5] nvc0: better instruction pipelining for Maxwell GPUs

Fri Jan 6 11:00:57 UTC 2017

On 01/06/2017 11:53 AM, Jan Vesely wrote:
> On Fri, 2016-12-23 at 00:15 +0100, Samuel Pitoiset wrote:
>> Hello,
>>
>> This series makes use of the scheduling control code in order to improve the
>> instruction pipelining on Maxwell GPUs.
>>
>> Kepler already requires a control instruction to be inserted every 7
>> instructions. Maxwell added more control codes, and the control instruction
>> now has to be inserted every 3 instructions. Maxwell control codes are
>> really powerful and well documented [1]. By the way, I would like to thank
>> Scott Gray, who did awesome reverse-engineering work, although I had to
>> figure out the missing parts myself.
>>
>> On Maxwell, control codes are mainly used for setting stall counts and for
>> producing/consuming dependency barriers in order to avoid hazards. I'm not
>> going to explain in detail how they work because the documentation is quite
>> good and because I added explanations here and there in the source code. But
>> the main thing to understand is that the control code previously used by
>> default (i.e. st 0x0) means "wait for all dependencies and stall the
>> pipeline for 15 cycles, which is the maximum".
>> Which is quite bad...
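
For readers who do not want to open [1] right away, here is a minimal Python
sketch of how one control code could be packed. The field layout (4-bit stall
count, 1-bit yield, 3-bit write barrier, 3-bit read barrier, 6-bit wait mask,
4-bit reuse flags, three 21-bit codes per 64-bit control word) is a reading of
the maxas documentation and should be treated as an assumption. The names
pack_control and pack_sched_word are hypothetical and do not exist in Mesa,
and the polarity of the yield bit is not covered here.

```python
NO_BARRIER = 7  # assumed encoding for "no dependency barrier produced"


def pack_control(stall, yield_bit=0, write_bar=NO_BARRIER,
                 read_bar=NO_BARRIER, wait_mask=0, reuse=0):
    """Pack one 21-bit control code (assumed layout, see text above)."""
    if not 0 <= stall <= 15:
        raise ValueError("stall count must fit in 4 bits (0..15)")
    if yield_bit not in (0, 1):
        raise ValueError("yield is a single bit")
    if not 0 <= write_bar <= 7 or not 0 <= read_bar <= 7:
        raise ValueError("barrier indices must fit in 3 bits (0..7)")
    if not 0 <= wait_mask <= 0x3f:
        raise ValueError("wait mask must fit in 6 bits")
    if not 0 <= reuse <= 0xf:
        raise ValueError("reuse flags must fit in 4 bits")
    return (stall
            | yield_bit << 4
            | write_bar << 5
            | read_bar << 8
            | wait_mask << 11
            | reuse << 17)


def pack_sched_word(codes):
    """Pack the control codes of 3 instructions into one 64-bit word."""
    if len(codes) != 3:
        raise ValueError("one control word covers exactly 3 instructions")
    word = 0
    for slot, code in enumerate(codes):
        # Each instruction gets its own 21-bit slot in the control word.
        word |= (code & 0x1fffff) << (21 * slot)
    return word


# Stall for 1 cycle, set no barrier, wait on nothing.
code = pack_control(stall=1)
print(hex(code))  # 0x7e1
print(hex(pack_sched_word([code, code, code])))
```

The point of the sketch is only that a stall count of a few cycles with no
barriers is a very different request from the old default described above.
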
>>
>> Now, let's have a look at the (impressive) performance improvements. :-)
>> I measured on a GeForce GTX 750 Ti (GM107) reclocked to the highest perf level,
>> with and without the control codes (NV50_PROG_SCHED=0/1).
>>
>> app: number of FPS without -> number of FPS with (+gain%)
>>
>> FurMark:                   13  ->  42  (+223%)
>> Pixmark Piano:             2   ->  7   (+250%)
>> Pixmark Volplosion:        6   ->  20  (+233%)
>> Julia F32:                 61  ->  219 (+259%)
>> LightMarks:                352 ->  685 (+94%)
>> Heaven (low):              51  ->  102 (+100%)
>> Heaven (ultra):            14  ->  27  (+93%)
>> Valley (low):              30  ->  68  (+126%)
>> Valley (ultra):            18  ->  39  (+116%)
>> Talos (low):               32  ->  50  (+56%)
>> Talos (ultra):             7   ->  14  (+100%)
>> Shadow of Mordor (lowest): 13  ->  20  (+53%)
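
The gain column can be recomputed from the two FPS values. This is a small
sketch, assuming the gain is (with - without) / without. Since the FPS values
in the table are already rounded, a recomputed gain may differ from the listed
one by a point. gain_percent is a hypothetical helper name.

```python
def gain_percent(fps_without, fps_with):
    """Relative FPS gain in percent, as in the '(+gain%)' column."""
    if fps_without <= 0:
        raise ValueError("baseline FPS must be positive")
    return 100.0 * (fps_with - fps_without) / fps_without


# FurMark row: 13 FPS without the control codes, 42 FPS with them.
print(int(gain_percent(13, 42)))  # 223
```
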
>>
>> That's it! I think it's enough to understand the power of Maxwell control
>> codes. We may get additional numbers from Phoronix (wink, wink, Michael).
>> As I said in the main patch, the control codes can be disabled with
>> 'export NV50_PROG_SCHED=0'.
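
To compare both modes from a script instead of exporting the variable by
hand, something like the following sketch could be used. Only the
NV50_PROG_SCHED variable and its 0/1 values come from this series; the helper
name run_with_sched is hypothetical, and the probe command is a stand-in for
a real benchmark binary.

```python
import os
import subprocess
import sys


def run_with_sched(cmd, sched_enabled):
    """Run cmd with NV50_PROG_SCHED set to 1 (enabled) or 0 (disabled)."""
    env = dict(os.environ)
    env["NV50_PROG_SCHED"] = "1" if sched_enabled else "0"
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Stand-in for a real benchmark: a child process that prints the variable.
probe = [sys.executable, "-c",
         "import os; print(os.environ['NV50_PROG_SCHED'])"]
for enabled in (False, True):
    result = run_with_sched(probe, enabled)
    print(enabled, result.stdout.strip())
```

Replacing probe with the command line of the application under test gives one
run per mode with an otherwise identical environment.
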
>>
>> Now, let's have a look at how Nouveau performs compared to NVIDIA's blob
>> (FPS with Nouveau -> FPS with the blob).
>>
>> FurMark:                   42  ->  59   (+40%)
>> Pixmark Piano:             7   ->  13   (+85%)
>> Pixmark Volplosion:        20  ->  42   (+110%)
>> Julia F32:                 219 ->  351  (+60%)
>> LightMarks:                685 ->  1192 (+74%)
>> Heaven (low):              102 ->  144  (+41%)
>> Heaven (ultra):            27  ->  46   (+70%)
>> Valley (low):              68  ->  94   (+38%)
>> Valley (ultra):            39  ->  60   (+53%)
>> Talos (low):               50  ->  128  (+156%)
>> Talos (ultra):             14  ->  30   (+114%)
>> Shadow of Mordor (lowest): 20  ->  77   (+285%)
>
> I see +45% and +33% on my gm107m (prime) for Valley and Heaven
> (1024x768, medium), which pushes it above the integrated Skylake iGPU's
> performance. There are visual artifacts in both demos, but they appear
> the same with and without these patches.
>
> Tested-by: Jan Vesely <jan.vesely at rutgers.edu>

Thanks for testing.

Yeah, there is a sync issue with at least Valley and Heaven on Maxwell.

If you try with MESA_DEBUG=flush, the visual artifacts no longer happen.

>
> regards,
> Jan
>
>>
>> Nouveau is still far away from the blob, but now I think Maxwell is actually
>> in roughly the same shape as Kepler in terms of performance and features.
>> Speaking of which, I will enable OpenGL 4.3 on Maxwell in a separate patch,
>> later on.
>>
>> The compile-time overhead added by this series is rather small. A full
>> shader-db run with my private repository of shaders takes approximately
>> 208s to compile 25k shaders before the series and approximately 211s after.
>> That is less than 2% overhead, and it's comparable to a full shader-db run
>> on Kepler.
>>
>> No regressions with either piglit or dEQP (tested multiple times), and all
>> benchmarks/games I have tried render fine and seem to be quite stable.
>>
>> Due to a lack of time, some parts are still left to do and some others could
>> be improved. With the following ideas implemented I'm pretty sure we can
>> improve performance significantly.
>>
>> * Add support for the yield flag. This seems to be a hint to the hardware for
>>   improving how the work is balanced between the warps. I didn't figure out
>>   how and where to use it without breaking a bunch of things. Need time and
>>   patience.
>>
>> * Add support for dual-issue. The rules are pretty different from Kepler's,
>>   especially because of the dependency barriers. Note that the yield flag has
>>   to be set, otherwise the hardware won't dual-issue and in fact it will wait
>>   for all dependencies (i.e. st 0x0), which is really not what you are
>>   looking for.
>>
>> * Reduce stall counts. A bunch of instructions have a read latency which is the
>>   number of cycles before they can actually read the sources. This should be
>>   fairly easy to implement but will require some reverse engineering to
>>   completely understand the idea.
>>
>> This is my last contribution for the Nouveau driver for a while because I have
>> been hired by Valve to work on radeonsi. Do not expect such perf improvements
>> with radeonsi because it already performs really well, unlike Nouveau. But
>> with time and patience we can do better. :-)
>>
>> This series is also available from my fdo account:
>> https://cgit.freedesktop.org/~hakzsam/mesa/log/?h=gm107_scheduler
>>
>> Please, review!
>> Thanks.
>>
>> [1] https://github.com/NervanaSystems/maxas/wiki/Control-Codes
>>
>> Samuel Pitoiset (5):
>>   nv50/ir: do not insert texture barriers on gm107
>>   nv50/ir: improve instruction pipelining on gm107
>>   nv50/ir: use sched control codes for gm107 builtins
>>   nvc0: use sched control codes for gm107 blitter shader
>>   nvc0: use sched control codes for gm107 MP counters code
>>
>>  src/gallium/drivers/nouveau/codegen/lib/gm107.asm  |  40 +-
>>  .../drivers/nouveau/codegen/lib/gm107.asm.h        |  40 +-
>>  .../drivers/nouveau/codegen/nv50_ir_emit_gm107.cpp | 771 ++++++++++++++++++++-
>>  .../nouveau/codegen/nv50_ir_lowering_nvc0.cpp      |   3 +-
>>  .../nouveau/codegen/nv50_ir_target_gm107.cpp       | 253 +++++++
>>  .../drivers/nouveau/codegen/nv50_ir_target_gm107.h |   7 +
>>  .../drivers/nouveau/nvc0/nvc0_query_hw_sm.c        |  88 +--
>>  src/gallium/drivers/nouveau/nvc0/nvc0_surface.c    |  20 +-
>>  8 files changed, 1127 insertions(+), 95 deletions(-)
>>
>