[Mesa-dev] [PATCH v5 0/6] nv50/ir: Improve Performance of Integer Multiplication

Sat Aug 18 20:21:15 UTC 2018

Changes in v5:
- split the 4th patch into 3 patches
- the 4th/last three patches handle 64-bit multiplications
- rely on LateAlgebraicOpt to create a shladd for the near-power-of-two
  optimization
- set immediate to true in emitXMAD
Changes in v4:
- remove uint16_t(...) in nv50_ir.h
- change XMAD immediate size from signed 20 bit to unsigned 16 bit
- rework the 4th patch
Changes in v3:
- stylistic changes
- simplify createMulMethod2()
- update shader-db statistics
- use util_bitcount64 and util_next_power_of_two64 instead of
  reimplementing them
Changes in v2:
- rebase
- bring back constant folding for multiplication by power-of-twos for nv50
- remove TODO in nv50_ir_target_gm107.cpp
- document XMAD's flags
- change how XMAD's per-operand flags are represented
- move util/bitscan.h stuff into a new patch
- stylistic changes

This series improves the performance of integer multiplication by removing
much usage of the very slow IMAD and IMUL on Maxwell+ and improving
multiplication by immediates on Fermi+.

The first and second patches add support for the XMAD instruction in codegen.

The third patch replaces most IMADs and IMULs with a sequence of XMADs on
Maxwell+. This is far faster but increases the total instructions in the
shader-db by 0.90%, gpr count by 0.10% and local memory by 0.46%.
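
As a rough illustration of why a 32-bit multiply can be built from 16-bit
multiply-adds, here is a minimal C++ sketch of the arithmetic identity only.
It does not model XMAD's actual flags or the exact sequence the patch emits;
mad16() and mul32_via_16bit() are hypothetical names used for illustration.

```cpp
#include <cstdint>

// Hypothetical model of a single 16x16+32 multiply-add step
// (an assumption for illustration, not XMAD's real semantics).
static inline uint32_t mad16(uint16_t a, uint16_t b, uint32_t c) {
    return static_cast<uint32_t>(a) * static_cast<uint32_t>(b) + c;
}

// Low 32 bits of a * b using three 16-bit multiply-adds:
//   a * b = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) << 16)   (mod 2^32)
// The a_hi*b_hi term is shifted out entirely and can be dropped.
uint32_t mul32_via_16bit(uint32_t a, uint32_t b) {
    const uint16_t a_lo = static_cast<uint16_t>(a & 0xffffu);
    const uint16_t a_hi = static_cast<uint16_t>(a >> 16);
    const uint16_t b_lo = static_cast<uint16_t>(b & 0xffffu);
    const uint16_t b_hi = static_cast<uint16_t>(b >> 16);

    uint32_t cross = mad16(a_lo, b_hi, 0);     // first cross term
    cross = mad16(a_hi, b_lo, cross);          // add second cross term
    return mad16(a_lo, b_lo, cross << 16);     // low product + shifted cross terms
}
```

Three multiply-add steps replace one full-width multiply, which is consistent
with the instruction count going up even though each step is cheaper.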

The next three patches significantly lower these numbers. They replace
many multiplications by immediates with instructions that should be as fast
as or faster than the generic approach. They are also typically smaller and
less register heavy, so they decrease the total instruction count by 0.59%
and bring the gpr count and local memory back to normal.
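
The immediate cases named in the patch titles can be sketched in C++ as
follows. This is only an arithmetic model under stated assumptions: "near
power-of-two" is taken to mean 2^n + 1 or 2^n - 1, the order of the checks is
a guess, and is_pow2(), log2_pow2() and mul_by_imm() are hypothetical helpers
rather than functions from the patches.

```cpp
#include <cstdint>

// Hypothetical helper: true if x is a non-zero power of two.
static inline bool is_pow2(uint32_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

// Hypothetical helper: log2 of a power of two (count trailing zeros).
static inline unsigned log2_pow2(uint32_t x) {
    unsigned n = 0;
    while ((x & 1u) == 0) { x >>= 1; ++n; }
    return n;
}

// Low 32 bits of a * imm, choosing a cheaper form when imm allows it.
uint32_t mul_by_imm(uint32_t a, uint32_t imm) {
    if (imm == 0)
        return 0;
    if (is_pow2(imm))                       // a * 2^n -> a << n
        return a << log2_pow2(imm);
    if (is_pow2(imm - 1))                   // a * (2^n + 1) -> (a << n) + a
        return (a << log2_pow2(imm - 1)) + a;
    if (is_pow2(imm + 1))                   // a * (2^n - 1) -> (a << n) - a
        return (a << log2_pow2(imm + 1)) - a;
    if (imm <= 0xffffu) {                   // 16-bit immediate: two multiply-adds
        const uint32_t hi = (a >> 16) * imm;        // high half times imm
        return (a & 0xffffu) * imm + (hi << 16);    // low half times imm + shifted high
    }
    return a * imm;                         // generic fallback
}
```

For example, mul_by_imm(a, 9) takes the shift-and-add path and
mul_by_imm(a, 1000) takes the two-step 16-bit path.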

This series gives roughly a 50% speedup in fragment-heavy scenarios with
Dolphin 5.0 on my GTX 1060. All timings were made with interesting looking
fifos from Dolphin's bugtracker:
     Wind Waker: 18 FPS -> 26 FPS at 3x internal resolution
     Wind Waker:  8 FPS -> 11 FPS at 5x internal resolution
   Paper Mario?: 26 FPS -> 42 FPS at 5x internal resolution
SpongeBob Movie: 19 FPS -> 30 FPS at 5x internal resolution

It also gives a 9.79% improvement with Hitman and a 0.57% to 1.14% improvement.

Unigine Heaven and Unigine Valley seem to run the same at low quality with
no anti-aliasing and no tessellation. SuperTuxKart and 0 A.D. also show no
change.

These patches can also be found on my github:
https://github.com/pendingchaos/mesa/tree/nv-xmad-v5

The final changes in shader-db are as follows:

total instructions in shared programs : 5768871 -> 5786560 (0.31%)
total gprs used in shared programs    : 669919 -> 669968 (0.01%)
total shared used in shared programs  : 548832 -> 548832 (0.00%)
total local used in shared programs   : 21068 -> 21068 (0.00%)

                local     shared        gpr       inst      bytes 
    helped           0           0         232        1009        1009 
      hurt           0           0         238        2164        2164 

Rhys Perry (6):
  nv50/ir: add preliminary support for OP_XMAD
  gm107/ir: add support for OP_XMAD on GM107+
  nv50/ir: optimize imul/imad to xmads
  nv50/ir: move a * b -> a << log2(b) code into createMul()
  nv50/ir: optimize near power-of-twos into shladd
  nv50/ir: optimize multiplication by 16-bit immediates into two xmads

 src/gallium/drivers/nouveau/codegen/nv50_ir.h      |  26 ++++
 .../drivers/nouveau/codegen/nv50_ir_emit_gm107.cpp |  65 +++++++++
 .../drivers/nouveau/codegen/nv50_ir_peephole.cpp   | 156 ++++++++++++++++++---
 .../drivers/nouveau/codegen/nv50_ir_print.cpp      |  19 +++
 .../drivers/nouveau/codegen/nv50_ir_target.cpp     |   7 +-
 .../nouveau/codegen/nv50_ir_target_gm107.cpp       |   6 +-
 .../nouveau/codegen/nv50_ir_target_nv50.cpp        |   1 +
 .../nouveau/codegen/nv50_ir_target_nvc0.cpp        |  19 +++
 8 files changed, 278 insertions(+), 21 deletions(-)

-- 
2.14.4