Mesa (master): intel/ir: Import shader performance analysis pass.

Wed Apr 29 06:49:41 UTC 2020

Module: Mesa
Branch: master
Commit: 188a3659aea6dec9acf1c2fd15fcaecffe4f7d4e
URL:    http://cgit.freedesktop.org/mesa/mesa/commit/?id=188a3659aea6dec9acf1c2fd15fcaecffe4f7d4e

Author: Francisco Jerez <currojerez at riseup.net>
Date:   Thu Mar 26 14:59:02 2020 -0700

intel/ir: Import shader performance analysis pass.

This introduces an analysis pass intended to estimate several
performance statistics of the shader, including cycle count latency
and throughput values, based on static modeling.  Its instruction
performance information is more comprehensive than that of the current
scheduling pass for all platforms from Gen4 through Gen11, and it works
on both the FS and VEC4 back-ends.

The most immediate purpose of this pass is to implement a heuristic
meant to determine whether using SIMD32 dispatch for a fragment shader
can be expected to help more than it hurts.  In addition, this will
make the effect of passes run after scheduling (e.g. the TGL software
scoreboard pass and the VEC4 dependency control pass) visible in
shader-db statistics.
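
As a rough illustration of the kind of comparison such a SIMD32
heuristic could make, the sketch below compares the estimated number of
invocations retired per cycle by two variants of the same shader.  The
perf_estimate structure, its fields and both function names are
hypothetical and are not taken from this patch; the actual pass models
considerably more detail.

```cpp
#include <cassert>

// Hypothetical summary of what a static performance model might report
// for one compiled variant of a shader (not the data structure added by
// this patch).
struct perf_estimate {
   unsigned dispatch_width;   // SIMD width of the variant (8, 16 or 32)
   double cycles_per_thread;  // estimated steady-state cycles per thread
};

// Estimated number of shader invocations retired per cycle.
double invocations_per_cycle(const perf_estimate &p)
{
   assert(p.dispatch_width > 0 && p.cycles_per_thread > 0.0);
   return p.dispatch_width / p.cycles_per_thread;
}

// Returns true if the SIMD32 variant is estimated to be at least as
// fast as the SIMD16 variant.
bool prefer_simd32(const perf_estimate &simd16, const perf_estimate &simd32)
{
   return invocations_per_cycle(simd32) >= invocations_per_cycle(simd16);
}
```

Under these assumptions a SIMD32 variant only wins if its per-thread
cost is less than twice that of the SIMD16 variant, which is the "help
more than it hurts" trade-off in its simplest form.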

But that isn't the end of the story.  Other potential applications of
this pass (not part of this MR) that I've been playing around with are:

 - Implement a similar SIMD16 heuristic allowing the identification of
   inefficient SIMD16 fragment shaders.

 - Implement similar SIMD16 and SIMD32 heuristics for the compute
   shader stage -- Currently compute shader builds always use the
   SIMD16 shader if available and never use the SIMD32 shader unless
   strictly necessary, which is suboptimal under certain conditions.

 - Hook up to the instruction scheduler in order to improve the
   accuracy of its timing information.

 - Use as a heuristic to drive the selection of scheduling modes
   (Matt was experimenting with that).

 - Plug into the TGL software scoreboard pass in order to implement a
   more effective SBID token allocation algorithm, since in general
   the optimal token allocation depends on the timings of all
   instructions in the program.

 - Use its bottleneck detection functionality in order to implement a
   heuristic computing a better bound for the number of fragment
   shader threads executed in parallel (by adjusting the
   MaximumNumberofThreadsPerPSD control of 3DSTATE_PS).
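
To make the idea of bottleneck detection from the last item concrete,
here is a minimal sketch that picks the busiest unit out of a set of
per-unit busy times.  The unit_usage structure, the find_bottleneck()
helper and the unit names used in the example are all hypothetical; the
sketch only assumes that a static model can attribute some number of
busy cycles to each shared hardware unit.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical per-unit result of a static model: how many cycles the
// unit is estimated to be busy while executing the shader.
struct unit_usage {
   std::string name;
   unsigned busy_cycles;
};

// Returns the unit with the largest estimated busy time, i.e. the one
// that is expected to bound the throughput of the shader.
unit_usage find_bottleneck(const std::vector<unit_usage> &units)
{
   assert(!units.empty());
   return *std::max_element(units.begin(), units.end(),
                            [](const unit_usage &a, const unit_usage &b) {
                               return a.busy_cycles < b.busy_cycles;
                            });
}
```

For example, given estimated busy times of 120 cycles for an ALU, 300
for a sampler and 80 for a data port, the sampler would be reported as
the bottleneck.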

As a follow-up I'm planning to submit updated timing information for
Gen12 platforms -- Everything else required to support Gen12, like SWSB
handling, is already included in this patch, but there were some IP
concerns regarding the TGL timing parameters since they cannot
currently be obtained from the documentation and hardware that are
publicly available.  The timing parameters for the earlier Gen7-11
platforms can be obtained by anyone by sampling the timestamp register
using e.g. shader_time, though I have some more convenient
instrumentation coming up.
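
The arithmetic behind such a timestamp-based measurement can be
sketched as follows.  This is not the shader_time implementation: the
function names are hypothetical, and the sketch assumes 32-bit
timestamp samples taken around a run of identical instructions, plus a
separately measured overhead for the sampling itself.

```cpp
#include <cassert>
#include <cstdint>

// Ticks elapsed between two 32-bit timestamp samples.  Unsigned
// subtraction gives the right answer across a single wraparound.
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
   return end - start;
}

// Average cost in ticks of one instruction, given samples taken around
// 'count' back-to-back copies of it and the measured sampling overhead.
double ticks_per_instruction(uint32_t start, uint32_t end,
                             uint32_t overhead, unsigned count)
{
   assert(count > 0);
   const uint32_t total = elapsed_ticks(start, end);
   assert(total >= overhead);
   return double(total - overhead) / count;
}
```

For instance, samples of 1000 and 1660 around 64 copies of an
instruction, with a sampling overhead of 20 ticks, give an average of
10 ticks per instruction.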

Reviewed-by: Kenneth Graunke <kenneth at whitecape.org>

---

 src/intel/Makefile.sources                |    2 +
 src/intel/compiler/brw_fs.h               |    3 +
 src/intel/compiler/brw_fs_visitor.cpp     |    2 +
 src/intel/compiler/brw_ir_performance.cpp | 1561 +++++++++++++++++++++++++++++
 src/intel/compiler/brw_ir_performance.h   |   86 ++
 src/intel/compiler/brw_vec4.h             |    3 +
 src/intel/compiler/brw_vec4_visitor.cpp   |    2 +-
 src/intel/compiler/meson.build            |    2 +
 8 files changed, 1660 insertions(+), 1 deletion(-)

Diff:   http://cgit.freedesktop.org/mesa/mesa/diff/?id=188a3659aea6dec9acf1c2fd15fcaecffe4f7d4e