[Mesa-dev] Require micro-benchmarks for performance optimization oriented patches

Eero Tamminen eero.t.tamminen at intel.com
Thu Nov 20 08:46:50 PST 2014


Hi,

 > Honestly, I think I'm okay with our usual metrics like:
 > - Increased FPS in a game or benchmark
 > - Reduced number of instructions or memory accesses in
 >   a shader program
 > - Reduced memory consumption
 > - Significant cycle reduction in callgrind or better generated code
 >   (ideally if it's a lot of code I'd like better justification)

Profiling tools like Callgrind are meant for analysis, not for
measurement.

The problem with profiler data is that the cost may have just
been moved elsewhere, *and* grown:

* KCachegrind's visualization of Valgrind/Callgrind data shows call
   counts and relative costs.  If the relative cost of a given
   function has decreased, that still says nothing about:
   - its absolute cost, i.e.
   - whether the cost merely moved somewhere else instead of the
     total cost really decreasing

* Callgrind reports instruction counts, not cycles.  While
   those are a good indicator, they don't necessarily reflect
   real performance (instruction counts e.g. don't take data
   or instruction cache misses into account).

* Valgrind tracks CPU utilization only in user space.
   It doesn't notice increased CPU utilization on the kernel side.

* Valgrind tracks only a single process; it doesn't notice
   increased CPU utilization in other processes (in graphics
   performance, the X server side is sometimes relevant).

* Valgrind doesn't track GPU utilization.  A change may have
   moved more load there.

-> Looking just at Callgrind data is NOT enough; there must
also be some real measurement data.
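
(To give an idea of what I mean by real measurement data, here's a
minimal sketch in C: wall-clock time from clock_gettime() plus
user/system CPU time from getrusage(), which already catches the
kernel-side CPU cost that Callgrind misses.  run_workload() is just a
placeholder for whatever test case exercises the changed path, not an
existing function.)

#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Placeholder for the actual test case exercising the changed code path. */
extern void run_workload(void);

static double ts_sec(struct timespec t) { return t.tv_sec + t.tv_nsec / 1e9; }
static double tv_sec(struct timeval t)  { return t.tv_sec + t.tv_usec / 1e6; }

int main(void)
{
        struct timespec w0, w1;
        struct rusage r0, r1;

        clock_gettime(CLOCK_MONOTONIC, &w0);
        getrusage(RUSAGE_SELF, &r0);

        run_workload();

        clock_gettime(CLOCK_MONOTONIC, &w1);
        getrusage(RUSAGE_SELF, &r1);

        /* Wall clock catches waits, user/system split shows kernel-side cost. */
        printf("wall:   %.3f s\n", ts_sec(w1) - ts_sec(w0));
        printf("user:   %.3f s\n", tv_sec(r1.ru_utime) - tv_sec(r0.ru_utime));
        printf("system: %.3f s\n", tv_sec(r1.ru_stime) - tv_sec(r0.ru_stime));
        return 0;
}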


As to measurements...

Even large performance improvements fail to show up with
the wrong benchmark.  One needs to know whether the test case's
performance is actually bound by what you were trying to optimize.

If no such test case is known, the simplest option may be to write
a micro-benchmark for the change (preferably two: one for the best
case and one for the worst case).
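
Such a micro-benchmark doesn't need to be anything more than a timed
loop over the changed code path, run once with best-case and once with
worst-case input, so that the iterations/second rate can be compared
before and after the patch.  Something along these lines (setup_case()
and exercise_path() are hypothetical placeholders, to be replaced with
whatever the patch actually touches):

#include <stdio.h>
#include <time.h>

/* Hypothetical hooks: set up best-/worst-case input, then hit the changed path. */
extern void setup_case(int worst);
extern void exercise_path(void);

static double now_sec(void)
{
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return t.tv_sec + t.tv_nsec / 1e9;
}

static void bench(const char *name, int worst, int iterations)
{
        setup_case(worst);
        double start = now_sec();
        for (int i = 0; i < iterations; i++)
                exercise_path();
        double elapsed = now_sec() - start;
        printf("%s: %.0f iterations/s\n", name, iterations / elapsed);
}

int main(void)
{
        bench("best case ", 0, 100000);
        bench("worst case", 1, 100000);
        return 0;
}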


     - Eero

PS. While analyzing memory usage isn't harder, measuring it is,
because memory can be shared (both in user space and in kernel
buffers), and it makes a large difference whether that memory is
clean or dirty (unless your problem is running out of 32-bit
address space, in which case clean memory is just as much of a
problem).
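
If one does want numbers for the user-space side, a rough sketch is to
sum the per-mapping counters from /proc/self/smaps, which at least
separates shared vs. private and clean vs. dirty pages (kernel-side
buffers, e.g. GEM objects, still won't show up here):

#include <stdio.h>

/* Sum the clean/dirty, shared/private kB counters from /proc/self/smaps. */
int main(void)
{
        FILE *f = fopen("/proc/self/smaps", "r");
        if (!f) {
                perror("smaps");
                return 1;
        }
        long shared_clean = 0, shared_dirty = 0;
        long private_clean = 0, private_dirty = 0;
        char line[256];
        long kb;

        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "Shared_Clean: %ld kB", &kb) == 1)
                        shared_clean += kb;
                else if (sscanf(line, "Shared_Dirty: %ld kB", &kb) == 1)
                        shared_dirty += kb;
                else if (sscanf(line, "Private_Clean: %ld kB", &kb) == 1)
                        private_clean += kb;
                else if (sscanf(line, "Private_Dirty: %ld kB", &kb) == 1)
                        private_dirty += kb;
        }
        fclose(f);

        printf("shared  clean/dirty: %ld / %ld kB\n", shared_clean, shared_dirty);
        printf("private clean/dirty: %ld / %ld kB\n", private_clean, private_dirty);
        return 0;
}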

