[Mesa-dev] [PATCH] glsl: optimize list handling in opt_dead_code

Wed Oct 19 13:55:35 UTC 2016

Hi,

On 18.10.2016 20:12, Jan Ziak wrote:
[...]
>> Never profile with -O0 or disabled function inlining.
>
> Seriously?

Nobody's going to take optimization results from non-optimized builds
seriously.

>> Mesa uses -g -O2
>> with --enable-debug, so that's what you should use too. Don't use any
>> other -O* variants.
>
> What if I find a case where -O2 prevents me from easily seeing
> information necessary to optimize the source code?

I've never had that problem since ARM call unwinding was fixed in GCC
and in the profiling tools a decade ago (Valgrind even has some patches
for that from the work we were doing at my previous employer).

You need to know what compiler optimizations do to better understand
the data shown, but nowadays tools can e.g. show inlined code correctly.
In general, if you have problems with optimized builds, either your
tools or your builds are broken.

(C++ does make things a bit more difficult, because there's *much* more
inlining happening with compiler optimizations on C++ code.)
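
As a rough illustration of the inlining point, here is a minimal C
sketch. It is hypothetical code, not taken from Mesa; the names mix()
and hash_words() and the DEMO_MAIN guard are made up for the example.
In an optimized build the small helper is typically inlined, so its
cost shows up under the caller unless the tool reads the inline
information from the debug data.

```c
/* Hypothetical example, not Mesa code.
 * Suggested build, using the flags discussed in this thread:
 *   gcc -g -O2 -fno-omit-frame-pointer -DDEMO_MAIN inline_demo.c -o inline_demo
 */
#include <stddef.h>
#include <stdint.h>

/* Small helper: at -O2 the compiler will typically inline this into
 * its caller, so no separate call remains at run time. */
static inline uint32_t mix(uint32_t h, uint32_t v)
{
    return (h ^ v) * 16777619u;   /* FNV-1a style step, on whole words */
}

/* Hot loop: in an optimized build the cost of mix() is attributed to
 * this function unless the profiler understands inlined frames. */
uint32_t hash_words(const uint32_t *words, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++)
        h = mix(h, words[i]);
    return h;
}

#ifdef DEMO_MAIN   /* guarded so the functions can also be linked elsewhere */
#include <stdio.h>
int main(void)
{
    static uint32_t words[1 << 16];   /* zero-initialized input */
    uint32_t h = 0;
    for (int rep = 0; rep < 2000; rep++) {
        words[0] = (uint32_t)rep;     /* vary the input each round */
        h ^= hash_words(words, sizeof(words) / sizeof(words[0]));
    }
    printf("%u\n", (unsigned)h);
    return 0;
}
#endif
```

Building the same file once with -O0 and once with -O2 and comparing
the two profiles should make the difference in attribution visible.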

(The rest of the mail is general comments on profiling, not so much
aimed at you or Marek; I assume you both already know that stuff.)

>> The only profiling tools reporting correct results are perf and
>> sysprof.

Perf uses sampling and reports averages.  While perf varies the sampling
rate, sampling can still misrepresent some things (small, frequently
called functions), and averages aren't good for everything.

That's why one should *also* use something like Valgrind which doesn't 
miss things (although it cannot accurately estimate how much time they 
take), so that you can see all call chains & call counts.
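
A minimal C sketch of the kind of workload where the two views differ.
This is hypothetical code, not Mesa code; tiny(), heavy(),
run_workload() and the DEMO_MAIN guard are invented for the example,
and __attribute__((noinline)) assumes GCC or Clang. A sampling profiler
shows tiny() only as a share of samples, while Callgrind lists its
exact call count and callers.

```c
/* Hypothetical example, not Mesa code.
 * Build:   gcc -g -O2 -fno-omit-frame-pointer -DDEMO_MAIN calls_demo.c -o calls_demo
 * Sample:  perf record -g ./calls_demo && perf report
 * Count:   valgrind --tool=callgrind ./calls_demo
 *          callgrind_annotate callgrind.out.<pid>
 */
#include <stdint.h>
#include <stdio.h>

uint64_t tiny_calls;   /* how often the cheap helper ran */

/* Cheap helper called very often. noinline (GCC/Clang attribute)
 * keeps it a separate function in optimized builds. */
__attribute__((noinline)) uint32_t tiny(uint32_t x)
{
    tiny_calls++;
    return x * 2654435761u;
}

/* Expensive function called only once per workload. */
__attribute__((noinline)) uint32_t heavy(uint32_t rounds)
{
    uint32_t acc = 1;
    for (uint32_t i = 0; i < rounds; i++)
        acc = acc * 1664525u + 1013904223u;
    return acc;
}

/* One heavy() call followed by n tiny() calls. */
uint32_t run_workload(uint32_t n)
{
    uint32_t sum = heavy(n * 16);
    for (uint32_t i = 0; i < n; i++)
        sum += tiny(i);
    return sum;
}

#ifdef DEMO_MAIN   /* guarded so the functions can also be linked elsewhere */
int main(void)
{
    uint32_t r = run_workload(1000000u);
    printf("result=%u tiny_calls=%llu\n",
           (unsigned)r, (unsigned long long)tiny_calls);
    return 0;
}
#endif
```

The counter printed at the end gives a reference value to compare
against the call count that Callgrind reports for tiny().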

This isn't about latency, but for that a good Intel PT based tool would
be the most accurate.  Like the data provided by the ARM ETM interface,
it's very awkward to use though (GBs of data to process, tools not open
source etc).

> I used perf on Metro 2033 Redux and saw do_dead_code() there. Then I
> used callgrind to see some more code.
>
>> (both use the same mechanism) If you don't enable dwarf in
>> perf (also sysprof can't use dwarf), you have to build Mesa with
>> -fno-omit-frame-pointer to see call trees. The only reason you would
>> want to enable dwarf-based call trees is when you want to see libc
>> calls. Otherwise, they won't be displayed or counted as part of call
>> trees. For Mesa developers who do profiling often,
>> -fno-omit-frame-pointer should be your default.
>
>> Callgrind counts calls (that one you can trust), but the reported time
>> is incorrect,

Callgrind reports the number of instructions executed, not time.

Cachegrind can provide estimates of how much time is taken, but as you
mentioned, it's not very reliable (while one can specify cache sizes
similar to those of the target machine, the cache model is inaccurate,
and I don't think it counts SIMD code correctly).

Both report this data only for the user-space process, not for the work 
the process requests from the kernel.
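
To get a feeling for how much of a workload is kernel-side, one option
is to compare user and system CPU time. A minimal sketch, assuming a
POSIX system with /dev/null; write_many(), cpu_seconds() and the
DEMO_MAIN guard are hypothetical names for this example, while
getrusage() is the standard POSIX call.

```c
/* Hypothetical example, not Mesa code.
 * Build: gcc -g -O2 -DDEMO_MAIN rusage_demo.c -o rusage_demo
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <unistd.h>

/* Issue `count` one-byte write() calls to /dev/null. Returns the number
 * of successful writes, or -1 if the file could not be opened. */
long write_many(long count)
{
    int fd = open("/dev/null", O_WRONLY);
    if (fd < 0)
        return -1;
    long ok = 0;
    char byte = 'x';
    for (long i = 0; i < count; i++)
        if (write(fd, &byte, 1) == 1)
            ok++;
    close(fd);
    return ok;
}

/* Store user and system CPU time of this process, in seconds.
 * Returns 0 on success, -1 on failure. */
int cpu_seconds(double *user, double *sys)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    *user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    *sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    return 0;
}

#ifdef DEMO_MAIN   /* guarded so the functions can also be linked elsewhere */
int main(void)
{
    double user, sys;
    long done = write_many(1000000);
    if (done < 0 || cpu_seconds(&user, &sys) != 0)
        return 1;
    printf("writes=%ld user=%.3fs system=%.3fs\n", done, user, sys);
    return 0;
}
#endif
```

The system time printed here is the part that, per the point above,
would not appear in Callgrind or Cachegrind numbers for the same run.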

[...]

	- Eero