callgrind / profiling ...

Michael Meeks michael.meeks at
Wed Dec 5 07:37:05 PST 2012

Hi guys,

	Forwarding Josef's rather interesting mail (with permission) to the dev
list - since it's generally useful :-) Stephan is working on plugging a
nice callgrinding script into tinderbox building so we can build an
automated performance tracker that will catch even small - 0.5% type
performance regressions which can accumulate over time - and do it in a
reproducible way, independent of the underlying hardware.

	Thanks so much for working on this Stephan; it seems the punch-line
formula for the magic number should be:

> Actually, what I have in recent versions of {Q,K}Cachegrind is this:
>  CEst = Ir + 10 Bm + 10 L1m + 20 Ge + 100 L2m + 100 LLm

	ATB !


-------- Forwarded Message --------


Am 04.12.2012 15:07, schrieb Julian Seward:
> Josef,
> Gruesse.  See below ..
> Michael:
> Josef is the expert on this, so forwarding it to him.
> It used to be "1 * #insns + 10 * #L1 misses + 100 * #L2 misses",
> but I wouldn't put too much reliance on it.  It was just something
> that (IIRC!) NickN and I kludged up during a coffee break probably
> about a decade ago at  Like so many other bits of
> Valgrind :-) (*)

AFAIK, Cachegrind always just printed out the raw event numbers.
Also, nowhere in Callgrind you will find that magic formula ;-)
I think I actually was the one who came up with that formula as
a rough "Cycle Estimation" in KCachegrind ;-)

Actually, doing a somehow accurate time estimation is quite tricky.
The rule of thumb should be: If you come up with an optimization
reducing miss numbers (and also the time estimation formula above),
always also check against real time.

In general, any estimation using cache metrics gets more accurate
the worse the cache behavior of the application, as time needed
for memory accesses is more predictable and hides any smaller latencies
within a core, especially with out-of-order execution, and even
ARM application cores are out-of-order nowadays (Cortex A9/A15).

The most accurate thing next to real measurement would be cycle-accurate
simulation, which of course is out of scope. There is some nice work
on a simpler, yet quite accurate processor timing model using so-called
"interval simulation", as implemented in the Sniper simulator
(see However, it's roughly an order of magnitude
slower than Cachegrind/Callgrind.

The biggest issues with the simple formulas above are:
(1) Hardware prefetching is not taken into account, which often allows
streaming of data at maximum bandwidth from memory,
(2) Latency hiding effects are not included, e.g. it easily can happen
that with lots of memory accesses, it does not matter if the code
actually was compiled with -O3 or -O0. Further, it can make
a huge difference whether two consecutive memory accesses depend on each 
other or not (ie. one hiding the latency of the other or not).
(3) Saturation effects are not taken into account. E.g. if all cores
on a multicore chip do memory accesses at the same time, the limited
bandwidth going off-chip will slow down each of them.

Thus, the accuracy of the simple formula very much depends on an
"average" application behavior, in the hope that the difficult
effects mentioned above do not play too much of a role each.

BTW, one can argue whether L1/LL misses on write should appear in
the formula at all. Writes usually are done by the hardware in the
background. However, if there is a memcpy operation doing a lot
of consecutive writes, write buffers will fill up, and will
throttle execution in the end.

 > That said, I am sure it got checked by Josef :-)

Despite above issues, I still think that it is useful to combine
the cache simulation results into one value. It makes it easier to
get an overview regarding the cache behavior.

The thing is: too make time estimation more accurate, you need to
extend the model, which results in slower simulation.
I have some ideas to make the cache model more accurate for time
estimation, but I am not sure it is worth it (real measurements
are faster anyway).

E.g. there is an optional hardware prefetch simulation in Callgrind.
We can introduce a new event "LL miss detected by prefetcher", and
give this e.g. a much lower penalty, depending on the bandwidth your
hardware can do: e.g. if you can do 1GB/s on a 1GHz processor, this
means 1 bytes on average per clock, thus with a 64byte line size
one line can be fetched every 64 cycles. As every LL miss will also
be a L1 miss (with penalty of 10), you would go for around 50 for
the penalty for the new prefetchable memory access event.

For latency effects, it gets more complicated as you want to check
if there is a dependency between memory accesses, and do something
similar to above mentioned interval simulation.

For the saturation effects, one needs to take simultaneous
execution of multiple threads into account.

> Josef: is there a more recent version of the pseudo-cycles formula?
> Also, maybe we can incorporate the number of global bus locks into
> the formula?  Or does the cost of a bus lock depend also on which
> cache the selected line is in?

Actually, what I have in recent versions of {Q,K}Cachegrind is this:
  CEst = Ir + 10 Bm + 10 L1m + 20 Ge + 100 L2m + 100 LLm

L2m is there for backward compatiblity with old VG versions.
If a given event type is not measured, the formula gets truncated

> I mention this because I remember at some time the LO crew were
> complaining that the cycle count from Callgrind was completely misleading
> due to not taking into account global bus locks.

Hm, ok. The "20 Ge" estimates that locking often can be resolved at the 
L2/L3 level. However, I never did any measurements myself. I think you
mentioned something ...


> J
> (*) eg.  the first version of the memcheck leak checker was written
> during a 4 hour plane-changing-stop at Chicago O'Hare.
> ----------  Forwarded Message  ----------
> Subject: Re: [Fwd: Re: callgrind / profiling bits ? ...]
> Date: Tuesday, December 04, 2012, 12:35:15 am
> From: Michael Meeks <michael.meeks at>
> 	Julian - incidentally - where does is the latest/greatest set of
> fudge-factors for predicting pseudo-cycles live from L1 / L2 misses
> etc. :-) We'd like to use that a spreadsheet we want to use to track
> incremental libreoffice performance (using callgrind of course).
> 	At least from my perspective, I'd -love- to see what impact simple
> day-to-day work has on performance particularly when repeatable: what do
> dozens of changes per day do to the various documents we have ? - I
> suspect slow bit-rot, with steps of improved perf. but ... ;-) perhaps
> publishing the graph will change that.

michael.meeks at  <><, Pseudo Engineer, itinerant idiot

More information about the LibreOffice mailing list