The next couple of days I'll work on the automation of producing a graphing of this "magic number". Also I'm doing repeat testing now to get a feeling of the kind of spread in the numbers that cachegrind is producing. Stephan <div class="gmail_extra"> <div class="gmail_quote">2012/12/5 Michael Meeks <<a href="mailto:michael.meeks@suse.com" target="_blank">michael.meeks@suse.com</a>> <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Hi guys, Forwarding Josef's rather interesting mail (with permission) to the dev list - since it's generally useful :-) Stephan is working on plugging a nice callgrinding script into tinderbox building so we can build an automated performance tracker that will catch even small - 0.5% type performance regressions which can accumulate over time - and do it in a reproducible way, independent of the underlying hardware. Thanks so much for working on this Stephan; it seems the punch-line formula for the magic number should be: > Actually, what I have in recent versions of {Q,K}Cachegrind is this: > CEst = Ir + 10 Bm + 10 L1m + 20 Ge + 100 L2m + 100 LLm ATB ! Michael. -------- Forwarded Message -------- Hi, Am 04.12.2012 15:07, schrieb Julian Seward: > > Josef, > Gruesse. See below .. > > Michael: > Josef is the expert on this, so forwarding it to him. > > It used to be "1 * #insns + 10 * #L1 misses + 100 * #L2 misses", > but I wouldn't put too much reliance on it. It was just something > that (IIRC!) NickN and I kludged up during a coffee break probably > about a decade ago at <a href="http://cl.cam.ac.uk" target="_blank">cl.cam.ac.uk</a>. Like so many other bits of > Valgrind :-) (*) AFAIK, Cachegrind always just printed out the raw event numbers. Also, nowhere in Callgrind you will find that magic formula ;-) I think I actually was the one who came up with that formula as a rough "Cycle Estimation" in KCachegrind ;-) Actually, doing a somehow accurate time estimation is quite tricky. The rule of thumb should be: If you come up with an optimization reducing miss numbers (and also the time estimation formula above), always also check against real time. In general, any estimation using cache metrics gets more accurate the worse the cache behavior of the application, as time needed for memory accesses is more predictable and hides any smaller latencies within a core, especially with out-of-order execution, and even ARM application cores are out-of-order nowadays (Cortex A9/A15). The most accurate thing next to real measurement would be cycle-accurate simulation, which of course is out of scope. There is some nice work on a simpler, yet quite accurate processor timing model using so-called "interval simulation", as implemented in the Sniper simulator (see <a href="http://snipersim.org" target="_blank">http://snipersim.org</a>). However, it's roughly an order of magnitude slower than Cachegrind/Callgrind. The biggest issues with the simple formulas above are: (1) Hardware prefetching is not taken into account, which often allows streaming of data at maximum bandwidth from memory, (2) Latency hiding effects are not included, e.g. it easily can happen that with lots of memory accesses, it does not matter if the code actually was compiled with -O3 or -O0. Further, it can make a huge difference whether two consecutive memory accesses depend on each other or not (ie. one hiding the latency of the other or not). (3) Saturation effects are not taken into account. E.g. if all cores on a multicore chip do memory accesses at the same time, the limited bandwidth going off-chip will slow down each of them. Thus, the accuracy of the simple formula very much depends on an "average" application behavior, in the hope that the difficult effects mentioned above do not play too much of a role each. BTW, one can argue whether L1/LL misses on write should appear in the formula at all. Writes usually are done by the hardware in the background. However, if there is a memcpy operation doing a lot of consecutive writes, write buffers will fill up, and will throttle execution in the end. > That said, I am sure it got checked by Josef :-) Despite above issues, I still think that it is useful to combine the cache simulation results into one value. It makes it easier to get an overview regarding the cache behavior. The thing is: too make time estimation more accurate, you need to extend the model, which results in slower simulation. I have some ideas to make the cache model more accurate for time estimation, but I am not sure it is worth it (real measurements are faster anyway). E.g. there is an optional hardware prefetch simulation in Callgrind. We can introduce a new event "LL miss detected by prefetcher", and give this e.g. a much lower penalty, depending on the bandwidth your hardware can do: e.g. if you can do 1GB/s on a 1GHz processor, this means 1 bytes on average per clock, thus with a 64byte line size one line can be fetched every 64 cycles. As every LL miss will also be a L1 miss (with penalty of 10), you would go for around 50 for the penalty for the new prefetchable memory access event. For latency effects, it gets more complicated as you want to check if there is a dependency between memory accesses, and do something similar to above mentioned interval simulation. For the saturation effects, one needs to take simultaneous execution of multiple threads into account. > Josef: is there a more recent version of the pseudo-cycles formula? > Also, maybe we can incorporate the number of global bus locks into > the formula? Or does the cost of a bus lock depend also on which > cache the selected line is in? Actually, what I have in recent versions of {Q,K}Cachegrind is this: CEst = Ir + 10 Bm + 10 L1m + 20 Ge + 100 L2m + 100 LLm L2m is there for backward compatiblity with old VG versions. If a given event type is not measured, the formula gets truncated accordingly. > I mention this because I remember at some time the LO crew were > complaining that the cycle count from Callgrind was completely misleading > due to not taking into account global bus locks. Hm, ok. The "20 Ge" estimates that locking often can be resolved at the L2/L3 level. However, I never did any measurements myself. I think you mentioned something ... Josef > > J > > (*) eg. the first version of the memcheck leak checker was written > during a 4 hour plane-changing-stop at Chicago O'Hare. > > ---------- Forwarded Message ---------- > > Subject: Re: [Fwd: Re: callgrind / profiling bits ? ...] > Date: Tuesday, December 04, 2012, 12:35:15 am > From: Michael Meeks <<a href="mailto:michael.meeks@suse.com">michael.meeks@suse.com</a>> ... > Julian - incidentally - where does is the latest/greatest set of > fudge-factors for predicting pseudo-cycles live from L1 / L2 misses > etc. :-) We'd like to use that a spreadsheet we want to use to track > incremental libreoffice performance (using callgrind of course). ... > At least from my perspective, I'd -love- to see what impact simple > day-to-day work has on performance particularly when repeatable: what do > dozens of changes per day do to the various documents we have ? - I > suspect slow bit-rot, with steps of improved perf. but ... ;-) perhaps > publishing the graph will change that. -- <a href="mailto:michael.meeks@suse.com">michael.meeks@suse.com</a> <><, Pseudo Engineer, itinerant idiot </blockquote></div> </div>