[Pixman] RFC: Pixman benchmark CPU time measurement

Wed Jun 3 00:51:25 PDT 2015

On Wed, 3 Jun 2015 08:38:25 +0300
Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:

> On Tue, 2 Jun 2015 10:32:35 +0300
> Pekka Paalanen <ppaalanen at gmail.com> wrote:
> 
> > Hi,
> > 
> > most pixman performance benchmarks currently rely on gettime() from
> > test/util.[ch]:
> > - lowlevel-blt-bench
> > - prng-test
> > - radial-perf-test
> > - scaling-bench
> > 
> > Furthermore, affine-bench has its own gettimei() which is essentially
> > gettime() but with uin32_t instead of double.
> > 
> > double
> > gettime (void)
> > {
> > #ifdef HAVE_GETTIMEOFDAY
> >     struct timeval tv;
> > 
> >     gettimeofday (&tv, NULL);
> >     return (double)((int64_t)tv.tv_sec * 1000000 + tv.tv_usec) / 1000000.;
> > #else
> >     return (double)clock() / (double)CLOCKS_PER_SEC;
> > #endif
> > }
> > 
> > This definition of gettime() has several potential drawbacks:
> > 
> > 1. clock() will wrap around often, the manual page warns that in some
> >    cases it wraps around every 72 minutes. As the code in Pixman never
> >    expects a wraparound, this is subtly broken. This is a fallback path
> >    for systems that do not provide gettimeofday(), so it is rarely used
> >    if at all.
> 
> Yes, the clock() fallback is there just to fill the void and make this
> code compile on the systems, which don't have gettimeofday().
> 
> It also measures the CPU time instead of wall-clock time, but this
> can't be helped.
> 
> > 2. gettimeofday() measures wall-clock time, which might not be the best
> >    to measure code performance on a CPU, because all other load in the
> >    system will affect the result. It's probably not a significant
> >    problem on fast systems where you know to run your benchmarks
> >    uncontended.
> 
> In my opinion, wall-clock time is exactly what we need to measure.

Hi Siarhei,

any other reason than the below?

Btw. you don't need to spend time answering this, I already got the
high level feeling I was looking for: nothing to fix here. The rest is
just details for the enthusiastic. :-)

> If you have something else running concurrently in the system, the
> benchmark results are already screwed. Relying on the OS scheduler
> to correctly account the CPU time still does not compensate, for
> example, the cache thrashing effects.

Further down below, you give the same arguments why measuring on rpi is
inaccurate as I would use here to justify using process time (the
kernel drivers and other processes, even if the system is seemingly
idle). :-)

> Basically, with the wall-clock time measurements it is easier to
> notice that something is wrong if the benchmark is not run uncontended.
> It happened to me a few times to accidentally run benchmarks with the
> compilation running in the background. Fortunately, the results were
> very much off and this was sufficient to raise a suspicion :-)
> 
> Maybe trying to measure both wall-clock and CPU time at the same time
> and comparing the results in the end would be an even better safety
> guard?

Would even that detect a contended system if there were multiple CPU
cores?

> > 3. gettimeofday() is not only subject to NTP adjustments but is also
> >    affected by setting the system clock. IOW, this is not a guaranteed
> >    monotonic clock. Again, unlikely to be a problem in most cases, as
> >    benchmarks run long enough to even out NTP skews, but short enough to
> >    avoid accidentally hitting clock changes. (Or so I would assume.)
> 
> Yes, this is unfortunate. But this is hardly the worst problem.

Yeah, none of these are really bad problems.

We could fix just this one by using CLOCK_MONOTONIC when it's
available. I'm not completely sure if that requires a runtime check for
the existense of CLOCK_MONOTONIC particularly. I think
CLOCK_MONOTONIC_RAW would need it.

> > 4. Using double to store an absolute timestamp is suspicious to me. In
> >    the end, you always compute the difference between two timestamps,
> >    and using a floating point type may cause catastrophic cancellation
> >    [1] depending on absolute values. However, [1] also explains that a
> >    double is enough. But, given that we read an arbitrary system clock
> >    whose starting point is unknown (ok, Epoch for the moment), we might
> >    already be getting values too large to maintain the expected
> >    accuracy (for floats, sure; for doubles, who knows?)
> 
> The double type for storing the time in seconds (since Epoch) was
> selected on purpose.
> 
> First, there is no accuracy problem. The time in seconds since
> Epoch is going to fit in a 32-bit value at least until 2038. The
> mantissa in double has 53 bits. Which leaves us with 53 - 32 = 21 bits
> for the fractional part. As such, the granularity of the time
> representation can be estimated as 1 / 2^21 = ~0.12 nanoseconds.
> Which is roughly equivalent to one clock cycle of a hypothetical
> 8 GHz processor. The calculations may be off by one here or there,
> but we are interested in the order of magnitude. This is more than
> enough.

Yes, I totally expected you to have that justification.

However, if we change to CLOCK_MONOTONIC or something, the Epoch is no
longer a given. time_t can be 64-bit, too, on 64-bit arch at least.

> Second, the floating point representation is much better for handling
> timestamps. One reason is that we avoid overflow problems (the
> intermediate calculations may potentially have unsafe multiplications
> by the CPU clock frequency, etc.). And if we want to calculate things
> like the standard deviation, then the floating point representation is
> also much more convenient. Another reason is that the printf style
> functions work nicely with doubles, but have some portability
> inconveniences with 64-bit integers.

As you see below, I would keep using double for time intervals, for
exactly these reasons. Just not for absolute timestamps.

> This was the reason why obtaining the time (in a potentially system
> dependent way) was abstracted in the gettime() function and the rest
> of the code only works with convenient doubles.

I would keep that abstraction for intervals.

> Thanks for mentioning the new gettimei() function. Was there a
> justified reason to introduce it instead of using gettime()? 

I've forgot my own justification. Ben wrote a lengthy email about it
from his side.

When I look at it now myself, the use of uint32_t will make wraparounds
harmless, and still produce the correct result when computing a time
interval, assuming the actual interval is representable in uin32_t. Of
course, only the value from clock() wraps around. gettimei() itself in
the gettimeofday() path can truncate the original value, but it's
harmless for the same reason.

If we fixed gettime() for clock() wraparounds rather than ignoring them,
I suppose there would be no reason to have gettimei().

Ben?

Or if systems without gettimeofday() are rare enough, why wouldn't we
just replace the clock() call with a 'return 0'? On affected systems,
the benchmarks would then reliably hang (busyloop), rather than hang
rarely. Or just abort(), because benchmarking is meaningless then?

> > I would propose the following:
> > 
> > - runtime clock selection with this priority order:
> > 	1. clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
> > 	2. getrusage(RUSAGE_SELF) -> rusage.ru_utime (user time)
> > 	3. gettimeofday()
> > 	4. clock()
> >   Naturally with build time checks, too. For 3 and 4 would print a
> >   warning about inaccurate measurements. clock_gettime(CLOCK_MONOTONIC)
> >   is not in the list because I would assume getrusage() is more widely
> >   available and I'd like to use process time before wall-clock delta.
> > 
> > - A separate void init_gettime(void) for choosing the clock.
> > 
> > - void gettime(struct timespec *ts) for reading the clock.
> > 
> > - double elapsed(const struct timespec *begin, const struct timespec
> >   *end) for getting the elapsed time in seconds.
> 
> This API looks excessively complicated to me.

Only if you keep the assumption that using a double for absolute
timestamps is always fine. You always compute time intervals anyway.

This would even give a nice common place to detect clock wraparounds or
going backwards so you'd be guaranteed to never get a negative interval
length.

If we need runtime detection of available clocks, then having
init_gettime() would be useful to detect once instead of per OpenMP
thread.

Yes, most of this is just me being pedantic.

> > In my experiments on the Raspberry Pi 1 [2], it seems to me that
> > clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is the most accurate CPU time
> > measurement, getrusage() being possibly rounded to HZ ticks. If neither
> > is available, I'll just forget about accuracy and get whatever we can
> > get, essentially what the current gettime() does but without the
> > floating point difference and the wraparound accounted for (at least
> > avoid negative elapsed time, which might break algorithms).
> > 
> > What would you think about this scheme?
> > 
> > I am not making a promise to implement this any time soon, I would just
> > like to hear a few opinions whether it is worth even considering.
> > 
> > This is also like a public note for myself, in case I need to think
> > about these again. My difficulties with benchmarking Pixman on the Rpi1
> > is what prompted this, but this wouldn't solve most of the problems I
> > have there.
> 
> I'm afraid that focusing on the time measurement accuracy can't
> provide any real practical improvements. Even if it is measured
> very accurately (and I believe that we are already doing it good
> enough), the actual running time is non-deterministic and depends
> on many factors in the modern multitasking operating systems.

Which is why I'd like to use process time, not wall-clock.

> One major problem at least on ARM devices is the fragmentation of
> physical memory pages. If you allocate a memory buffer, which is
> contiguous in the virtual address space, this does not mean that
> this buffer is also contiguous in the physical memory. Moreover,
> we have no idea how to get and take into account this information
> from the userspace. The actual layout of the buffers in the
> physical memory does affect performance because the CPU data caches
> have limited associativity and the DRAM is organized in banks.
> Using huge pages can mitigate the issue, but this feature is not
> in a very good shape on ARM (huge pages are only supported in LPAE
> mode, but LPAE is not supported on older processors and uses 3-level
> page tables instead of 2-level page tables, which makes it somewhat
> slower). Still we can introduce huge pages support in the pixman
> benchmark programs. It will help on certain devices with certain
> kernels.
> 
> Also the USB hardware is not great in the Raspberry Pi and it was
> generating tons of interrupts at least some time ago. This parasitic
> activity can interfere with the running tasks and make the benchmark
> results differ across multiple runs. You can watch the counters
> in /proc/interrupts to see if this is still the case.
> 
> Changing one part of the code and recompiling the library may change
> relative alignments of many unrelated code parts. And the CPU
> instruction cache also has limited associativity.
> 
> Measuring small differences is going to be always problematic because
> of the less than perfect signal/noise ratio.

Yes indeed.

I might revisit this after improving my benchmark harness.

Btw. the memory fragmentation is a good clue, maybe that could explain
why we see slow changes (over hours) in performance, and why different
invocations of the same binary e.g. lowlevel-blt-bench give so
different results.

Thanks,
pq