[RFC] [PATCH 1/2] Add a basic support of logging cpu and gpu time

Zack Rusin zack at kde.org
Wed Aug 24 15:15:58 PDT 2011


On Wednesday, August 24, 2011 03:28:00 PM Chris Fester wrote:
> On Wed, Aug 24, 2011 at 12:00 PM, Zack Rusin <zack at kde.org> wrote:
> > Long term it would probably make sense to just use:
> > 
> > http://www.fftw.org/cycle.h
> 
> Yup, sounds good.
> 
> > but my main concern for CPU timings is that compression and disk io
> > will skew them a bit.
> 
> Yes, that is possible.  I was thinking of reading the timestamp immediately
> before the call, and then immediately after. 

But then you're measuring the CPU time of an asynchronous call to the GPU, 
which probably won't tell you a lot. At best it tells you how quickly the 
driver queues the data, which isn't likely to be strongly related to the 
actual performance. In other words, I'm not sure this particular timing is 
useful. For visualization purposes, the CPU time between swapbuffers calls 
and the CPU utilization between them will tell you a lot more than CPU times 
for each driver call.
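The per-frame timing suggested above could be sketched roughly like this with 
std::chrono; FrameTimer and its usage are illustrative stand-ins, not apitrace 
code:

```cpp
#include <chrono>
#include <cstdio>

// Hedged sketch: measure CPU time between successive swapbuffers calls
// rather than around each individual driver call.  FrameTimer is a
// hypothetical helper, not part of apitrace.
struct FrameTimer {
    std::chrono::steady_clock::time_point last = std::chrono::steady_clock::now();

    // Call once per swapbuffers; returns milliseconds of CPU time spent
    // since the previous swap, i.e. the time the app took to produce
    // the frame that was just presented.
    double tick() {
        auto now = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(now - last).count();
        last = now;
        return ms;
    }
};
```

You'd call `timer.tick()` right after each swapbuffers and log the result, 
which gives you a per-frame CPU cost instead of a per-call one.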

> > example what I'd love to know:
> > 
> > - what exactly it is that we're trying to measure,
> 
> In my case, trying to measure call times to see if there's gross
> differences between driver versions.  I'll admit it's kinda specialized,
> and most folks may not find it useful. 

I think the information is useful, but as mentioned above I don't think it's 
what you'd actually be measuring by checking the CPU time.
For your particular use case, measuring actual GPU times (e.g. using the 
GL_ARB_timer_query extension) during retracing would be a lot more meaningful. 
In fact I think it'd be exactly what you're looking for.
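A rough sketch of what that could look like, assuming a current GL >= 3.3 
context (this is not runnable standalone, and retrace_call() is a hypothetical 
stand-in for whatever dispatches the recorded call):

```cpp
// Hedged sketch of per-call GPU timing with GL_ARB_timer_query during
// retracing.  Requires a current GL context; header names assume a
// typical Linux/Mesa setup.
#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>
#include <cstdint>

// Wraps one retraced call in a GL_TIME_ELAPSED query and returns the
// GPU time in nanoseconds.
uint64_t time_call_on_gpu(void (*retrace_call)()) {
    GLuint query = 0;
    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED, query);
    retrace_call();                       // the call being measured
    glEndQuery(GL_TIME_ELAPSED);

    // Blocks until the GPU has executed the call; a real retracer would
    // keep several queries in flight and read the results a frame late
    // to avoid stalling the pipeline.
    GLuint64 ns = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &ns);
    glDeleteQueries(1, &query);
    return ns;
}
```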

I think that computing timings during tracing could be useful if your 
application is CPU bound: a huge discrepancy between tracing and retracing 
timings would mean that the app is spending too much time creating the data, 
and the biggest deltas would point you to the places in your app that are the 
bottlenecks.

Other than that case I'm not sure computing the timings while tracing is 
particularly useful, because in all other cases it should be a lot better to 
do it while retracing.


> > - what's the performance of tracing while fetching both cpu and gpu time
> > for every call,
> 
> The RDTSC instruction itself takes a clock cycle or two of cpu time.  In
> total, after transferring the registers to memory, the latency is on the
> order of nanoseconds. 'Course, there's still definite drawbacks to RDTSC,
> especially WRT dynamic frequency scaling and sleep states.

That sounds pretty good.
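For reference, the rdtsc read being described could be sketched like this; 
the inline-asm form is x86-only, so this includes a std::chrono fallback to 
stay portable (illustrative only, and subject to the frequency-scaling caveats 
Chris mentions):

```cpp
#include <chrono>
#include <cstdint>

// Hedged sketch of a raw rdtsc read.  Note the caveats from the thread:
// without a constant/invariant TSC, cycle counts don't map cleanly onto
// wall time under frequency scaling and sleep states, and an unserialized
// rdtsc can be reordered around the code being measured.
static inline uint64_t read_tsc() {
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    // Non-x86 fallback: not cycle-accurate, just a monotonic timestamp.
    return (uint64_t)std::chrono::steady_clock::now().time_since_epoch().count();
#endif
}
```

Taking one reading immediately before the traced call and one immediately 
after, and storing the difference, is then just two of these reads per call.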
 
> Has there been discussion on pushing the writer out to a dedicated thread?

Yeah, I've already done that in the threaded-trace branch. Snappy compression 
gives a much, much bigger performance improvement, though, so the compression 
branch should be integrated first.

> Or is there too much of a risk that a tracing thread would outrun the
> writer?

Well, that's often the case. I use a 32 MB ring buffer in the branch, and if 
the ring buffer is full I use mutexes and condition variables to stall the 
rendering thread while the contents of the ring buffer are written out to 
disk.

> Experiments are fun.  :)

Yeah :) I think for this in particular it's going to take a few iterations to 
make it useful, especially figuring out what we need in order to visualize 
that data correctly and nicely.

z


More information about the apitrace mailing list