<div class="gmail_quote">On Wed, Aug 24, 2011 at 12:00 PM, Zack Rusin <<a href="mailto:zack@kde.org">zack@kde.org</a>> wrote: <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"> Long term is would probably make sense to just use: <a href="http://www.fftw.org/cycle.h" target="_blank">http://www.fftw.org/cycle.h</a> </blockquote><div>Yup, sounds good. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> but my main concern for CPU timings is that compression and disk io will skew them a bit. </blockquote><div>Yes, that is possible. I was thinking of reading the timestamp immediately before the call, and then immediately after. Something like: diff --git a/trace_writer.cpp b/trace_writer.cpp index af8d62f..56aa4fb 100644 --- a/trace_writer.cpp +++ b/trace_writer.cpp @@ -33,6 +33,7 @@ #include <zlib.h> #include "os.hpp" +#include "tsc.h" #include "trace_writer.hpp" #include "trace_format.hpp" @@ -206,12 +207,19 @@ void Writer::endEnter(void) { _writeByte(Trace::CALL_END); gzflush(g_gzFile, Z_SYNC_FLUSH); OS::ReleaseMutex(); + //collect call begin timing here (TLS protects us, don't need mutex) + startTimestamp = getTSC(); } void Writer::beginLeave(unsigned call) { + //collect call end timing here + uint64_t endTimestamp = getTSC(); OS::AcquireMutex(); _writeByte(Trace::EVENT_LEAVE); _writeUInt(call); + _writeByte(Trace::CALL_TIMING); + _writeUInt(startTimestamp); + _writeUInt(endTimestamp); } void Writer::endLeave(void) { I was hoping to use thread local storage for startTimestamp. Unfortunately the application I'm tracing was not compiled with -std=gnu++0x, and the runtime linker complains that it can't find the startTimestamp symbol. So I may have to move the timestamp reads into the python generator code and create a Writer::WriteTiming(a,b), or something close to that, so that the startTimestamp variable is on the stack. That's probably more multi-platform friendly anyway. By doing the collection like above, you're effectively only timing the call, not the compression. Granted, the scheduler may still penalize you because you're eating up CPU cycles for the compression and IO. And your overall frame time will be larger (as you mention below). </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">So I think we need to figure out the use case for this stuff. For example what I'd love to know: - what exactly it is that we're trying to measure, </blockquote><div> In my case, trying to measure call times to see if there's gross differences between driver versions. I'll admit it's kinda specialized, and most folks may not find it useful. I'll take the opportunity to thank you guys for getting me this far tho. :) </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">- what's the performance of tracing while fetching both cpu and gpu time for every call, </blockquote><div> </div><div>The RDTSC instruction itself takes a clock cycle or two of cpu time. In total, after transferring the registers to memory, the latency is on the order of nanoseconds. 'Course, there's still definite drawbacks to RDTSC, especially WRT dynamic frequency scaling and sleep states. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> - how should the data be stored so that the tools can do the most out of them, for example as the TODO.markdown mentioned we'd really love to have a "Timeline view" ( <a href="http://msdn.microsoft.com/en-us/library/ee417180%28v=vs.85%29.aspx" target="_blank">http://msdn.microsoft.com/en-us/library/ee417180(v=vs.85).aspx</a> ) I haven't yet looked at how the data would need to be stored for this to happen. <div class="im"> > My admittedly lofty goal was to trace an application, capturing the TSC > data. Then run the retrace, also capturing TSC data. One could then diff > frames/calls between the trace and retrace to see CPU utliization > differences. </div>That sounds like a great idea. The issue is that between the app actually generating the data and tracing compressing the data (compression is slower than decompression), the tracing timings will always be slower than the retracing. Also the disk io read/write speed will heavily influence both. </blockquote><div> Has there been discussion on pushing the writer out to a dedicated thread? Or is there too much of a risk that a tracing thread would outrun the writer? </div><div> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> So I think there's a lot of open questions. Most likely initially we need to push some implementation of this stuff to a branch and start experimenting with it. z </blockquote></div> Experiments are fun. :) Chris -- Oh, meltdown... It's one of these annoying buzzwords. We prefer to call it an unrequested fission surplus. -- Mr. Burns, The Simpsons