Add support for multi-threaded playback

José Fonseca jose.r.fonseca at gmail.com
Tue Oct 30 08:21:56 PDT 2012


On Tue, Oct 30, 2012 at 1:12 PM, Imre Deak <imre.deak at intel.com> wrote:
> On Tue, 2012-10-30 at 12:24 +0000, José Fonseca wrote:
>> This could be fixed with more careful locking on all the parser
>> internal structures. But I saw no point in pursuing that road. Instead
>> I chose to pass the responsibility of parsing the trace to the thread
>> that executed the last call, which achieves the same more efficiently
>> (no thread switching per call, no mutex locking per call).
>>
>> In short, there is now only one active thread at any single instant.
>> Therefore race conditions are impossible. And for single-threaded
>> traces this gracefully degrades to exactly what we were doing before
>> (i.e., performance for single-threaded traces is exactly the same).
>
> Ok, it is simpler, but of course this way decoding becomes an
> overhead on the retracing thread. I saw a clear performance
> improvement even for single-threaded traces when the decoding was
> done on a separate thread.

Yes, I see now. My measurements were already taken after that commit.

Some maximum limit on the workqueue would be necessary though:
otherwise, if the retracing was slow and couldn't keep up (e.g., OpenGL
in software), we could end up decompressing the whole trace into memory.
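
Something along these lines is what I have in mind -- a rough sketch
only; BoundedQueue and Call are made-up names, not our actual classes:

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Sketch of a size-capped work queue: the decoding thread blocks in
// push() once maxSize calls are pending, so a slow retracer (e.g.,
// OpenGL in software) throttles decompression instead of letting the
// whole trace accumulate in memory.
template <typename Call>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t maxSize) : maxSize(maxSize) {}

    void push(Call call) {
        std::unique_lock<std::mutex> lock(mutex);
        notFull.wait(lock, [this] { return queue.size() < maxSize; });
        queue.push_back(std::move(call));
        notEmpty.notify_one();
    }

    Call pop() {
        std::unique_lock<std::mutex> lock(mutex);
        notEmpty.wait(lock, [this] { return !queue.empty(); });
        Call call = std::move(queue.front());
        queue.pop_front();
        notFull.notify_one();
        return call;
    }

private:
    std::mutex mutex;
    std::condition_variable notFull;
    std::condition_variable notEmpty;
    std::deque<Call> queue;
    const std::size_t maxSize;
};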

>> >> - make it faster -- the parsing is done in the thread that is
>> >> executing, so there is less thread switching.
>> >
>> > I'd have to think more about how this improves things. Afaics, on
>> > multi-core at least there shouldn't be much task switching, except
>> > at the above synchronization points.
>>
>> Whereas before it was necessary to lock a mutex on every call, now a
>> mutex is only locked whenever thread_id changes.
>>
>> That was particularly noticeable for applications that use
>> immediate-mode vertex data (glVertex and friends), which produce a
>> lot of calls.
>
> At least it shouldn't have locked per call. The workqueue runner put all
> calls decoded so far on a separate local list and executed the calls
> without holding the lock. Perhaps what you saw was frequent task
> switching in the traced application, since forward decoding happened
> only until the next thread switch. But that is not a hard requirement
> and we could easily let the decoding continue (maybe with a
> per-workqueue lock).

I see it now. Yes, you're right.
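
To make the hand-off scheme concrete, here is a rough sketch of what I
implemented -- Call, Parser, and retraceCall are made-up placeholders,
not the real retracer API:

#include <condition_variable>
#include <mutex>

// Hypothetical placeholders -- not apitrace's actual classes:
struct Call { unsigned threadId; /* plus the decoded function/args */ };
struct Parser { Call *parseCall(); /* heap Call, nullptr at end */ };
void retraceCall(Call &call);      // executes one decoded call

// Shared "baton": exactly one retracing thread owns it at any instant.
struct Baton {
    std::mutex mutex;
    std::condition_variable cond;
    unsigned activeThread = 0;     // thread currently allowed to run
    Call *pendingCall = nullptr;   // call parsed by the previous owner
    bool done = false;
};

void retraceThread(Baton &baton, unsigned self, Parser &parser) {
    std::unique_lock<std::mutex> lock(baton.mutex);
    for (;;) {
        baton.cond.wait(lock, [&] {
            return baton.done || baton.activeThread == self;
        });
        if (baton.done)
            return;
        Call *call = baton.pendingCall;
        baton.pendingCall = nullptr;
        lock.unlock();
        if (!call)
            call = parser.parseCall();  // startup: nothing handed over
        // As long as the trace stays on this thread, parse and execute
        // with no locking at all -- the single-threaded fast path.
        while (call && call->threadId == self) {
            retraceCall(*call);
            delete call;
            call = parser.parseCall();
        }
        lock.lock();
        if (!call) {
            baton.done = true;          // end of trace: wake everyone
        } else {
            baton.pendingCall = call;   // hand off to the call's owner
            baton.activeThread = call->threadId;
        }
        baton.cond.notify_all();
    }
}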

>> Furthermore, there is no change whatsoever when a trace is
>> single-threaded (parsing and retracing happen on the main thread just
>> as before), so multithreaded retracing is now always on -- there is
>> no longer an option to enable it.
>
> Well, it works now, so it's good.

I now better appreciate the advantages of your approach.  My
subsequent changes were the shortest path I saw to getting this into a
mergeable state (working reliably, with no regressions), but we can
always revisit this later.

> The decoding can be moved to a separate
> thread later if performance becomes an issue.

We could put just the decompression in a separate thread. Zack actually
did this once, on
https://github.com/apitrace/apitrace/commits/threaded-trace , but that
was before we switched compression from zlib to snappy (which
eliminated most of the CPU overhead), and before we added random access
to traces (for the GUI), which would make this a bit harder now.
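
Roughly, I imagine something like this sketch, keeping decompression
one chunk ahead of the parser -- readCompressedChunk, decompressChunk,
and parseChunk are made-up placeholders, not our actual functions:

#include <future>
#include <istream>
#include <vector>

using Chunk = std::vector<char>;

Chunk readCompressedChunk(std::istream &in);  // placeholder: one block,
                                              // empty at end of trace
Chunk decompressChunk(Chunk compressed);      // placeholder: e.g. snappy
void parseChunk(const Chunk &chunk);          // placeholder: decode calls

// While the main thread parses the current chunk, a worker decompresses
// the next one. At most one worker is alive at a time, so only it ever
// touches the input stream.
void parseTrace(std::istream &in) {
    auto fetch = [&in] {
        return decompressChunk(readCompressedChunk(in));
    };
    std::future<Chunk> next = std::async(std::launch::async, fetch);
    for (;;) {
        Chunk current = next.get();
        if (current.empty())
            break;                                    // end of trace
        next = std::async(std::launch::async, fetch); // prefetch next
        parseChunk(current);                          // parse meanwhile
    }
}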

Jose

