Add support for multi-threaded playback

Imre Deak imre.deak at intel.com
Tue Oct 30 06:12:36 PDT 2012


On Tue, 2012-10-30 at 12:24 +0000, José Fonseca wrote:
> On Fri, Oct 26, 2012 at 7:53 PM, Imre Deak <imre.deak at intel.com> wrote:
> > Hi,
> >
> > On Fri, 2012-10-26 at 18:49 +0100, José Fonseca wrote:
> >> Imre,
> >>
> >> Thanks for this. I took your mt-trace patches and did some follow-on changes:
> >>
> >> - port it to Mac OS X and Windows (implement and use C++11-like
> >> threading primitives)
> >>
> >> - make threads synchronous, i.e., respect the ordering of the calls in
> >> the trace; otherwise there would be random results and race
> >> conditions, as the locking is not captured in the trace.
> >
> > I thought the workqueue approach already guaranteed this:
> >
> > if (prev_thread_id != call->thread_id) {
> >      if (thread_wq)
> >           thread_wq->flush();
> >      thread_wq = get_work_queue(call->thread_id);
> >      prev_thread_id = call->thread_id;
> > }
> > thread_wq->queue_work(render_work);
> >
> > So if the new call's thread ID is different from the previous one's, we
> > wait until all calls in the previous thread finish. Could you explain
> > what extra synchronization we need here?
> 
> I was getting random heap corruption until I added
> https://github.com/apitrace/apitrace/commit/55c2ece87567ec831b9cf5c2a4a9d30035c093bb
> 
> That is, there were some race conditions due to the parsing and the
> retracing happening concurrently.

Ok, I assumed decoding and performing the actual calls were unrelated,
but it seems that's not true. I tried to run the Android emulator trace
you linked and it indeed didn't work with the workqueue version. I also
tried serializing decoding and executing with the above patch, but it
didn't help.
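For reference, expanded into a self-contained sketch, the main loop of
the workqueue version was roughly the following. (WorkQueue here is a
simplified stand-in for the actual mt-trace class, and parseNextCall()/
retraceCall() are hypothetical stand-ins for the real parser and
retracer entry points.)

#include <condition_variable>
#include <deque>
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <thread>

struct Call { unsigned thread_id; /* plus the decoded call data */ };

// Hypothetical stand-ins for the real parser and retracer entry points.
static Call *parseNextCall() { return nullptr; /* next call, or nullptr at EOF */ }
static void retraceCall(Call *) { /* perform the decoded GL call */ }

// Simplified stand-in for the mt-trace work queue: one worker thread
// executing queued tasks in FIFO order.
class WorkQueue {
public:
    WorkQueue() : worker([this] { run(); }) {}

    ~WorkQueue() {
        { std::lock_guard<std::mutex> l(mtx); exiting = true; }
        cv.notify_all();
        worker.join();
    }

    void queueWork(std::function<void()> w) {
        std::lock_guard<std::mutex> l(mtx);
        tasks.push_back(std::move(w));
        cv.notify_all();
    }

    // Block until every task queued so far has finished executing.
    void flush() {
        std::unique_lock<std::mutex> l(mtx);
        cv.wait(l, [this] { return tasks.empty() && !busy; });
    }

private:
    void run() {
        std::unique_lock<std::mutex> l(mtx);
        for (;;) {
            cv.wait(l, [this] { return exiting || !tasks.empty(); });
            if (tasks.empty())
                return;                     // only empty here when exiting
            std::function<void()> w = std::move(tasks.front());
            tasks.pop_front();
            busy = true;
            l.unlock();
            w();                            // execute outside the lock
            l.lock();
            busy = false;
            cv.notify_all();
        }
    }

    std::mutex mtx;
    std::condition_variable cv;
    std::deque<std::function<void()>> tasks;
    bool busy = false;
    bool exiting = false;
    std::thread worker;  // declared last so it starts after the other members
};

int main() {
    std::map<unsigned, std::unique_ptr<WorkQueue>> queues;
    WorkQueue *threadWq = nullptr;
    unsigned prevThreadId = ~0u;

    while (Call *call = parseNextCall()) {
        if (prevThreadId != call->thread_id) {
            // Thread switch: wait until all calls queued for the previous
            // thread have finished, preserving the order in the trace.
            if (threadWq)
                threadWq->flush();
            std::unique_ptr<WorkQueue> &wq = queues[call->thread_id];
            if (!wq)
                wq.reset(new WorkQueue);
            threadWq = wq.get();
            prevThreadId = call->thread_id;
        }
        threadWq->queueWork([call] { retraceCall(call); delete call; });
    }
    if (threadWq)
        threadWq->flush();
    return 0;
}

The flush() on every thread switch is what serializes the threads
against each other; within one thread the queue preserves FIFO order.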

> This could be fixed with more careful locking on all the parser
> internal structures. But I saw no point in pursuing that road. Instead
> I chose to pass the responsibility of parsing the trace to the thread
> that executed the last call, which achieves the same more efficiently
> (no thread switching per call, no mutex locking per call).
> 
> In short, there is now only one active thread at any single instant.
> Therefore race conditions are impossible. And for single-threaded
> traces this gracefully degrades to exactly what we were doing before
> (i.e., performance for single-threaded traces is exactly the same).
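If I read this right, the scheme is essentially a baton pass between
the retracing threads. A minimal sketch, assuming for simplicity that
the trace's thread ids are known up front (parseNextCall() and
retraceCall() are again hypothetical stand-ins):

#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct Call { unsigned thread_id; /* plus the decoded call data */ };

// Hypothetical stand-ins for the real parser and retracer entry points.
static Call *parseNextCall() { return nullptr; /* next call, or nullptr at EOF */ }
static void retraceCall(Call *) { /* perform the decoded GL call */ }

static std::mutex mtx;
static std::condition_variable cv;
static unsigned currentThread = 0;   // whose turn it is to parse and execute
static Call *pendingCall = nullptr;  // first call of the next thread's batch
static bool done = false;

// Whichever thread owns the baton parses the trace itself and executes
// the calls until the trace switches threads, so only one thread is
// ever active at a time.
static void workerLoop(unsigned myId) {
    std::unique_lock<std::mutex> l(mtx);
    for (;;) {
        cv.wait(l, [&] { return done || currentThread == myId; });
        if (done)
            return;
        Call *call = pendingCall ? pendingCall : parseNextCall();
        pendingCall = nullptr;
        l.unlock();
        while (call && call->thread_id == myId) {  // no locking per call
            retraceCall(call);
            delete call;
            call = parseNextCall();
        }
        l.lock();
        if (!call) {             // end of trace: wake everyone up to exit
            done = true;
            cv.notify_all();
            return;
        }
        pendingCall = call;      // hand the baton to the next thread
        currentThread = call->thread_id;
        cv.notify_all();
    }
}

int main() {
    // Assume the trace's thread ids are known up front for the sketch.
    std::vector<std::thread> workers;
    for (unsigned id : {0u, 1u, 2u})
        workers.emplace_back(workerLoop, id);
    for (std::thread &t : workers)
        t.join();
    return 0;
}

The mutex is only touched on a thread switch; while a thread owns the
baton it parses and executes with no locking at all.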

Ok, it is simpler, but of course this way decoding becomes part of the
retracing thread's overhead. I saw a clear performance improvement even
for single-threaded traces when the decoding happened on a separate
thread.
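What I mean by that is essentially a bounded producer/consumer split: a
decoder thread parses ahead while the main thread executes, so decoding
overlaps with execution even for a single-threaded trace. A minimal
sketch (parseNextCall() and executeCall() are hypothetical stand-ins;
the maxAhead bound is arbitrary):

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

struct Call { /* decoded call data */ };

// Hypothetical stand-ins for the real parser and retracer entry points.
static Call *parseNextCall() { return nullptr; /* next call, or nullptr at EOF */ }
static void executeCall(Call *) { /* perform the decoded GL call */ }

int main() {
    std::mutex mtx;
    std::condition_variable cv;
    std::deque<Call *> decoded;
    bool done = false;
    const std::size_t maxAhead = 1024;  // bound how far decoding runs ahead

    // Decoder thread: parse ahead so decoding overlaps with execution.
    std::thread decoder([&] {
        while (Call *call = parseNextCall()) {
            std::unique_lock<std::mutex> l(mtx);
            cv.wait(l, [&] { return decoded.size() < maxAhead; });
            decoded.push_back(call);
            cv.notify_all();
        }
        std::lock_guard<std::mutex> l(mtx);
        done = true;
        cv.notify_all();
    });

    // Main thread: execute the calls in trace order as they arrive.
    for (;;) {
        Call *call;
        {
            std::unique_lock<std::mutex> l(mtx);
            cv.wait(l, [&] { return !decoded.empty() || done; });
            if (decoded.empty())
                break;
            call = decoded.front();
            decoded.pop_front();
            cv.notify_all();  // wake the decoder if it was throttled
        }
        executeCall(call);
        delete call;
    }
    decoder.join();
    return 0;
}

The bound keeps memory use in check when decoding runs faster than
execution.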

> >> - make it faster -- the parsing is done in the thread that is
> >> executing, so there is less thread switching.
> >
> > I'd have to think more about how this improves things. Afaics, on
> > multi-core at least, there shouldn't be much task switching, except
> > for the above synchronization points.
> 
> Whereas before it was necessary to lock a mutex on every call, now a
> mutex is only locked whenever thread_id changes.
> 
> That was particularly noticeable for applications that use immediate-mode
> vertex data (glVertex and friends), which generate a lot of calls.

At least it shouldn't have locked per call: the workqueue runner moved
all calls decoded so far onto a separate local list and executed them
without holding the lock. Perhaps what you saw was frequent task
switching in the traced application, since forward decoding happened
only until the next thread switch. But that is not a hard requirement,
and we could easily let the decoding continue (maybe with a
per-workqueue lock).
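Concretely, the runner loop was closer to this variant of run() from
the workqueue sketch above -- the lock is taken once per batch of
decoded calls, not once per call:

// Variant of WorkQueue::run() from the sketch above: take everything
// queued so far in one go, so the mutex is locked once per batch.
void run() {
    std::deque<std::function<void()>> local;
    std::unique_lock<std::mutex> l(mtx);
    for (;;) {
        cv.wait(l, [this] { return exiting || !tasks.empty(); });
        if (tasks.empty())
            return;                      // only empty here when exiting
        local.swap(tasks);               // move the whole batch out
        busy = true;
        l.unlock();
        for (std::function<void()> &w : local)
            w();                         // execute without holding the lock
        local.clear();
        l.lock();
        busy = false;
        cv.notify_all();
    }
}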

> Furthermore, there is no change whatsoever when a trace is
> single-threaded (parsing and retracing happen on the main thread just
> as before). Therefore multithreaded retracing is on all the time --
> there's no longer an option to enable it.

Well, it works now, so that's good. The decoding can be moved to a
separate thread later if performance becomes an issue.

--Imre