[Spice-devel] Loop, glib and spice-server

Frediano Ziglio fziglio at redhat.com
Mon Jan 25 10:55:14 PST 2016


> 
> Hi,
>   I spent quite some time trying to understand the main loop of spice-server.
> (Perhaps I'm tired of rebasing the glib loop patches after 250 patches!)
> 
> There were quite a few open questions about the loop and the remaining
> patches:
> - why does the new code using glib do many more iterations?
> - where do the client blocking problems come from? See [1], [2] and [3].
> - why is all that polling code in red_worker needed?
> 
> Glib and iterations
> -------------------
> Using strace (on poll and other syscalls) I tried to understand where all
> these iterations come from.
> The reason is multithreading (and some problems in Glib).
> In our implementation multithreading is simple: we only have one thread!
> But Glib supports multithreading! So assume one thread is doing the poll
> and another wants to add a file descriptor or change a timeout: how do you
> stop the first thread and make it consider the modified information?
> The solution (which I even coded myself for similar things!) is adding a
> file descriptor used to wake up the other thread. You can see quite clearly
> from strace that spice-server-replay was woken up by file descriptor 10,
> which turned out to be an eventfd.
> These are the events that should make the poll stop:
> - timeout changes;
> - file descriptor added or removed;
> - file descriptor event types (like read and/or write) changed;
> - other special events (loop exit, explicit wakeups, signals).
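>
> For illustration, here is a minimal standalone program (my own sketch, not
> spice-server code) that shows the mechanism: one thread runs the main loop
> while a second thread attaches a timeout source. Run it under
> "strace -f -e poll,write" and you can see the write to the wakeup eventfd
> that interrupts the poll:
>
> #include <glib.h>
>
> static GMainLoop *loop;
>
> static gboolean on_timeout(gpointer data)
> {
>     g_main_loop_quit(loop);
>     return G_SOURCE_REMOVE;
> }
>
> static gpointer attacher(gpointer data)
> {
>     GMainContext *ctx = data;
>
>     g_usleep(100 * 1000);  /* let the main thread block in poll() */
>
>     /* attaching a source from another thread: Glib has to interrupt
>      * the running poll(), which it does by writing to its internal
>      * wakeup eventfd */
>     GSource *src = g_timeout_source_new(100);
>     g_source_set_callback(src, on_timeout, NULL, NULL);
>     g_source_attach(src, ctx);
>     g_source_unref(src);
>     return NULL;
> }
>
> int main(void)
> {
>     GMainContext *ctx = g_main_context_default();
>     GThread *thread;
>
>     loop = g_main_loop_new(ctx, FALSE);
>     thread = g_thread_new("attacher", attacher, ctx);
>     g_main_loop_run(loop);
>     g_thread_join(thread);
>     g_main_loop_unref(loop);
>     return 0;
> }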
> But which of the above events actually happen in spice-server?
> Our current implementation of timers/watches adds and removes event sources
> on many changes (a timer restarted, an event mask changed, or even set to
> the same value). It turns out, however, that these events do not happen
> that frequently.
> It also turns out that, to handle loop recursion, Glib does a lot of file
> descriptor addition/removal during the loop. This happens every time, so
> every poll call is done twice.
> I then added some debugging to record when a loop iteration starts
> (worker_source_prepare is called), when the poll happens and when the
> eventfd is cleared.
> What usually happened was:
> - poll, one of our events was triggered
> - prepare called
> - poll, eventfd triggered
> - prepare called
> - poll, eventfd triggered
> - eventfd cleared
> So each event needed two prepare calls and three polls (that is, three
> Glib iterations).
> One thing to consider: we run everything in a single thread, so why do
> file descriptor changes behave as if they were waking the poll even when
> the poll is surely not running? The reason is that, despite this comment
> in Glib:
> 
> /* Now wake up the main loop if it is waiting in the poll() */
> 
> the wakeup line:
> 
> g_wakeup_signal (context->wakeup);
> 
> is executed most of the time, even when nobody is waiting in poll()!
> Second consideration: why is the eventfd cleared the second time the event
> is triggered and not the first time? The reason is that in these lines
> 
> if (context->wake_up_rec.revents)
>   g_wakeup_acknowledge (context->wakeup);
> 
> revents is not yet set, so you have to wait for another poll round!
> 
> So how to fix these Glib issues? I wrote this patch (still to be sent):
> http://pastebin.com/XT0W7xnQ. It works quite well on my Fedora system
> (I replaced the system library!).
> In my tests (using spice-server with the replay utility) the iterations
> went from 700 to 350!
> 
> 
> Spice-server loop
> -----------------
> So why does the client sometimes block?
> First let's see the events the worker loop is handling:
> 1- data from the dispatcher (this includes wakeups from the guest);
> 2- data from the channels (so from clients);
> 3- timeouts on streams;
> 4- other timeouts (like pings);
> 5- polling since the ring was last seen empty (up to 200 iterations with a
>    10 millisecond poll; see the sketch below).
> About this last point: why is it needed?
> Also note that the iterations are counted, so the 200 retries can expire
> quite fast if the guest (for instance) keeps sending wakeups (see [2])!
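>
> A rough sketch of that counter-based polling (the constant names follow
> red_worker.c as far as I remember; the field names worker->ring_empty_count
> and worker->event_timeout are my shorthand, not the literal code):
>
> #define CMD_RING_POLL_TIMEOUT 10   /* milliseconds */
> #define CMD_RING_POLL_RETRIES 200
>
> if (ring_is_empty) {
>     if (worker->ring_empty_count < CMD_RING_POLL_RETRIES) {
>         /* keep waking up every 10 ms to re-check the command ring */
>         worker->ring_empty_count++;
>         worker->event_timeout = MIN(worker->event_timeout,
>                                     CMD_RING_POLL_TIMEOUT);
>     }
>     /* once the counter runs out, nothing re-checks the ring until
>      * some other event fires */
> } else {
>     worker->ring_empty_count = 0;
> }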
> 
> What happens if:
> - the client does not read data for a while and the channel pipes are full
>   (network buffers full and pipe items maxed out);
> - the guest fills the command ring, since command processing is not
>   running due to the first condition;
> - the guest sends a lot of wakeups (see [2])?
> Simple: the loop stops entirely! Let's check the event sources above:
> 1- unless something special happens (like the user resizing the client
>    window), no data is sent to the dispatcher: the guest is not appending
>    to the ring, so wakeups are not sent;
> 2- the channels have no reason to send data;
> 3- there are no streams, so no stream timeouts;
> 4- these take quite long to fire, and we can't wait minutes;
> 5- the counter ran out, so the polling stopped.
> What if, after a while, the client reads data and the network buffer
> becomes available again? Nothing! There is no check for this (1 and 2 only
> react to data received).
> 
> Happily, I managed to reproduce this condition with this command:
> 
> xzcat win7-boot-record2.xz | SPICE_DEBUG_LEVEL=3 strace -o trace.txt -f -e
> poll,close,execve,clone,write,writev,read,recv -s 100 ./libtool
> --mode=execute ./server/tests/spice-server-replay -p 1235 -c 'remote-viewer
> spice://localhost:1235' -w - -C 5
> 
> This command may be helpful for anybody who wants to try it: the server
> hangs when large portions of the screen are updated (this was a recording
> of a machine booting, as the name suggests).
> 
> So why is the polling helpful? Well... it tries to prevent these
> conditions from happening! However, as I and others demonstrated, they are
> still possible!
>
> How to fix this? My current fix sets event_timeout when we exit from
> red_process_display, both from the last return inside the loop (see also
> [1]) and from the final return.
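>
> A minimal sketch of that idea, assuming an is_stuck() helper like in the
> pseudocode further below (the helper name and the exact condition are my
> shorthand, not the literal patch in [1]):
>
> /* at the exits of red_process_display(): if we bailed out while
>  * unable to make progress, arm a short timeout so the loop will
>  * re-check soon instead of sleeping indefinitely */
> if (is_stuck(worker)) {
>     worker->event_timeout = MIN(worker->event_timeout, 10 /* ms */);
> }
> return n;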
> 
> What would, in my opinion, be a better solution?
> I think that instead of polling we should change the code to also wait for
> the client sockets to become writable again (see my "implement no pushing
> on RedWorker" patch on the ML for a possible solution, see [3]). This will
> solve most of the problem, but still consider this scenario:
> - we are in the previous condition (no wakeups, ring full, channel pipes
>   full, network send buffers full);
> - the client receives data and buffers become available;
> - the poll wakes up (now we detect the above condition):
>   - we try to process commands but the channel pipes are still full;
>   - we send all pending data to the network, so we stop polling for
>     writability;
> - the poll runs again.
> Let's check the poll conditions above:
> 1- the ring is still full, so no wakeups;
> 2- no data from the channels;
> 3- no streams;
> 4- far from firing;
> 5- the counters have already expired;
> 6- (new, writing) no more reason to poll for writability.
> We are stuck again. I think this can easily be fixed with:
> - during worker_source_prepare (before poll):
>   if (worker->was_stuck && !is_stuck(worker))
>     return TRUE;
>   so the event is triggered and processing resumes
> - during worker_source_dispatch (after poll):
>   worker->was_stuck = is_stuck(worker);
>   if (!worker->was_stuck)
>     red_process_XXX(...);
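>
> A fuller sketch of the same idea as a custom GSource (here RedWorker,
> is_stuck() and red_process_commands() are minimal stand-ins so the example
> compiles; the real ones live in spice-server and look different):
>
> #include <glib.h>
>
> typedef struct {
>     gboolean was_stuck;
> } RedWorker;
>
> /* TRUE when we cannot make progress: command ring full, channel
>  * pipes full, network send buffers full */
> static gboolean is_stuck(RedWorker *worker)
> {
>     return FALSE;  /* stand-in */
> }
>
> /* stand-in for red_process_display()/red_process_cursor() */
> static void red_process_commands(RedWorker *worker)
> {
> }
>
> typedef struct {
>     GSource base;
>     RedWorker *worker;
> } WorkerSource;
>
> static gboolean worker_source_prepare(GSource *source, gint *timeout)
> {
>     WorkerSource *ws = (WorkerSource *) source;
>
>     *timeout = -1;
>     /* if we previously stopped processing because we were stuck and
>      * the condition has cleared meanwhile, report ourselves as ready
>      * so processing resumes instead of waiting for another event */
>     return ws->worker->was_stuck && !is_stuck(ws->worker);
> }
>
> static gboolean worker_source_check(GSource *source)
> {
>     WorkerSource *ws = (WorkerSource *) source;
>
>     return ws->worker->was_stuck && !is_stuck(ws->worker);
> }
>
> static gboolean worker_source_dispatch(GSource *source,
>                                        GSourceFunc callback,
>                                        gpointer user_data)
> {
>     RedWorker *worker = ((WorkerSource *) source)->worker;
>
>     /* remember whether we were stuck; only process commands if not */
>     worker->was_stuck = is_stuck(worker);
>     if (!worker->was_stuck)
>         red_process_commands(worker);
>     return G_SOURCE_CONTINUE;
> }
>
> static GSourceFuncs worker_source_funcs = {
>     worker_source_prepare,
>     worker_source_check,
>     worker_source_dispatch,
>     NULL,
> };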
> 
> I hope I didn't forget anything (it took me quite a while to write this
> mail) and sorry for the possible thousand mistakes.
> 
> Oh... what about iterations before and after all of this?
> Before the glib loop: about 500 iterations;
> after the glib loop: about 700 iterations;
> after all these fixes: about 350 iterations (probably part of the change
> is due to slowdown and iteration collapsing caused by the debugging code).
> 
> Frediano
> 
> [1]
> http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=6131c1942a80c4fb909697f5f4c44fe010fc6f92
> [2]
> http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=aef334e6e8abc629a4603c890a2982d542b627f4
> [3]
> http://cgit.freedesktop.org/~fziglio/spice-server/commit/?h=refactory&id=ad51aeec0913e6d5ecf7e403e96612658f3b13f2

It would be really helpful if someone, instead of digging directly into the
code, asked me about whatever is not clear, so that the explanation becomes
good and understandable enough to end up in documentation/commit logs.

I have a small roadmap that (not in strict order) contains:
- finally, the glib integration
- loop optimizations
- glib patches (just opened https://bugzilla.gnome.org/show_bug.cgi?id=761102)
- using the cork flag to reduce network traffic by decreasing fragmentation
- more automatic tests on the loop (test-loop already exists; it currently
  tests mainly timers, not watches)
- changes to remove most of the polling (usleep calls)
  - handle cursor and display commands at the same time during flushing
  - use direct poll for limited loops
- trying to reproduce some issues automatically
- better handling of the acknowledge window
- automatic bandwidth detection

Frediano

