[Spice-devel] [PATCH 8/9] worker: make sure we dispatch after releasing items
Jonathon Jongsma
jjongsma at redhat.com
Fri Dec 18 14:59:12 PST 2015
On Fri, 2015-12-18 at 06:10 -0500, Frediano Ziglio wrote:
> > Hi
> >
> > ----- Original Message -----
> > > On Wed, 2015-12-16 at 18:16 -0500, Marc-André Lureau wrote:
> > > > Hi
> > > >
> > > > ----- Original Message -----
> > > > I actually think that was it. The worker thread may currently block
> > > > because
> > > > the client is too slow. But then, when we release item, the pipe is
> > > > going
> > > > to
> > > > be smaller hopefully, and thus we can try to dispatch again without
> > > > waiting
> > > > for the timeout. (waiting for timers can really degrade performances in
> > > > some
> > > > cases, like you see today with the qemu virgl timer, but that's another
> > > > story where I can show you figures when we merge gl scanounts in spice
> > > > ;)
> > >
> > > Then I'm still confused. In the old loop, we had a very explicit sequence
> > > of
> > > behavior. For each loop:
> > >
> > > 1 calculate timeout for poll() function
> > > 2 poll() fds with timeout from step 1
> > > 3 service expired timers
> > > 4 handle fd watch events
> > > 5 process cursor commands from guest
> > > 6 process display commands from guest
> > > 7 push queued messages to clients
> > > 8 GOTO 1
> > >
> > > After pushing the queued messages to the client, we did not go back and
> > > handle
> > > cursor and display commands again. We went back to the top of the loop and
> > > did a
> > > poll with timeout again. So I fail to see why the glib loop needs to short
> > > -circuit the timeout and return to step 5 when the old design didn't.
> > >
> > > Or is this change unrelated to the new glib loop and the worker has simply
> > > always been too slow even with the old loop implementation?
> > >
> >
> > I wouldn't say "too slow", and I wouldn't try to mimic the old code either.
> > For me the goal is simply to process the pipe/command as fast as possible
> > and if possible, get rid of timeouts.
>
> I can understand that not mimic can be better but at the moment this is
> causing
> a bit of regressions and this patch was trying to avoid some (I think).
>
> I looked more deeply at glib code. The sources are stored (appended) in
> multiple
> list (each for every priority and given at the moment we are using a single
> priority which was similar to former code anyway). Considering that the
> processing
> code is registered at worker initialization and that watches are recreated
> when
> event masks are updated this possibly lead the watched to be after the process
> leading the loop to be:
>
> 1 calculate timeout for poll() function
> 2 poll() fds with timeout from step 1
> 3 process cursor commands from guest
> 4 process display commands from guest
> 5 push queued messages to clients
> 6 service expired timers
> 7 handle fd watch events
> 8 GOTO 1
> (not sure about timers but are not that frequent).
> This obviously make stuff worse as after step 7 you have to wait the poll (2)
> to get commands processed again.
> I'm not saying that the patch address also some other issues but that the glib
> code
> actually make this patch more useful (that is glib have regressions).
>
> On a more wide looking. What are this (and possibly "worker: avoid server
> hanging
> when no client are connected.") patch trying to fix?
> Basically trying to handle slow clients. On fast clients you don't have many
> issues.
> Pipe queue is (or should be) quite small and all commands are processed. But
> on slow
> clients we process till the queue reach a given size. I honestly don't have
> all
> knowledge on what happen next. When you do a push on RedChannel you try to
> send as
> many items till network queue is full. If the client is just not processing
> you can
> only add items to the queue. After a while the limit is reached and to avoid
> queuing
> indefinitely you stop processing commands. Now I would have some questions:
> - what happen in the guest? Just after a while the screen get no more updates?
> And if the client ask for an update?
> - does item collapsing occurs only if client is not connected? I think that
> what should
> happen is that if there are too items items should be collapsed (reduced
> drawing
> on surfaces) and processing should continue.
> - what happen if client keep sending request that trigger appending items but
> avoid
> the read? Can a DoS be triggered?
>
> Frediano
So I did some very simplistic testing here. Basically I looked at the
red_process_display() function to see how often we encountered a full pipe queue
(since that is the situation that Marc-Andre thinks this patch is trying to
improve). I was curious how often this happened, so I just instrumented the code
to count how many times we called red_process_display() and how many of those
times we returned from this function due to a full queue (rather than because
there were no more commands to process). Here's the patch:
diff --git a/server/red-worker.c b/server/red-worker.c
index dfaf4ba..dec635d 100644
--- a/server/red-worker.c
+++ b/server/red-worker.c
@@ -229,7 +235,10 @@ static int red_process_display(RedWorker *worker, uint32_t
max_pipe_size, int *r
QXLCommandExt ext_cmd;
int n = 0;
uint64_t start = spice_get_monotonic_time_ns();
+ static int total = 0;
+ static int full = 0;
+ total++;
if (!worker->running) {
*ring_is_empty = TRUE;
return n;
@@ -335,6 +344,9 @@ static int red_process_display(RedWorker *worker, uint32_t
max_pipe_size, int *r
return n;
}
}
+ full++;
+ fprintf(stderr, "display channel exceeded max pipe size: %i/%i (%g)\n",
+ full, total, (double)full / total * 100.0);
return n;
}
This was my procedure:
- Boot RHEL7 guest
- wait for boot sequence to finish
- log into GNOME
- start web browser, open a page
- open another page in another tab
- open video player and play video in full screen
- enable second display
- resize second display
- disable second display
- exit video player
- shut down guest
I did this same test with the current master branch, with the glib loop patch
applied, and also with this patch on top of the glib loop patch.
What I noticed is that during the boot sequence and the shutdown sequence (when
we're in a VT and text is scrolling up the screen), we hit the queue limit quite
frequently. In fact, nearly 30% of the time that we call red_process_display(),
we return because our pipe queue is full. However, after X has started and we're
in the desktop, we never hit a full-queue condition. When I shut down the guest,
it briefly returns to the VT until it powers down, and we again hit some full
-queue conditions.
This was true of both the old loop and the new glib loop (with and without the
patch).
There were some differences between the old and new loop. In particular, the new
glib loop called red_process_display() more times in roughly the same elapsed
time. It also had a higher percentage of calls that returned because the queue
was full (but again, this did not happen during "normal" operation, but only
during the boot and shutdown phase). The glib loop with this patch on top had
roughly the same percentage as the glib loop on its own, but the total number of
calls to red_process_display() was higher, unsurprisingly.
Here is some example output:
Old loop: http://ur1.ca/ocnyn
Glib loop: http://ur1.ca/ocnyp
glib loop with this patch applied: http://ur1.ca/oco1u
Notice that lines such as this indicate that GDM has started or I've logged into
GNOME:
main_channel_handle_parsed: agent start
Then I use the desktop for about a minute before shutting down again. No more
log messages are printed until shutdown begins.
I also did another test where I used the guest "normally" (browsing internet,
watching videos, writing text in libreoffice, etc) for quite a while, and still
never triggered the full-queue condition.
I'm not sure what conclusions to draw from this exactly. There is probably some
better way to test this. Perhaps we should test how long a pipe item stays in
the queue before it is sent.
Jonathon
More information about the Spice-devel
mailing list