[Spice-devel] [PATCH spice-server v4 0/4] Better handling reset for streaming device

Mon Feb 12 10:53:33 UTC 2018

> 
> > On 12 Feb 2018, at 09:04, Frediano Ziglio <fziglio at redhat.com> wrote:
> > 
> >> 
> >> On 01/30/2018 04:33 PM, Christophe de Dinechin wrote:
> >>> 
> >>> Christophe de Dinechin writes:
> >>> 
> >>>>> On 30 Jan 2018, at 12:56, Frediano Ziglio <fziglio at redhat.com> wrote:
> >>>>> 
> >>>>>> 
> >>>>>> Hi Frediano,
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>>> On 30 Jan 2018, at 11:50, Frediano Ziglio <fziglio at redhat.com> wrote:
> >>>>>>> 
> >>>>>>> ping the series
> >>>>>> 
> >>>>>> I’m currently looking at it. Is it supposed to fix the read errors I
> >>>>>> had
> >>>>>> when
> >>>>>> restarting the streaming agent?
> >>>>>> 
> >>>>> 
> >>>>> Yes, make the reset more stable.
> >>>>> When you close the device the state will be more consistent allowing
> >>>>> basically to kill the process using the device in any state and opening
> >>>>> again. Obviously if you continue to send wrong commands the device will
> >>>>> keep rejecting them.
> >>>>> 
> >>>>> I tried to reproduce the issues reported on IRC and these helped me,
> >>>>> now I avoid entirely to reboot the guest.
> >>>> 
> >>>> OK, right now I get a QEMU crash whenever I do any kind of activity
> >>>> (the keyboard seems to be what triggers it).
> >>>> 
> >>>> I’m trying to reproduce on master to see if your patch is the cause.
> >>>> That host has gone through some unusual nastiness, and may be
> >>>> in a geborked state.
> >>> 
> >>> Reverting the server to master, I am back to the behavior I had before,
> >>> where the same series of events leads to
> >>> 
> >>> DISPLAY=:1 spice-streaming-agent -c noblock=yes
> >>> spice-streaming-agent[2240]: UNKNOWN msg of type 5
> >>> spice-streaming-agent[2240]: BAD VERSION 0 (expected is 1)
> >>> spice-streaming-agent[2240]: BAD VERSION 108 (expected is 1)
> >>> spice-streaming-agent[2240]: BAD VERSION 97 (expected is 1)
> >>> spice-streaming-agent[2240]: read command from device FAILED -- read 1
> >>> expected 8
> >>> spice-streaming-agent[2240]: FAILED to read command
> >> 
> >> 
> >> Hi Christophe,
> >> 
> >> There are some problems here:
> >> 1. (I guess) your spice-streaming-agent sends CURSOR messages
> >>    which currently the spice-server does not know to handle.
> >>    (Frediano sent patches for that but still no review).
> >> 
> >>    For now try to comment out the code in spice-streaming-agent
> >>    that sends CURSOR commands.
> >> 
> > 
> > Would not make sense to review the patches instead of having
> > to write another workaround also taking into account that
> > the review process is taking more than 6 months?
> > 
> >> 2. spice-server gets an unknown command. It sends a message to the
> >>    spice-streaming-agent to let it know the server read an invalid
> >>    message. spice-streaming-agent is missing a code to handle such
> >>    a message.
> >> 
> >>    This should be fixed.
> >> 
> >> 3. When such messages are in play, they are not fully read (code
> >>    breaks after reading the header). This makes the spice-server
> >>    and spice-streaming-agent go out of sync).
> >> 
> >>    This may be fixed. I'm not sure we can recover
> >>    all errors, and perhaps the right thing to do is sync
> >>    with close/open of the virtio-serial-port.
> > 
> > This is what this patch series is doing.
> 
> And indeed, that patch series improves things if you have a recent enough
> QEMU. But it triggers a but in 2.9.1, which crashes QEMU and kills the VM. I
> saw that on my test machine, Frediano now reproduced.

but -> bug (if is not obvious).

Yes, is a bug recognized and fixed in recent Qemu.
Today I was checking the RHEL 7.5 package (they have 1800 patches on top!)
and the fix is not included.
Is not clear how to handle this bug.
Detecting if runtime Qemu has this bug is not easy (is a NULL check inside
a function). Also there are cases where spice-server is not used inside Qemu.
Detecting the Qemu version at compile time is not easy either as it adds a
reverse build dependency at build time (consider the RPM build) which
is a bad thing and also would not cover easily cases where an old
version has the fix for this bug.
Also we have to consider different distributions.
I can split the patch in 2 to avoid to trigger the bug on buggy versions
but this is not optimal, the device will keep working and potentially
the agent will ignore the error.
The Qemu bug is triggered when the device is closed inside a write
operation. The way character device works in case of spice is that Qemu
calls spice_server_char_device_wakeup which try to read and write.
In the patch I propose I call state(sin, 0) which closes the device
inside read, inside the Qemu write.
One possible workaround would be to call state(sin, 0) not in read
but the way I see would be to schedule a timer event that would
check if device has a pending error and in this case attempting to
close the device. Quite ugly code potentially not that easy to maintain.
I'll try to code and test this last workaround.

> > 
> >>    However reading the whole message (if it's not too big) should
> >>    at least keep the server/agent synchronized, even if an unknown
> >>    message encountered.
> >> 
> > 
> > No, is not enough. If the message is expected to change the server
> > state discarding the message will make the dialog out of sync anyway.
> > Would make sense if the server could send back an error for a specific
> > message (currently if you send a batch of messages you don't know
> > the relationship between error and message). Also would mean that
> > the client MUST handle the error from a message and as the messages
> > replies are async to keep a queue of messages.
> > To me it seems that at the end would be just more complicated than
> > expected instead of graceful.
> > 
> >> Regards,
> >>     Uri.
> >> 
> >> 
> > 

Frediano