[Spice-devel] [PATCH] migration: Don't assert() if MIGRATE_DATA comes before attaching the agent
Christophe Fergeau
cfergeau at redhat.com
Thu May 15 07:14:05 PDT 2014
Hi,
On Wed, Apr 30, 2014 at 08:11:38PM +0300, Uri Lublin wrote:
> Hi Christophe,
>
> I don't understand what exactly goes wrong and how this
> patch fixes the race.
>
>
> On 04/01/2014 04:58 PM, Christophe Fergeau wrote:
> >During seamless migration, after switching host, if a client was connected
> >during the migration, it will have data to send back to the new
> >qemu/spice-server instance. This is handled through MIGRATE_DATA messages.
> >SPICE char devices use such MIGRATE_DATA messages to restore their state.
> >
> >However, the MIGRATE_DATA message can arrive any time after the new qemu
> >instance has started, this can happen before or after the SPICE char
> >devices have been created. In order to handle this, if the migrate data
> >arrives early, it's stored in reds->agent_state.mig_data, and
> >attach_to_red_agent() will restore the agent state as appropriate.
>
> Adding more information -- I think the race here is between qemu-kvm
> migrating the virtio-device state (with main/migration thread) vs spice
> migrating
> "agent" channel state.
What do you call 'spice migrating "agent"" channel state" ? Is it the
"MIGRATE DATA" spice message I I mention or something else?
>
> The virtio-device state load function, calls (if agent service is running in
> the guest)
> virtio_serial_load ->
> qemu_char_fe_open ->
> spice_chr_guest_open ->
> vmc_register_interface ->
> qemu_spice_add_interface ->
> spice_server_add_interface ->
> spice_server_char_device_add_interface ->
> attach_to_red_agent
>
> >Unfortunately this does not work as expected as expected. If
> >attach_to_red_agent() is called before the MIGRATE_DATA message reaches the
> >server, all goes well, but if MIGRATE_DATA reaches the server before
> >attach_to_red_agent() gets called, then some assert() get triggered in
> >spice_char_device_state_restore():
> >
> >((null):32507): Spice-ERROR **: char_device.c:937:spice_char_device_state_restore: assertion `dev->num_clients == 1 && dev->wait_for_migrate_data' failed
> >Thread 3 (Thread 0x7f406b543700 (LWP 32543)):
> >Thread 2 (Thread 0x7f40697ff700 (LWP 32586)):
> >Thread 1 (Thread 0x7f4079b45a40 (LWP 32507)):
>
> If we split this into 2 asserts, one for num_clients and the other to
> wait_for_migrate_data, we'll know which one's at fault.
Both were at fault. (dev->num_client is 0 and dev->wait_for_migrate_data
is FALSE).
>
> >What happens is that dev->wait_for_migrate_data is unconditionally cleared when
> >completing handling of the MIGRATE_DATA message, so it's FALSE when
> >spice_char_device_state_restore() is called.
>
> dev->wait_for_migrate_data is cleared upon
> client_remove/device_reset/state_restore and in state_restore it's
> only after it passed the assert mentioned above.
The race happens when we receive a MIGRATE_DATA message from the client before the codepath you describe above is hit. reds_handle_migrate_data() will get called, and this will clear dev->wait_for_migrate_data. I forgot which codepaths were triggered though when the bug occurs, but this ends up clearing wait_for_migrate_data. I can look closer at this codepath if you want after reproducing.
>
> > Moreover, dev->num_clients is
> >also 0 as this is only increased by spice_char_device_start() which
> >spice_server_char_device_add_interface() calls after
> >attach_to_red_agent()/spice_char_device_state_restore() have been called.
>
> dev->num_clients is set to 1 in spice_char_device_client_add() called
> from spicevmc_connect. So as long as the client did not disconnect
> it should be good.
> With seamless migration that happens even before (qemu-kvm)
> migration starts.
Once again I will have to check exactly what's going on as
dev->num_clients is definitely 0 when attach_to_red_agent() gets called.
>
> >This commit changes the logic in spice_server_char_device_add_interface(),
> >and when there is migrate data pending in reds->agent_state.mig_data, we
> >only attempt to restore it after successfully initializing/starting the
> >needed char device (which will ensure that dev->num_clients is not 0).
>
> I think that if the client is connected we should have dev->num_clients set
> to 1.
Ah, if this is the real bug, I'll have to look again at all that ;) When
I hit the bug, there is no client attached when attach_to_red_agent is
called.
Christophe
More information about the Spice-devel
mailing list