[PATCH 0/5] QEMU VFIO live migration

Gonglei (Arei) arei.gonglei at huawei.com
Wed Feb 20 11:28:46 UTC 2019


> -----Original Message-----
> From: Dr. David Alan Gilbert [mailto:dgilbert at redhat.com]
> Sent: Wednesday, February 20, 2019 7:02 PM
> To: Zhao Yan <yan.y.zhao at intel.com>
> Cc: cjia at nvidia.com; kvm at vger.kernel.org; aik at ozlabs.ru;
> Zhengxiao.zx at alibaba-inc.com; shuangtai.tst at alibaba-inc.com;
> qemu-devel at nongnu.org; kwankhede at nvidia.com; eauger at redhat.com;
> yi.l.liu at intel.com; eskultet at redhat.com; ziye.yang at intel.com;
> mlevitsk at redhat.com; pasic at linux.ibm.com; Gonglei (Arei)
> <arei.gonglei at huawei.com>; felipe at nutanix.com; Ken.Xue at amd.com;
> kevin.tian at intel.com; alex.williamson at redhat.com;
> intel-gvt-dev at lists.freedesktop.org; changpeng.liu at intel.com;
> cohuck at redhat.com; zhi.a.wang at intel.com; jonathan.davies at nutanix.com
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> * Zhao Yan (yan.y.zhao at intel.com) wrote:
> > On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.zhao at intel.com) wrote:
> > > > This patchset enables VFIO devices to have live migration capability.
> > > > Currently it does not support post-copy phase.
> > > >
> > > > It follows Alex's comments on last version of VFIO live migration patches,
> > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > query.
> > >
> > > Hi,
> > >   I've sent minor comments to later patches; but some minor general
> > > comments:
> > >
> > >   a) Never trust the incoming migrations stream - it might be corrupt,
> > >     so check when you can.
> > hi Dave
> > Thanks for this suggestion. I'll add more checks for migration streams.
> >
> >
> > >   b) How do we detect if we're migrating from/to the wrong device or
> > > version of device?  Or say to a device with older firmware or perhaps
> > > a device that has less device memory ?
> > Actually it's still an open for VFIO migration. Need to think about
> > whether it's better to check that in libvirt or qemu (like a device magic
> > along with verion ?).

We must keep the hardware generation is the same with one POD of public cloud
providers. But we still think about the live migration between from the the lower
generation of hardware migrated to the higher generation.

> > This patchset is intended to settle down the main device state interfaces
> > for VFIO migration. So that we can work on that and improve it.
> >
> >
> > >   c) Consider using the trace_ mechanism - it's really useful to
> > > add to loops writing/reading data so that you can see when it fails.
> > >
> > > Dave
> > >
> > Got it. many thanks~~
> >
> >
> > > (P.S. You have a few typo's grep your code for 'devcie', 'devie' and
> > > 'migrtion'
> >
> > sorry :)
> 
> No problem.
> 
> Given the mails, I'm guessing you've mostly tested this on graphics
> devices?  Have you also checked with VFIO network cards?
> 
> Also see the mail I sent in reply to Kirti's series; we need to boil
> these down to one solution.
> 
> Dave
> 
> > >
> > > > Device Data
> > > > -----------
> > > > Device data is divided into three types: device memory, device config,
> > > > and system memory dirty pages produced by device.
> > > >
> > > > Device config: data like MMIOs, page tables...
> > > >         Every device is supposed to possess device config data.
> > > >     	Usually device config's size is small (no big than 10M), and it
> > > >         needs to be loaded in certain strict order.
> > > >         Therefore, device config only needs to be saved/loaded in
> > > >         stop-and-copy phase.
> > > >         The data of device config is held in device config region.
> > > >         Size of device config data is smaller than or equal to that of
> > > >         device config region.
> > > >
> > > > Device Memory: device's internal memory, standalone and outside
> system
> > > >         memory. It is usually very big.
> > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > >         stop-and-copy phase.
> > > >         The data of device memory is held in device memory region.
> > > >         Size of devie memory is usually larger than that of device
> > > >         memory region. qemu needs to save/load it in chunks of size of
> > > >         device memory region.
> > > >         Not all device has device memory. Like IGD only uses system
> memory.
> > > >
> > > > System memory dirty pages: If a device produces dirty pages in system
> > > >         memory, it is able to get dirty bitmap for certain range of
> system
> > > >         memory. This dirty bitmap is queried in pre-copy and
> stop-and-copy
> > > >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > > >         callback, dirty pages in system memory will be save/loaded by
> ram's
> > > >         live migration code.
> > > >         The dirty bitmap of system memory is held in dirty bitmap
> region.
> > > >         If system memory range is larger than that dirty bitmap region
> can
> > > >         hold, qemu will cut it into several chunks and get dirty bitmap in
> > > >         succession.
> > > >
> > > >
> > > > Device State Regions
> > > > --------------------
> > > > Vendor driver is required to expose two mandatory regions and another
> two
> > > > optional regions if it plans to support device state management.
> > > >
> > > > So, there are up to four regions in total.
> > > > One control region: mandatory.
> > > >         Get access via read/write system call.
> > > >         Its layout is defined in struct vfio_device_state_ctl
> > > > Three data regions: mmaped into qemu.
> > > >         device config region: mandatory, holding data of device config
> > > >         device memory region: optional, holding data of device memory
> > > >         dirty bitmap region: optional, holding bitmap of system
> memory
> > > >                             dirty pages
> > > >
> > > > (The reason why four seperate regions are defined is that the unit of
> mmap
> > > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > > control and three mmaped regions for data seems better than one big
> region
> > > > padded and sparse mmaped).
> > > >
> > > >
> > > > kernel device state interface [1]
> > > > --------------------------------------
> > > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > >
> > > > #define VFIO_DEVICE_STATE_RUNNING 0
> > > > #define VFIO_DEVICE_STATE_STOP 1
> > > > #define VFIO_DEVICE_STATE_LOGGING 2
> > > >
> > > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > >
> > > > struct vfio_device_state_ctl {
> > > > 	__u32 version;		  /* ro */
> > > > 	__u32 device_state;       /* VFIO device state, wo */
> > > > 	__u32 caps;		 /* ro */
> > > >         struct {
> > > > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */
> > > > 		__u64 size;    /*rw*/
> > > > 	} device_config;
> > > > 	struct {
> > > > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */
> > > > 		__u64 size;     /* rw */
> > > >                 __u64 pos; /*the offset in total buffer of device
> memory*/
> > > > 	} device_memory;
> > > > 	struct {
> > > > 		__u64 start_addr; /* wo */
> > > > 		__u64 page_nr;   /* wo */
> > > > 	} system_memory;
> > > > };
> > > >
> > > > Devcie States
> > > > -------------
> > > > After migration is initialzed, it will set device state via writing to
> > > > device_state field of control region.
> > > >
> > > > Four states are defined for a VFIO device:
> > > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP
> > > >
> > > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > > >         commands from device driver.
> > > >         It is the default state that a VFIO device enters initially.
> > > >
> > > > STOP:  In this state, a VFIO device is deactivated to interact with
> > > >        device driver.
> > > >
> > > > LOGGING: a special state that it CANNOT exist independently. It must be
> > > >        set alongside with state RUNNING or STOP (i.e. RUNNING &
> LOGGING,
> > > >        STOP & LOGGING).
> > > >        Qemu will set LOGGING state on in .save_setup callbacks, then
> vendor
> > > >        driver can start dirty data logging for device memory and system
> > > >        memory.
> > > >        LOGGING only impacts device/system memory. They return
> whole
> > > >        snapshot outside LOGGING and dirty data since last get
> operation
> > > >        inside LOGGING.
> > > >        Device config should be always accessible and return whole
> config
> > > >        snapshot regardless of LOGGING state.
> > > >
> > > > Note:
> > > > The reason why RUNNING is the default state is that device's active state
> > > > must not depend on device state interface.
> > > > It is possible that region vfio_device_state_ctl fails to get registered.
> > > > In that condition, a device needs be in active state by default.
> > > >
> > > > Get Version & Get Caps
> > > > ----------------------
> > > > On migration init phase, qemu will probe the existence of device state
> > > > regions of vendor driver, then get version of the device state interface
> > > > from the r/w control region.
> > > >
> > > > Then it will probe VFIO device's data capability by reading caps field of
> > > > control region.
> > > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load
> data of
> > > >         device memory in pre-copy and stop-and-copy phase. The data
> of
> > > >         device memory is held in device memory region.
> > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query of dirty
> pages
> > > >         produced by VFIO device during pre-copy and stop-and-copy
> phase.
> > > >         The dirty bitmap of system memory is held in dirty bitmap
> region.
> > > >
> > > > If failing to find two mandatory regions and optional data regions
> > > > corresponding to data caps or version mismatching, it will setup a
> > > > migration blocker and disable live migration for VFIO device.
> > > >
> > > >
> > > > Flows to call device state interface for VFIO live migration
> > > > ------------------------------------------------------------
> > > >
> > > > Live migration save path:
> > > >
> > > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE
> STATE)
> > > >
> > > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > >  |
> > > > MIGRATION_STATUS_SAVE_SETUP
> > > >  |
> > > >  .save_setup callback -->
> > > >  get device memory size (whole snapshot size)
> > > >  get device memory buffer (whole snapshot data)
> > > >  set device state --> VFIO_DEVICE_STATE_RUNNING &
> VFIO_DEVICE_STATE_LOGGING
> > > >  |
> > > > MIGRATION_STATUS_ACTIVE
> > > >  |
> > > >  .save_live_pending callback --> get device memory size (dirty data)
> > > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > > >  .log_sync callback --> get system memory dirty bitmap
> > > >  |
> > > > (vcpu stops) --> set device state -->
> > > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > > >  |
> > > > .save_live_complete_precopy callback -->
> > > >  get device memory size (dirty data)
> > > >  get device memory buffer (dirty data)
> > > >  get device config size (whole snapshot size)
> > > >  get device config buffer (whole snapshot data)
> > > >  |
> > > > .save_cleanup callback -->  set device state -->
> VFIO_DEVICE_STATE_STOP
> > > > MIGRATION_STATUS_COMPLETED
> > > >
> > > > MIGRATION_STATUS_CANCELLED or
> > > > MIGRATION_STATUS_FAILED
> > > >  |
> > > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > >
> > > >
> > > > Live migration load path:
> > > >
> > > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE
> STATE)
> > > >
> > > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > >  |
> > > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > > >  |
> > > > MIGRATION_STATUS_ACTIVE
> > > >  |
> > > > .load state callback -->
> > > >  set device memory size, set device memory buffer, set device config
> size,
> > > >  set device config buffer
> > > >  |
> > > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > >  |
> > > > MIGRATION_STATUS_COMPLETED
> > > >
> > > >
> > > >
> > > > In source VM side,
> > > > In precopy phase,
> > > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > > qemu will first get whole snapshot of device memory in .save_setup
> > > > callback, and then it will get total size of dirty data in device memory in
> > > > .save_live_pending callback by reading device_memory.size field of
> control
> > > > region.
> > > > Then in .save_live_iteration callback, it will get buffer of device memory's
> > > > dirty data chunk by chunk from device memory region by writing pos &
> > > > action (GET_BUFFER) to device_memory.pos & device_memory.action
> fields of
> > > > control region. (size of each chunk is the size of device memory data
> > > > region).
> > > > .save_live_pending and .save_live_iteration may be called several times in
> > > > precopy phase to get dirty data in device memory.
> > > >
> > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy
> phase
> > > > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > > > vendor driver's device state interface to get data from devcie memory.
> > > >
> > > > In precopy phase, if a device has
> VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > > > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > > > region by writing system memory's start address, page count and action
> > > > (GET_BITMAP) to "system_memory.start_addr",
> "system_memory.page_nr", and
> > > > "system_memory.action" fields of control region.
> > > > If page count passed in .log_sync callback is larger than the bitmap size
> > > > the dirty bitmap region supports, Qemu will cut it into chunks and call
> > > > vendor driver's get system memory dirty bitmap interface.
> > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback
> just
> > > > returns without call to vendor driver.
> > > >
> > > > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > > > in save_live_complete_precopy callback,
> > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on,
> > > > get device memory size and get device memory buffer will be called again.
> > > > After that,
> > > > device config data is get from device config region by reading
> > > > devcie_config.size of control region and writing action (GET_BITMAP) to
> > > > device_config.action of control region.
> > > > Then after migration completes, in cleanup handler, LOGGING state will
> be
> > > > cleared (i.e. deivce state is set to STOP).
> > > > Clearing LOGGING state in cleanup handler is in consideration of the case
> > > > of "migration failed" and "migration cancelled". They can also leverage
> > > > the cleanup handler to unset LOGGING state.
> > > >
> > > >
> > > > References
> > > > ----------
> > > > 1. kernel side implementation of Device state interfaces:
> > > > https://patchwork.freedesktop.org/series/56876/
> > > >
> > > >
> > > > Yan Zhao (5):
> > > >   vfio/migration: define kernel interfaces
> > > >   vfio/migration: support device of device config capability
> > > >   vfio/migration: tracking of dirty page in system memory
> > > >   vfio/migration: turn on migration
> > > >   vfio/migration: support device memory capability
> > > >
> > > >  hw/vfio/Makefile.objs         |   2 +-
> > > >  hw/vfio/common.c              |  26 ++
> > > >  hw/vfio/migration.c           | 858
> ++++++++++++++++++++++++++++++++++++++++++
> > > >  hw/vfio/pci.c                 |  10 +-
> > > >  hw/vfio/pci.h                 |  26 +-
> > > >  include/hw/vfio/vfio-common.h |   1 +
> > > >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> > > >  7 files changed, 1174 insertions(+), 9 deletions(-)
> > > >  create mode 100644 hw/vfio/migration.c
> > > >
> > > > --
> > > > 2.7.4
> > > >
> > > --
> > > Dr. David Alan Gilbert / dgilbert at redhat.com / Manchester, UK
> > > _______________________________________________
> > > intel-gvt-dev mailing list
> > > intel-gvt-dev at lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> --
> Dr. David Alan Gilbert / dgilbert at redhat.com / Manchester, UK


More information about the intel-gvt-dev mailing list