[PATCH 2/5] vfio/migration: support device of device config capability
Cornelia Huck
cohuck at redhat.com
Thu Feb 21 10:56:46 UTC 2019
On Wed, 20 Feb 2019 17:54:00 -0500
Zhao Yan <yan.y.zhao at intel.com> wrote:
> On Tue, Feb 19, 2019 at 03:37:24PM +0100, Cornelia Huck wrote:
> > On Tue, 19 Feb 2019 16:52:27 +0800
> > Yan Zhao <yan.y.zhao at intel.com> wrote:
> >
> > > Device config is the default data that every device should have. so
> > > device config capability is by default on, no need to set.
> > >
> > > - Currently two type of resources are saved/loaded for device of device
> > > config capability:
> > > General PCI config data, and Device config data.
> > > They are copies as a whole when precopy is stopped.
> > >
> > > Migration setup flow:
> > > - Setup device state regions, check its device state version and capabilities.
> > > Mmap Device Config Region and Dirty Bitmap Region, if available.
> > > - If device state regions are failed to get setup, a migration blocker is
> > > registered instead.
> > > - Added SaveVMHandlers to register device state save/load handlers.
> > > - Register VM state change handler to set device's running/stop states.
> > > - On migration startup on source machine, set device's state to
> > > VFIO_DEVICE_STATE_LOGGING
> > >
> > > Signed-off-by: Yan Zhao <yan.y.zhao at intel.com>
> > > Signed-off-by: Yulei Zhang <yulei.zhang at intel.com>
> > > ---
> > > hw/vfio/Makefile.objs | 2 +-
> > > hw/vfio/migration.c | 633 ++++++++++++++++++++++++++++++++++++++++++
> > > hw/vfio/pci.c | 1 -
> > > hw/vfio/pci.h | 25 +-
> > > include/hw/vfio/vfio-common.h | 1 +
> > > 5 files changed, 659 insertions(+), 3 deletions(-)
> > > create mode 100644 hw/vfio/migration.c
> > >
> > > diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> > > index 8b3f664..f32ff19 100644
> > > --- a/hw/vfio/Makefile.objs
> > > +++ b/hw/vfio/Makefile.objs
> > > @@ -1,6 +1,6 @@
> > > ifeq ($(CONFIG_LINUX), y)
> > > obj-$(CONFIG_SOFTMMU) += common.o
> > > -obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> > > +obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o
> >
> > I think you want to split the migration code: The type-independent
> > code, and the pci-specific code.
> >
> ok. actually, now only saving/loading of pci generic config data is
> pci-specific. the data getting/setting through device state
> interfaces are type-independent.
Yes. If it has capability chains, it can probably be made to work.
>
> > > obj-$(CONFIG_VFIO_CCW) += ccw.o
> > > obj-$(CONFIG_SOFTMMU) += platform.o
> > > obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
(...)
> > > +static int vfio_save_data_device_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > > +{
> > > + VFIODevice *vbasedev = &vdev->vbasedev;
> > > + VFIORegion *region_ctl =
> > > + &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > > + VFIORegion *region_config =
> > > + &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > > + void *dest;
> > > + uint32_t sz;
> > > + uint8_t *buf = NULL;
> > > + uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> > > + uint64_t len = vdev->migration->devconfig_size;
> > > +
> > > + qemu_put_be64(f, len);
> >
> > Why big endian? (Generally, do we need any endianness considerations?)
> >
> I think big endian is the endian for qemu to save file.
> as long as qemu_put and qemu_get use the same endian, it will be no
> problem.
> do you think so?
Yes, as long we are explicit about the endianness. I'm not sure whether
e.g. power even has the ability to mix endianness.
(...)
> > > +static int vfio_set_device_state(VFIOPCIDevice *vdev,
> > > + uint32_t dev_state)
> > > +{
> > > + VFIODevice *vbasedev = &vdev->vbasedev;
> > > + VFIORegion *region =
> > > + &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > > + uint32_t sz = sizeof(dev_state);
> > > +
> > > + if (!vdev->migration) {
> > > + return -1;
> > > + }
> > > +
> > > + if (pwrite(vbasedev->fd, &dev_state, sz,
> > > + region->fd_offset +
> > > + offsetof(struct vfio_device_state_ctl, device_state))
> > > + != sz) {
> > > + error_report("vfio: Failed to set device state %d", dev_state);
> >
> > Can the kernel reject this if a state transition is not allowed (or are
> > all transitions allowed?)
> >
> yes, kernel can reject some state transition if it's not allowed.
> But currently all transitions are allowed.
> Maybe a check of self-to-self transition is needed in kernel.
Self-to-self looks benign enough to simply return success early.
>
> > > + return -1;
> > > + }
> > > + vdev->migration->device_state = dev_state;
> > > + return 0;
> > > +}
(...)
> > > +static int vfio_check_devstate_version(VFIOPCIDevice *vdev)
> > > +{
> > > + VFIODevice *vbasedev = &vdev->vbasedev;
> > > + VFIORegion *region =
> > > + &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > > +
> > > + uint32_t version;
> > > + uint32_t size = sizeof(version);
> > > +
> > > + if (pread(vbasedev->fd, &version, size,
> > > + region->fd_offset +
> > > + offsetof(struct vfio_device_state_ctl, version))
> > > + != size) {
> > > + error_report("%s Failed to read version of device state interfaces",
> > > + vbasedev->name);
> > > + return -1;
> > > + }
> > > +
> > > + if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> > > + error_report("%s migration version mismatch, right version is %d",
> > > + vbasedev->name, VFIO_DEVICE_STATE_INTERFACE_VERSION);
> >
> > So, we require an exact match... or should we allow to extend the
> > interface in an backwards-compatible way, in which case we'd require
> > (QEMU interface version) <= (kernel interface version)?
> >
> currently yes. we can discuss on that.
If we want to allow that, we need to have a strictly monotonous
progression of versions here (which means dragging along old
compatibility code for basically forever). Maintaining a table about
which version is compatible with which other version would get insane
pretty quickly.
Can we somehow accommodate more optional regions via capabilities?
Maybe via optional vmstates?
> > > + return -1;
> > > + }
> > > +
> > > + return 0;
> > > +}
(...)
> > > +static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > > +{
> > > + PCIDevice *pdev = &vdev->pdev;
> > > + uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
> > > + bool msi_64bit;
> > > +
> > > + /* retore pci bar configuration */
> > > + ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > > + vfio_pci_write_config(pdev, PCI_COMMAND,
> > > + ctl & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> > > + for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > + bar_cfg = qemu_get_be32(f);
> > > + vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);
> > > + }
> > > + vfio_pci_write_config(pdev, PCI_COMMAND,
> > > + ctl | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> > > +
> > > + /* restore msi configuration */
> > > + ctl = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > > + msi_64bit = !!(ctl & PCI_MSI_FLAGS_64BIT);
> > > +
> > > + vfio_pci_write_config(&vdev->pdev,
> > > + pdev->msi_cap + PCI_MSI_FLAGS,
> > > + ctl & (!PCI_MSI_FLAGS_ENABLE), 2);
> > > +
> > > + msi_lo = qemu_get_be32(f);
> > > + vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO, msi_lo, 4);
> > > +
> > > + if (msi_64bit) {
> > > + msi_hi = qemu_get_be32(f);
> > > + vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > > + msi_hi, 4);
> > > + }
> > > + msi_data = qemu_get_be32(f);
> > > + vfio_pci_write_config(pdev,
> > > + pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > > + msi_data, 2);
> > > +
> > > + vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > > + ctl | PCI_MSI_FLAGS_ENABLE, 2);
> > > +
> >
> > Ok, this function is indeed pci-specific and probably should be moved
> > to the vfio-pci code (other types could hook themselves up in the same
> > place, then).
> >
> yes, this is the only pci-specific code.
> maybe using VFIO_DEVICE_TYPE_PCI as a sign to decide whether to save/load
> pci config data?
> or as Dave said, put saving/loading of pci config data into
> VMStateDescription interface?
I like having an interface like vmstate where other types can register
themselves better than introducing conditional handling based on
hard-coded type values.
More information about the intel-gvt-dev
mailing list