[RFC PATCH] KVM: Introduce KVM VIRTIO device

Mon Dec 18 02:58:40 UTC 2023

On Fri, Dec 15, 2023 at 02:23:48PM +0800, Tian, Kevin wrote:
> > From: Zhao, Yan Y <yan.y.zhao at intel.com>
> > Sent: Thursday, December 14, 2023 6:35 PM
> > 
> > - For host non-MMIO pages,
> >   * virtio guest frontend and host backend driver should be synced to use
> >     the same memory type to map a buffer. Otherwise, there will be
> >     potential problem for incorrect memory data. But this will only impact
> >     the buggy guest alone.
> >   * for live migration,
> >     as QEMU will read all guest memory during live migration, page aliasing
> >     could happen.
> >     Current thinking is to disable live migration if a virtio device has
> >     indicated its noncoherent state.
> >     As a follow-up, we can discuss other solutions. e.g.
> >     (a) switching back to coherent path before starting live migration.
> 
> both guest/host switching to coherent or host-only?
> 
> host-only certainly is problematic if guest is still using non-coherent.
Both.

> on the other hand I'm not sure whether the host/guest gfx stack is
> capable of switching between coherent and non-coherent path in-fly
> when the buffer is right being rendered.
> 
Yes, I'm not sure about it either, but it is still an option.

> >     (b) read/write of guest memory with clflush during live migration.
> 
> write is irrelevant as it's only done in the resume path where the
> guest is not running.
Given that the host write is done with PAT WB and the hardware is in
no-snoop mode, would it be better to perform a cache flush after the host
write? (I can investigate further to check whether it's necessary.)

BTW, there's also post-copy live migration, in which case the guest is
running :)

> 
> > 
> > Implementation Consideration
> > ===
> > There is a previous series [1] from google to serve the same purpose to
> > let KVM be aware of virtio GPU's noncoherent DMA status. That series
> > requires a new memslot flag, and special memslots in user space.
> > 
> > We don't choose to use memslot flag to request honoring guest memory
> > type.
> 
> memslot flag has the potential to restrict the impact e.g. when using
> clflush-before-read in migration? Of course the implication is to
> honor guest type only for the selected slot in KVM instead of applying
> to the entire guest memory as in previous series (which selects this
> way because vmx_get_mt_mask() is in perf-critical path hence not
> good to check memslot flag?)
>
I think checking a memslot flag is fine in itself.
But a memslot flag does not carry the memory type that the host is using
for the memslot.
On the other hand, virtio GPU is not the only source of non-coherent DMA.
The memslot flag approach is not applicable to pass-through GPUs, due to
the lack of coordination between guest and host.

> > Instead we hope to make the honoring request to be explicit (not tied to a
> > memslot flag). This is because once guest memory type is honored, not only
> > memory used by guest virtio device, but all guest memory is facing page
> > aliasing issue potentially. KVM needs a generic solution to take care of
> > page aliasing issue rather than counting on memory type of a special
> > memslot being aligned in host and guest.
> > (we can discuss what a generic solution to handle page aliasing issue will
> > look like in later follow-up series).
> > 
> > On the other hand, we choose to introduce a KVM virtio device rather than
> > just provide an ioctl to wrap kvm_arch_[un]register_noncoherent_dma()
> > directly, which is based on considerations that
> 
> I wonder it's over-engineered for the purpose.
> 
> why not just introducing a KVM_CAP and allowing the VMM to enable?
As we hope to increase the non-coherent DMA count on hot-plug of a
non-coherent device and decrease it on hot-unplug of that device, a KVM_CAP
looks like it would require the user to maintain a ref count before turning
it on/off, which is less desirable. Agree?
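
To illustrate the ref-count point, here is a minimal user-space sketch (not
kernel code, and not part of the patch). It models a per-VM counter that is
bumped on hot-plug and dropped on hot-unplug, in the spirit of
kvm_arch_[un]register_noncoherent_dma(). struct vm_model and the model_*
helpers are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of a VM; not a real KVM structure. */
struct vm_model {
	int noncoherent_dma_count;	/* one reference per non-coherent device */
};

/* Would be called on hot-plug of a non-coherent device. */
void model_register_noncoherent_dma(struct vm_model *vm)
{
	vm->noncoherent_dma_count++;
}

/* Would be called on hot-unplug of a non-coherent device. */
void model_unregister_noncoherent_dma(struct vm_model *vm)
{
	/* An unplug without a matching plug is a caller bug. */
	assert(vm->noncoherent_dma_count > 0);
	vm->noncoherent_dma_count--;
}

/* Guest memory type is honored while at least one device is registered. */
bool model_has_noncoherent_dma(const struct vm_model *vm)
{
	return vm->noncoherent_dma_count > 0;
}
```

With a per-device register/unregister pair, the count lives in one place.
With a plain on/off capability, the VMM would have to keep an equivalent
counter itself and only toggle the capability on the 0 <-> 1 transitions.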

> KVM doesn't need to know the exact source of requiring it...
Maybe we can use the source info like this:
1. indicate that the source is not a passthrough device
2. record the relationship between GPA and memory type.

Then, if KVM knows that the non-coherent DMA sources do not include any
passthrough devices, it can force a GPA's memory type (ignoring guest PAT)
to the one specified by the host (in 2), so as to avoid cache flush
operations before live migration.

If passthrough devices get involved later, we can zap the EPT and rebuild
the memory type to honor guest PAT, resorting to cache flushes before
live migration to maintain coherency.
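
A rough user-space sketch of that decision, just to make the idea concrete.
It is not KVM code: enum mem_type, struct gpa_policy and choose_policy() are
hypothetical names, and the memory-type values are placeholders rather than
real EPT/PAT encodings. It assumes at least one non-coherent DMA source is
already registered.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder memory types for illustration; not real EPT/PAT encodings. */
enum mem_type { MT_UC, MT_WC, MT_WB };

/* Hypothetical result of the decision for one GPA. */
struct gpa_policy {
	enum mem_type type;		/* memory type to install for the GPA */
	bool flush_before_migration;	/* cache flush needed before reading? */
};

/*
 * host_type:  type recorded for the GPA by the host (item 2 above)
 * guest_type: type the guest asked for through its PAT
 */
struct gpa_policy choose_policy(bool has_passthrough,
				enum mem_type host_type,
				enum mem_type guest_type)
{
	struct gpa_policy p;

	/* Sanity check on the placeholder enum range. */
	assert(host_type <= MT_WB && guest_type <= MT_WB);

	if (!has_passthrough) {
		/* Ignore guest PAT and force the host's type: no aliasing. */
		p.type = host_type;
		p.flush_before_migration = false;
	} else {
		/* Honor guest PAT; aliasing is possible, so flush first. */
		p.type = guest_type;
		p.flush_before_migration = true;
	}
	return p;
}
```

When has_passthrough flips from false to true, the existing mappings would
have to be zapped so that they get rebuilt with the second policy.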