[PATCH 0/5] QEMU VFIO live migration

Alex Williamson alex.williamson at redhat.com
Fri Mar 15 02:24:02 UTC 2019


On Thu, 14 Mar 2019 19:05:06 -0400
Zhao Yan <yan.y.zhao at intel.com> wrote:

> On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:
> > On Wed, 13 Mar 2019 21:12:22 -0400
> > Zhao Yan <yan.y.zhao at intel.com> wrote:
> >   
> > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:  
> > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > Zhao Yan <yan.y.zhao at intel.com> wrote:
> > > >     
> > > > > hi Alex
> > > > > Any comments to the sequence below?
> > > > > 
> > > > > Actaully we have some concerns and suggestions to userspace-opaque migration
> > > > > data.
> > > > > 
> > > > > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > > > > migration. 
> > > > >    e.g. vendor driver has to know state (running + not logging) should not
> > > > >    return any data, and state (running + logging) should return whole
> > > > >    snapshot first and dirty later. it also has to know qemu migration will
> > > > >    not call GET_BUFFER in state (running + not logging), otherwise, it has
> > > > >    to adjust its behavior.    
> > > > 
> > > > This all just sounds like defining the protocol we expect with the
> > > > interface.  For instance if we define a session as beginning when
> > > > logging is enabled and ending when the device is stopped and the
> > > > interface reports no more data is available, then we can state that any
> > > > partial accumulation of data is incomplete relative to migration.  If
> > > > userspace wants to initiate a new migration stream, they can simply
> > > > toggle logging.  How the vendor driver provides the data during the
> > > > session is not defined, but beginning the session with a snapshot
> > > > followed by repeated iterations of dirtied data is certainly a valid
> > > > approach.
> > > >     
> > > > > 2. vendor driver cannot ensure userspace get all the data it intends to
> > > > > save in pre-copy phase.
> > > > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > > > >   data in previous phase.    
> > > > 
> > > > First, I don't think the device has control of when QEMU switches from
> > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > transition at any point.  However, it seems a simply data available
> > > > counter provides an indication of when it might be optimal to make such
> > > > a transition.  If a vendor driver follows a scheme as above, the
> > > > available data counter would indicate a large value, the entire initial
> > > > snapshot of the device.  As the migration continues and pages are
> > > > dirtied, the device would reach a steady state amount of data
> > > > available, depending on the guest activity.  This could indicate to the
> > > > user to stop the device.  The migration stream would not be considered
> > > > completed until the available data counter reaches zero while the
> > > > device is in the stopped|logging state.
> > > >     
> > > > > 3. if all the sequence is tightly bound to live migration, can we remove the
> > > > > logging state? what about adding two states migrate-in and migrate-out?
> > > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > > >    migrate-out is for source side when migration starts. together with
> > > > >    state running and stopped, it can substitute state logging.
> > > > >    migrate-in is for target side.    
> > > > 
> > > > In fact, Kirti's implementation specifies a data direction, but I think
> > > > we still need logging to indicate sessions.  I'd also assume that
> > > > logging implies some overhead for the vendor driver.
> > > >    
> > > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > > migrate-out are more universal againt hardware requirement changes.
> > >   
> > > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:    
> > > > > > hi Alex
> > > > > > thanks for your reply.
> > > > > > 
> > > > > > So, if we choose migration data to be userspace opaque, do you think below
> > > > > > sequence is the right behavior for vendor driver to follow:
> > > > > > 
> > > > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > > > > vendor driver,  vendor driver should reject and return 0.    
> > > > 
> > > > What would this state mean otherwise?  If we're not logging then it
> > > > should not be expected that we can construct dirtied data from a
> > > > previous read of the state before logging was enabled (it would be
> > > > outside of the "session").  So at best this is an incomplete segment of
> > > > the initial snapshot of the device, but that presumes how the vendor
> > > > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > > > driver reject it, but I think we should consider it undefined and
> > > > vendor specific relative to the migration interface.
> > > >     
> > > > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > > > driver,
> > > > > >    a. vendor driver shoud first query a whole snapshot of device memory
> > > > > >    (let's use this term to represent device's standalone memory for now),
> > > > > >    b. vendor driver returns a chunk of data just queried to userspace,
> > > > > >    while recording current pos in data.
> > > > > >    c. vendor driver finds all data just queried is finished transmitting to
> > > > > >    userspace, and queries only dirty data in device memory now.
> > > > > >    d. vendor driver returns a chunk of data just quered (this time is dirty
> > > > > >    data )to userspace while recording current pos in data
> > > > > >    e. if all data is transmited to usespace and still GET_BUFFERs come from
> > > > > >    userspace, vendor driver starts another round of dirty data query.    
> > > > 
> > > > This is a valid vendor driver approach, but it's outside the scope of
> > > > the interface definition.  A vendor driver could also decide to not
> > > > provide any data until both stopped and logging are set and then
> > > > provide a fixed, final snapshot.  The interface supports either
> > > > approach by defining the protocol to interact with it.
> > > >     
> > > > > > 3. if LOGGING state is unset then, and userpace calls GET_BUFFER to vendor
> > > > > > driver,
> > > > > >    a. if vendor driver finds there's previously untransmitted data, returns
> > > > > >    them until all transmitted.
> > > > > >    b. vendor driver then queries dirty data again and transmits them.
> > > > > >    c. at last, vendor driver queris device config data (which has to be
> > > > > >    queried at last and sent once) and transmits them.    
> > > > 
> > > > This seems broken, the vendor driver is presuming the user intentions.
> > > > If logging is unset, we return to bullet 1, reading data is undefined
> > > > and vendor specific.  It's outside of the session.
> > > >     
> > > > > > for the 1 bullet, if LOGGING state is firstly set and migration aborts
> > > > > > then,  vendor driver has to be able to detect that condition. so seemingly,
> > > > > > vendor driver has to know more qemu's migration state, like migration
> > > > > > called and failed. Do you think that's acceptable?    
> > > > 
> > > > If migration aborts, logging is cleared and the device continues
> > > > operation.  If a new migration is started, the session is initiated by
> > > > enabling logging.  Sound reasonable?  Thanks,
> > > >    
> > > 
> > > For the flow, I still have a question.
> > > There are 2 approaches below, which one do you prefer?
> > > 
> > > Approach A, in precopy stage, the sequence is
> > > 
> > > (1)
> > > .save_live_pending --> return whole snapshot size
> > > .save_live_iterate --> save whole snapshot
> > > 
> > > (2)
> > > .save_live_pending --> get dirty data, return dirty data size
> > > .save_live_iterate --> save all dirty data
> > > 
> > > (3)
> > > .save_live_pending --> get dirty data again, return dirty data size
> > > .save_live_iterate --> save all dirty data
> > > 
> > > 
> > > Approach B, in precopy stage, the sequence is
> > > (1)
> > > .save_live_pending --> return whole snapshot size
> > > .save_live_iterate --> save part of snapshot
> > > 
> > > (2)
> > > .save_live_pending --> return rest part of whole snapshot size +
> > >                               current dirty data size
> > > .save_live_iterate --> save part of snapshot 
> > > 
> > > (3) repeat (2) until whole snapshot saved.
> > > 
> > > (4) 
> > > .save_live_pending --> get diryt data and return current dirty data size
> > > .save_live_iterate --> save part of dirty data
> > > 
> > > (5)
> > > .save_live_pending --> return reset part of dirty data size +
> > > 			     delta size of dirty data
> > > .save_live_iterate --> save part of dirty data
> > > 
> > > (6)
> > > repeat (5) until precopy stops  
> > 
> > I don't really understand the question here.  If the vendor driver's
> > approach is to send a full snapshot followed by iterations of dirty
> > data, then when the user enables logging and reads the counter for
> > available data it should report the (size of the snapshot).  The next
> > time the user reads the counter, it should report the size of the
> > (size of the snapshot) - (what the user has already read) + (size of
> > the dirty data since the snapshot).  As the user continues to read past
> > the snapshot data, the available data counter transitions to reporting
> > only the size of the remaining dirty data, which is monotonically
> > increasing.  I guess this would be more similar to your approach B,
> > which seems to suggest that the interface needs to continue providing
> > data regardless of whether the user fully exhausted the available data
> > from the previous cycle.  Thanks,
> >  
> 
> Right. But regarding to the VFIO migration code in QEMU, rather than save
> one chunk each time, do you think it is better to exhaust all reported data
> from .save_live_pending in each .save_live_iterate callback? (eventhough 
> vendor driver will handle the case that if userspace cannot exhaust
> all data, VFIO QEMU can still try to save as many available data as it can
> each time).

Don't you suspect that some devices might have state that's too large
to process in each iteration?  I expect we'll need to use heuristics on
data size or time spent on each iteration round such that some devices
might be able to fully process their pending data while others will
require multiple passes or make up the balance once we've entered stop
and copy.  Thanks,

Alex


More information about the intel-gvt-dev mailing list