[PATCH hmm 00/15] Consolidate the mmu notifier interval_tree and locking

Mon Oct 21 15:12:26 UTC 2019

On Mon, Oct 21, 2019 at 02:28:46PM +0000, Koenig, Christian wrote:
> Am 21.10.19 um 15:57 schrieb Jason Gunthorpe:
> > On Sun, Oct 20, 2019 at 02:21:42PM +0000, Koenig, Christian wrote:
> >> Am 18.10.19 um 22:36 schrieb Jason Gunthorpe:
> >>> On Thu, Oct 17, 2019 at 04:47:20PM +0000, Koenig, Christian wrote:
> >>> [SNIP]
> >>>    
> >>>> So again how are they serialized?
> >>> The 'driver lock' thing does it, read the hmm documentation, the hmm
> >>> approach is basically the only approach that was correct of all the
> >>> drivers..
> >> Well that's what I've did, but what HMM does still doesn't looks correct
> >> to me.
> > It has a bug, but the basic flow seems to work.
> >
> > https://patchwork.kernel.org/patch/11191
> 
> Maybe wrong link? That link looks like an unrelated discussion on kernel 
> image relocation.

Sorry, it got corrupted:

https://patchwork.kernel.org/patch/11191397/

> >>> So long as the 'driver lock' is held the range cannot become
> >>> invalidated as the 'driver lock' prevents progress of invalidation.
> >> Correct, but the problem is it doesn't wait for ongoing operations to
> >> complete.
> >>
> >> See I'm talking about the following case:
> >>
> >> Thread A    Thread B
> >> invalidate_range_start()
> >>                       mmu_range_read_begin()
> >>                       get_user_pages()/hmm_range_fault()
> >>                       grab_driver_lock()
> >> Updating the ptes
> >> invalidate_range_end()
> >>
> >> As far as I can see in invalidate_range_start() the driver lock is taken
> >> to make sure that we can't start any invalidation while the driver is
> >> using the pages for a command submission.
> > Again, this uses the seqlock like scheme *and* the driver lock.
> >
> > In this case after grab_driver_lock() mmu_range_read_retry() will
> > return false if Thread A has progressed to 'updating the ptes.
> >
> > For instance here is how the concurrency resolves for retry:
> >
> >         CPU1                                CPU2
> >                                    seq = mmu_range_read_begin()
> > invalidate_range_start()
> >    invalidate_seq++
> 
> How that was order was what confusing me. But I've read up on the code 
> in mmu_range_read_begin() and found the lines I was looking for:
> 
> +    if (is_invalidating)
> +        wait_event(mmn_mm->wq,
> +               READ_ONCE(mmn_mm->invalidate_seq) != seq);
> 
> [SNIP]

Right, the basic design is that the 'seq' returned by
mmu_range_read_begin() is never currently under invalidation.

Thus if the starting seq is not invalidating, then if it doesn't
change we are guaranteed the ptes haven't changed either.

> > For the above I've simplified the mechanics of the invalidate_seq, you
> > need to look through the patch to see how it actually works.
> 
> Yea, that you also allow multiple write sides is pretty neat.

Complicated, but necessary to make the non-blocking OOM stuff able to
read the interval tree under all conditions :\

> > One of the motivations for this work is to squash that bug by adding a
> > seqlock like pattern. But the basic hmm flow and collision-retry
> > approach seems sound.
> >
> > Do you see a problem with this patch?
> 
> No, not any more.

Okay, great, thanks

> Essentially you are doing the same thing I've tried to do with the 
> original amdgpu implementation. The difference is that you don't try to 
> use a per range sequence (which is a good idea, we never got that fully 
> working) and you allow multiple writers at the same time.

Yes, ODP had the per-range sequence and it never worked right
either. Keeping track of things during the invalidate_end was too complex

> Feel free to stitch an Acked-by: Christian König 
> <christian.koenig at amd.com> on patch #2,

Thanks! Can you also take some review and test for the AMD related
patches? These were quite hard to make, it is very likely I've made an
error..

> but you still doing a bunch of things in there which are way beyond
> my understanding (e.g. where are all the SMP barriers?).

The only non-locked data is 'struct mmu_range_notifier->invalidate_seq'

On the write side it uses a WRITE_ONCE. The 'seq' it writes is
generated under the mmn_mm->lock spinlock (held before and after the
WRITE_ONCE) such that all concurrent WRITE_ONCE's are storing the same
value. 

Essentially the spinlock is providing the barrier to order write:

invalidate_range_start():
 spin_lock(&mmn_mm->lock);
 mmn_mm->active_invalidate_ranges++;
 mmn_mm->invalidate_seq |= 1;
 *seq = mmn_mm->invalidate_seq;
 spin_unlock(&mmn_mm->lock);

 WRITE_ONCE(mrn->invalidate_seq, cur_seq);

invalidate_range_end()
 spin_lock(&mmn_mm->lock);
 if (--mmn_mm->active_invalidate_ranges)
    mmn_mm->invalidate_seq++
 spin_unlock(&mmn_mm->lock);

ie when we do invalidate_seq++, due to the active_invalidate_ranges
counter and the spinlock, we know all other WRITE_ONCE's have passed
their spin_unlock and no concurrent ones are starting. The spinlock is
providing the barrier here.

On the read side.. It is a similar argument, except with the
driver_lock:

mmu_range_read_begin()
  seq = READ_ONCE(mrn->invalidate_seq);

Here 'seq' may be the current value, or it may be an older
value. Ordering is eventually provided by the driver_lock:

mn_tree_invalidate_start()
 [..]
 WRITE_ONCE(mrn->invalidate_seq, cur_seq);
 driver_lock()
 driver_unlock()

vs
 driver_lock()
   mmu_range_read_begin()
     return seq != READ_ONCE(mrn->invalidate_seq);
 driver_unlock()

Here if mn_tree_invalidate_start() has passed driver_unlock() then
because we do driver_lock() before mmu_range_read_begin() we must
observe the WRITE_ONCE. ie the driver_unlock()/driver_lock() provide
the pair'd barrier.

If mn_tree_invalidate_start() has not passed driver_lock() then the
mmu_range_read_begin() side prevents it from passing driver_lock()
while it holds the lock. In this case it is OK if we don't observe the
WRITE_ONCE() that was done just before as the invalidate_start()
thread can't proceed to invalidation.

It is very unusual locking, it would be great if others could help
look at it!

The unusual bit in all of this is using a lock's critical region to
'protect' data for read, but updating that same data before the lock's
critical secion. ie relying on the unlock barrier to 'release' program
ordered stores done before the lock's own critical region, and the
lock side barrier to 'acquire' those stores.

This approach is borrowed from the hmm mirror implementation..

If for some reason the scheme doesn't work, then the fallback is to
expand the mmn_mm->lock spinlock to protect the mrn->invalidate_seq at
some cost in performance.

Jason