[PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
Jason Gunthorpe
jgg at ziepe.ca
Fri Mar 20 00:03:45 UTC 2020
On Thu, Mar 19, 2020 at 03:56:50PM -0700, Ralph Campbell wrote:
> Adding linux-kselftest at vger.kernel.org for the test config question.
>
> On 3/19/20 11:17 AM, Jason Gunthorpe wrote:
> > On Tue, Mar 17, 2020 at 04:14:31PM -0700, Ralph Campbell wrote:
> > >
> > > On 3/17/20 5:59 AM, Christoph Hellwig wrote:
> > > > On Tue, Mar 17, 2020 at 09:47:55AM -0300, Jason Gunthorpe wrote:
> > > > > I've been using v7 of Ralph's tester and it is working well - it has
> > > > > DEVICE_PRIVATE support so I think it can test this flow too. Ralph are
> > > > > you able?
> > > > >
> > > > > This hunk seems trivial enough to me, can we include it now?
> > > >
> > > > I can send a separate patch for it once the tester covers it. I don't
> > > > want to add it to the original patch as it is a significant behavior
> > > > change compared to the existing code.
> > > >
> > >
> > > Attached is an updated version of my HMM tests based on linux-5.6.0-rc6.
> > > I ran this OK with Jason's 8+1 HMM patches, Christoph's 1-5 misc HMM clean ups,
> > > and Christoph's 1-4 device private page changes applied.
> >
> > I'd like to get this to a mergeable state; it looks pretty good now,
> > but I have no idea about selftests - and I'm struggling to even
> > compile the tools dir
> >
> > > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > > index 69def4a9df00..4d22ce7879a7 100644
> > > --- a/lib/Kconfig.debug
> > > +++ b/lib/Kconfig.debug
> > > @@ -2162,6 +2162,18 @@ config TEST_MEMINIT
> > >  	  If unsure, say N.
> > >  
> > > +config TEST_HMM
> > > +	tristate "Test HMM (Heterogeneous Memory Management)"
> > > +	depends on DEVICE_PRIVATE
> > > +	  select HMM_MIRROR
> > > +	  select MMU_NOTIFIER
> >
> > extra spaces
>
> Will fix in v8.
>
> > In general I wonder if it even makes sense that DEVICE_PRIVATE is user
> > selectable?
>
> Should tests enable the feature or the feature enable the test?
> IMHO, if the feature is being compiled into the kernel, that should
> enable the menu item for the test. If the feature isn't selected,
> no need to test it :-)
I meant: should DEVICE_PRIVATE be a user-selectable option at all, or
should it be turned on automatically when a driver like nouveau is
selected?
Is there some downside to enabling DEVICE_PRIVATE?
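If not, the driver could just select it - something like this sketch
against 5.6's drivers/gpu/drm/nouveau/Kconfig (just a sketch;
DEVICE_PRIVATE's own ZONE_DEVICE dependencies would need to be
satisfied some other way, since select ignores dependencies):

config DRM_NOUVEAU_SVM
	bool "(EXPERIMENTAL) Enable SVM (Shared Virtual Memory) support"
	depends on DRM_NOUVEAU
	depends on MMU
	# was "depends on DEVICE_PRIVATE"; selecting it instead means
	# the user never sees DEVICE_PRIVATE as a separate question
	select DEVICE_PRIVATE
	select HMM_MIRROR
	select MMU_NOTIFIER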
> > The notifier holds a mmgrab, no need for another one
>
> OK. I'll replace dmirror->mm with dmirror->notifier.mm.
Right, that is good too.
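For reference, roughly what the fault side looks like then (a sketch
against the 5.6 APIs; dmirror_fault_range() is a made-up wrapper name,
and the mmu_interval_read_begin() retry loop is omitted):

static int dmirror_fault_range(struct dmirror *dmirror,
			       struct hmm_range *range)
{
	/* the interval notifier already holds an mmgrab() on this mm */
	struct mm_struct *mm = dmirror->notifier.mm;
	long ret;

	/* but hmm_range_fault() needs the mm to still have real users */
	if (!mmget_not_zero(mm))
		return -EFAULT;

	down_read(&mm->mmap_sem);
	ret = hmm_range_fault(range, 0);
	up_read(&mm->mmap_sem);
	mmput(mm);

	return ret < 0 ? ret : 0;
}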
> > > + filp->private_data = dmirror;
> >
> > Not sure what this comment means
>
> I'll change the comment to:
> /*
>  * The first open of the device character file registers the address
>  * space of the process doing the open() system call with the device.
>  * Subsequent file opens by other processes will have access to the
>  * first process' address space.
>  */
How does this happen? The function looks like it always does the same thing
> > > +static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
> > > +				const struct mmu_notifier_range *range,
> > > +				unsigned long cur_seq)
> > > +{
> > > +	struct dmirror *dmirror = container_of(mni, struct dmirror, notifier);
> > > +	struct mm_struct *mm = dmirror->mm;
> > > +
> > > +	/*
> > > +	 * If the process doesn't exist, we don't need to invalidate the
> > > +	 * device page table since the address space will be torn down.
> > > +	 */
> > > +	if (!mmget_not_zero(mm))
> > > +		return true;
> >
> > Why? Don't the notifiers provide for this already.
> >
> > mmget_not_zero() is required before calling hmm_range_fault() though
Oh... this is the invalidate_all path during mm teardown.
IMHO you should test the invalidation reason in the range to exclude
this, something like the sketch below.
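(A sketch; the dmirror_do_update() helper and mutex follow your test
driver, and MMU_NOTIFY_RELEASE is the event mn_itree_release() sets:)

static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
					const struct mmu_notifier_range *range,
					unsigned long cur_seq)
{
	struct dmirror *dmirror = container_of(mni, struct dmirror, notifier);

	/*
	 * The whole address space is being torn down; nothing will use
	 * the device page table again, so skip invalidating it.
	 */
	if (range->event == MMU_NOTIFY_RELEASE)
		return true;

	mutex_lock(&dmirror->mutex);
	mmu_interval_set_seq(mni, cur_seq);
	dmirror_do_update(dmirror, range->start, range->end);
	mutex_unlock(&dmirror->mutex);
	return true;
}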
But xa_erase looks totally safe, so there should be no reason to do
that.
> This is a workaround for a problem I don't quite understand.
> If you change tools/testing/selftests/vm/hmm-tests.c line 868 to
> ASSERT_EQ(ret, -1);
> then the test will abort, core dump, and cause two problems:
> 1) the migrated page will be faulted back to system memory in order to
> write it to the core dump. This triggers
> lockdep_assert_held(&walk.mm->mmap_sem) in walk_page_range().
Has the migration stuff become entangled with the xarray?
> [ 137.980718] Code: 80 2f 1a 83 c6 05 e9 8d 7b 01 01 e8 3e b1 b1 fe e9 05 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 41 56 41 55 41 54 55 <48> 89 fd 53 4c 8d 6d 10 e8 3c fc ff ff 49 89 c4 4c 89 e0 83 e0 03
> [ 137.999461] RSP: 0018:ffffc900015e77c8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
> [ 138.007028] RAX: ffff8886e508c408 RBX: 0000000000000000 RCX: ffffffff82626c89
> [ 138.014159] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: ffffc900015e78a0
> [ 138.021293] RBP: ffffc900015e78a0 R08: ffffffff811461c4 R09: fffff520002bcf17
> [ 138.028426] R10: fffff520002bcf16 R11: 0000000000000003 R12: 0000000002606d10
> [ 138.035557] R13: ffff8886e508c448 R14: 0000000000000031 R15: ffffffffa06546a0
> [ 138.042701] ? do_raw_spin_lock+0x104/0x1d0
> [ 138.046888] ? xas_store+0x19/0xa60
> [ 138.050390] xas_store+0x5b3/0xa60
> [ 138.053806] ? register_lock_class+0x860/0x860
> [ 138.058267] __xa_erase+0x96/0x110
> [ 138.061673] ? xas_store+0xa60/0xa60
> [ 138.065267] xa_erase+0x19/0x30
oh, it is doing this:
static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
			     struct mm_struct *mm)
{
	struct mmu_notifier_range range = {
		.flags = MMU_NOTIFIER_RANGE_BLOCKABLE,
		.event = MMU_NOTIFY_RELEASE,
		.mm = mm,
		.start = 0,
		.end = ULONG_MAX,
	};
i.e. it is sitting doing a huge number of xa_erases, I suppose. Probably
in normal exit the notifier is removed before the mm is destroyed.
The xa_erase needs to be a bit smarter and jump over gaps in the tree -
perhaps some xa_for_each()/xa_erase() pattern, like the sketch below?
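(A sketch, assuming the mirror's page table is the xarray dmirror->pt
indexed by pfn, as in the test driver:)

static void dmirror_do_update(struct dmirror *dmirror,
			      unsigned long start, unsigned long end)
{
	unsigned long pfn;
	void *entry;

	/*
	 * Walk only the populated entries, so the 0..ULONG_MAX
	 * invalidation from mn_itree_release() doesn't turn into one
	 * xa_erase() call per possible index.
	 */
	xa_for_each_start(&dmirror->pt, pfn, entry, start >> PAGE_SHIFT) {
		if (pfn > ((end - 1) >> PAGE_SHIFT))
			break;
		xa_erase(&dmirror->pt, pfn);
	}
}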
> > Also I get this:
> >
> > lib/test_hmm.c: In function ‘dmirror_devmem_fault_alloc_and_copy’:
> > lib/test_hmm.c:1041:25: warning: unused variable ‘vma’ [-Wunused-variable]
> > 1041 | struct vm_area_struct *vma = args->vma;
> >
> > But this is a kernel bug, due to alloc_page_vma being a #define not a
> > static inline and me having CONFIG_NUMA off in this .config
>
> Fixed.
in gfp.h?
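Presumably turning the !CONFIG_NUMA stub into a static inline would do
it - an untested sketch:

/* !CONFIG_NUMA case in include/linux/gfp.h: a static inline instead of
 * a #define means the vma/addr arguments are always evaluated, so
 * callers don't trip -Wunused-variable when NUMA is off. */
static inline struct page *alloc_page_vma(gfp_t gfp_mask,
					  struct vm_area_struct *vma,
					  unsigned long addr)
{
	return alloc_pages(gfp_mask, 0);
}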
Jason