[PATCH V6 3/6] mm: Introduce VM_LOCKONFAULT

Vlastimil Babka vbabka at suse.cz
Thu Aug 6 08:33:54 PDT 2015


On 07/29/2015 05:42 PM, Eric B Munson wrote:
> The cost of faulting in all memory to be locked can be very high when
> working with large mappings.  If only portions of the mapping will be
> used this can incur a high penalty for locking.
>
> For the example of a large file, this is the usage pattern for a large
> statistical language model (it probably applies to other statistical or
> graphical models as well).  For the security example, consider any
> application transacting in data that must not be swapped out (credit
> card data, medical records, etc.).
>
> This patch introduces the ability to request that pages are not
> pre-faulted, but are placed on the unevictable LRU when they are finally
> faulted in.  The VM_LOCKONFAULT flag will be used together with
> VM_LOCKED and has no effect when set without VM_LOCKED.  Setting the
> VM_LOCKONFAULT flag for a VMA will cause pages faulted into that VMA to
> be added to the unevictable LRU when they are faulted or if they are
> already present, but will not cause any missing pages to be faulted in.
>
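To make the intended semantics concrete, here is a minimal userspace
sketch.  It assumes the mlock2() syscall and MLOCK_ONFAULT flag proposed
elsewhere in this series; there is no libc wrapper, so __NR_mlock2 and
the flag value have to come from the series' patched uapi headers, not
from this patch.

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Sketch only: mlock2() and MLOCK_ONFAULT are introduced by other
 * patches in this series.
 */
int main(void)
{
	size_t len = 1UL << 30;		/* large, sparsely used mapping */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/* Lock the range, but do not pre-fault it: pages are added to
	 * the unevictable LRU only once they are actually touched. */
	if (syscall(__NR_mlock2, buf, len, MLOCK_ONFAULT))
		return 1;

	buf[0] = 1;	/* this page is now resident and mlocked */
	return 0;
}

Compared to plain mlock(), the mlock2(..., MLOCK_ONFAULT) call returns
without touching the mostly unused gigabyte; each page becomes
unevictable only when it is first faulted in.
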
> Exposing this new lock state means that we cannot overload the meaning
> of the FOLL_POPULATE flag any longer.  Prior to this patch it was used
> to mean that the VMA for a fault was locked.  This means we need the
> new FOLL_MLOCK flag to communicate the locked state of a VMA.
> FOLL_POPULATE will now only control if the VMA should be populated and
> in the case of VM_LOCKONFAULT, it will not be set.
>
> Signed-off-by: Eric B Munson <emunson at akamai.com>
> Acked-by: Kirill A. Shutemov <kirill.shutemov at linux.intel.com>
> Cc: Michal Hocko <mhocko at suse.cz>
> Cc: Vlastimil Babka <vbabka at suse.cz>
> Cc: Jonathan Corbet <corbet at lwn.net>
> Cc: "Kirill A. Shutemov" <kirill at shutemov.name>
> Cc: linux-kernel at vger.kernel.org
> Cc: dri-devel at lists.freedesktop.org
> Cc: linux-mm at kvack.org
> Cc: linux-api at vger.kernel.org
> ---
>   drivers/gpu/drm/drm_vm.c |  8 +++++++-
>   fs/proc/task_mmu.c       |  1 +
>   include/linux/mm.h       |  2 ++
>   kernel/fork.c            |  2 +-
>   mm/debug.c               |  1 +
>   mm/gup.c                 | 10 ++++++++--
>   mm/huge_memory.c         |  2 +-
>   mm/hugetlb.c             |  4 ++--
>   mm/mlock.c               |  2 +-
>   mm/mmap.c                |  2 +-
>   mm/rmap.c                |  4 ++--
>   11 files changed, 27 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_vm.c b/drivers/gpu/drm/drm_vm.c
> index aab49ee..103a5f6 100644
> --- a/drivers/gpu/drm/drm_vm.c
> +++ b/drivers/gpu/drm/drm_vm.c
> @@ -699,9 +699,15 @@ int drm_vma_info(struct seq_file *m, void *data)
>   		   (void *)(unsigned long)virt_to_phys(high_memory));
>
>   	list_for_each_entry(pt, &dev->vmalist, head) {
> +		char lock_flag = '-';
> +
>   		vma = pt->vma;
>   		if (!vma)
>   			continue;
> +		if (vma->vm_flags & VM_LOCKONFAULT)
> +			lock_flag = 'f';
> +		else if (vma->vm_flags & VM_LOCKED)
> +			lock_flag = 'l';
>   		seq_printf(m,
>   			   "\n%5d 0x%pK-0x%pK %c%c%c%c%c%c 0x%08lx000",
>   			   pt->pid,
> @@ -710,7 +716,7 @@ int drm_vma_info(struct seq_file *m, void *data)
>   			   vma->vm_flags & VM_WRITE ? 'w' : '-',
>   			   vma->vm_flags & VM_EXEC ? 'x' : '-',
>   			   vma->vm_flags & VM_MAYSHARE ? 's' : 'p',
> -			   vma->vm_flags & VM_LOCKED ? 'l' : '-',
> +			   lock_flag,
>   			   vma->vm_flags & VM_IO ? 'i' : '-',
>   			   vma->vm_pgoff);
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index ca1e091..38d69fc 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -579,6 +579,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)

This function has the following comment:

	Don't forget to update Documentation/ on changes.
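
Judging by the diffstat, the one-line addition to task_mmu.c is the new
entry in the mnemonics[] table, presumably something like this ("lf" is
only my guess at the two-letter code):

	[ilog2(VM_LOCKONFAULT)]	= "lf",

Whatever code was picked, the VmFlags legend in
Documentation/filesystems/proc.txt should get a matching line describing
it.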

[...]

> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -92,7 +92,7 @@ retry:
>   		 */
>   		mark_page_accessed(page);
>   	}
> -	if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) {
> +	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
>   		/*
>   		 * The preliminary mapping check is mainly to avoid the
>   		 * pointless overhead of lock_page on the ZERO_PAGE
> @@ -265,6 +265,9 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>   	unsigned int fault_flags = 0;
>   	int ret;
>
> +	/* mlock all present pages, but do not fault in new pages */
> +	if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
> +		return -ENOENT;
>   	/* For mm_populate(), just skip the stack guard page. */
>   	if ((*flags & FOLL_POPULATE) &&
>   			(stack_guard_page_start(vma, address) ||
> @@ -850,7 +853,10 @@ long populate_vma_page_range(struct vm_area_struct *vma,
>   	VM_BUG_ON_VMA(end   > vma->vm_end, vma);
>   	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
>
> -	gup_flags = FOLL_TOUCH | FOLL_POPULATE;
> +	gup_flags = FOLL_TOUCH | FOLL_MLOCK;
> +	if ((vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == VM_LOCKED)
> +		gup_flags |= FOLL_POPULATE;
> +
>   	/*
>   	 * We want to touch writable mappings with a write fault in order
>   	 * to break COW, except for shared mappings because these don't COW

I think this might break the populate part of an mmap(MAP_POPULATE) call
without MAP_LOCKED, if I follow the execution correctly (it's far from
simple...):

SYSCALL_DEFINE6(mmap_pgoff... with MAP_POPULATE
   vm_mmap_pgoff(..., MAP_POPULATE...)
     do_mmap_pgoff(...MAP_POPULATE... &populate) -> populate == TRUE
     mm_populate()
       __mm_populate()
         populate_vma_page_range()

Previously, this path would have FOLL_POPULATE in gup_flags and continue
through __get_user_pages() into faultin_page(), which would fault the
pages in (in fact regardless of FOLL_POPULATE).

After your patch, populate_vma_page_range() will set FOLL_MLOCK, but 
since VM_LOCKED is not set, FOLL_POPULATE won't be set either.
Then faultin_page() will return -ENOENT on the new check:

	(*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK
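
If that reading is correct, one possible (untested) fix would be to keep
FOLL_POPULATE as the default and clear it only for the lock-on-fault
case:

	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
	if (vma->vm_flags & VM_LOCKONFAULT)
		gup_flags &= ~FOLL_POPULATE;

That way plain MAP_POPULATE still faults everything in, VM_LOCKED still
populates and mlocks, and only the VM_LOCKONFAULT case hits the new
-ENOENT return in faultin_page().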



