[PATCH v3 2/8] mm: remove extra ZONE_DEVICE struct page refcount
Felix Kuehling
felix.kuehling at amd.com
Mon Jun 28 16:46:59 UTC 2021
On 2021-06-17 at 3:16 p.m., Ralph Campbell wrote:
>
> On 6/17/21 8:16 AM, Alex Sierra wrote:
>> From: Ralph Campbell <rcampbell at nvidia.com>
>>
>> ZONE_DEVICE struct pages have an extra reference count that complicates
>> the code for put_page() and several places in the kernel that need to
>> check the reference count to see that a page is not being used (gup,
>> compaction, migration, etc.). Clean up the code so the reference count
>> doesn't need to be treated specially for ZONE_DEVICE.
>>
>> v2:
>> AS: merged this patch in linux 5.11 version
>>
>> Signed-off-by: Ralph Campbell <rcampbell at nvidia.com>
>> Signed-off-by: Alex Sierra <alex.sierra at amd.com>
>> ---
>> arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
>> drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
>> fs/dax.c | 4 +-
>> include/linux/dax.h | 2 +-
>> include/linux/memremap.h | 7 +--
>> include/linux/mm.h | 44 -----------------
>> lib/test_hmm.c | 2 +-
>> mm/internal.h | 8 +++
>> mm/memremap.c | 68 +++++++-------------------
>> mm/migrate.c | 5 --
>> mm/page_alloc.c | 3 ++
>> mm/swap.c | 45 ++---------------
>> 12 files changed, 45 insertions(+), 147 deletions(-)
>>
> I think it is great that you are picking this up and trying to revive it.
>
> However, I have a number of concerns about how it affects existing
> ZONE_DEVICE MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users, and I
> don't see this addressing them. For example, dev_dax_probe() allocates
> MEMORY_DEVICE_GENERIC struct pages and then:
>
>   dev_dax_fault()
>     dev_dax_huge_fault()
>       __dev_dax_pte_fault()
>         vmf_insert_mixed()
>
> which just inserts the PFN into the CPU page tables without increasing
> the page refcount, so it is zero (whereas it was one before). But using
> get_page() will trigger VM_BUG_ON_PAGE() if it is enabled. There isn't
> any current notion of free versus allocated for these struct pages. I
> suppose init_page_count() could be called on all the struct pages in
> dev_dax_probe() to fix that, though.
Hi Ralph,

For DEVICE_GENERIC pages, free_zone_device_page doesn't do anything, so
I'm not sure what the reference counting is good for in this case. Alex
is going to add free_zone_device_page support for DEVICE_GENERIC pages
(patch 8 of this series). However, even then, it only does anything if
there is an actual call to put_page. Where would that call come from in
the dev_dax driver case?
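Just to spell out the dependency: once the extra reference is gone,
free_zone_device_page can only ever be reached when a put_page drops
the refcount to zero. Roughly like this (a sketch of the shape, not the
actual diff in this series; put_zone_device_page is a made-up helper
name):

    /* Sketch only: the final put on a ZONE_DEVICE page frees it. */
    static inline void put_zone_device_page(struct page *page)
    {
            /* No extra reference any more: zero really means free. */
            if (page_ref_dec_and_test(page))
                    free_zone_device_page(page);
    }

If vmf_insert_mixed never takes a reference in the first place, there
is nothing that would ever issue the corresponding put.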
>
> I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File systems
> have clear allocate and free states for backing storage, but there are
> complications with the page cache references, etc. to consider. The
> ">1 to 1" reference count seems to be used to tell when a page is idle
> (no I/O, reclaim scanners) rather than free (not allocated to any
> file), but I'm not 100% sure about that since I don't really understand
> all the issues around why a file system needs to have a DAX mount
> option besides knowing that the storage block size has to be a multiple
> of the page size.
The only thing that happens in free_zone_device_page for FS_DAX pages is
wake_up_var(&page->_refcount). I guess whoever is waiting for this
wake-up will need to be prepared to see a refcount of 0 instead of 1
now. I see these callers that would need to be updated:
./fs/ext4/inode.c:      error = ___wait_var_event(&page->_refcount,
./fs/ext4/inode.c-                      atomic_read(&page->_refcount) == 1,
./fs/ext4/inode.c-                      TASK_INTERRUPTIBLE, 0, 0,
./fs/ext4/inode.c-                      ext4_wait_dax_page(ei));
--
./fs/fuse/dax.c:        return ___wait_var_event(&page->_refcount,
./fs/fuse/dax.c-                        atomic_read(&page->_refcount) == 1,
./fs/fuse/dax.c-                        TASK_INTERRUPTIBLE,
./fs/fuse/dax.c-                        0, 0, fuse_wait_dax_page(inode));
--
./fs/xfs/xfs_file.c:    return ___wait_var_event(&page->_refcount,
./fs/xfs/xfs_file.c-                    atomic_read(&page->_refcount) == 1,
./fs/xfs/xfs_file.c-                    TASK_INTERRUPTIBLE,
./fs/xfs/xfs_file.c-                    0, 0, xfs_wait_dax_page(inode));
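Presumably the wait condition would then flip from == 1 to == 0, e.g.
for the ext4 caller (just a sketch, assuming an idle DAX page now sits
at refcount 0 rather than 1):

    error = ___wait_var_event(&page->_refcount,
                    atomic_read(&page->_refcount) == 0,
                    TASK_INTERRUPTIBLE, 0, 0,
                    ext4_wait_dax_page(ei));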
Regarding your page-cache comment, doesn't DAX mean that those file
pages are not in the page cache?
Regards,
Felix