[v2 22/31] drm/xe/svm: implement functions to allocate and free device memory

Matthew Brost matthew.brost at intel.com
Mon Apr 15 21:19:59 UTC 2024


On Mon, Apr 15, 2024 at 02:13:55PM -0600, Zeng, Oak wrote:
> Hi Matt,
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost at intel.com>
> > Sent: Wednesday, April 10, 2024 6:24 PM
> > To: Zeng, Oak <oak.zeng at intel.com>
> > Cc: intel-xe at lists.freedesktop.org; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray at intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu at intel.com>; Thomas.Hellstrom at linux.intel.com; Welty,
> > Brian <brian.welty at intel.com>
> > Subject: Re: [v2 22/31] drm/xe/svm: implement functions to allocate and free
> > device memory
> > 
> > On Tue, Apr 09, 2024 at 04:17:33PM -0400, Oak Zeng wrote:
> > > Function xe_devm_alloc_pages allocates pages from drm buddy and performs
> > > housekeeping work for all the pages allocated, such as getting a page
> > > refcount, keeping a bitmap of all pages to denote whether a page is in
> > > use, and putting pages on a drm lru list for eviction purposes.
> > >
> > > Function xe_devm_free_blocks returns a list of memory blocks to the drm
> > > buddy allocator.
> > >
> > > Function xe_devm_page_free is a callback function from the hmm layer. It
> > > is called whenever a page's refcount reaches 1. This function clears
> > > this page's bit in the bitmap. When all the bits in the bitmap are
> > > cleared, it means all the pages have been freed, and we return all the
> > > pages in this memory block back to drm buddy.
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng at intel.com>
> > > Co-developed-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura at intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura at intel.com>
> > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom at intel.com>
> > > Cc: Brian Welty <brian.welty at intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_svm.h        |   7 ++
> > >  drivers/gpu/drm/xe/xe_svm_devmem.c | 147 ++++++++++++++++++++++++++++-
> > 
> > See comments about file in previous patches, they apply here too.
> > 
> > >  2 files changed, 152 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > index 624c1581f8ba..92a3ee90d5a7 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -46,4 +46,11 @@ static inline struct xe_mem_region *xe_page_to_mem_region(struct page *page)
> > >  	return container_of(page->pgmap, struct xe_mem_region, pagemap);
> > >  }
> > >
> > > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > > +						unsigned long npages,
> > > +						struct list_head *blocks,
> > > +						unsigned long *pfn);
> > > +
> > > +void xe_devm_free_blocks(struct list_head *blocks);
> > > +void xe_devm_page_free(struct page *page);
> > >  #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > index 31af56e8285a..5ba0cd9a70b0 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > @@ -5,18 +5,161 @@
> > >
> > >  #include <linux/mm_types.h>
> > >  #include <linux/sched/mm.h>
> > > -
> > > +#include <linux/gfp.h>
> > > +#include <linux/migrate.h>
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/dma-fence.h>
> > > +#include <linux/bitops.h>
> > > +#include <linux/bitmap.h>
> > > +#include <drm/drm_buddy.h>
> > >  #include "xe_device_types.h"
> > >  #include "xe_svm.h"
> > > +#include "xe_migrate.h"
> > > +#include "xe_ttm_vram_mgr_types.h"
> > > +#include "xe_assert.h"
> > >
> > > +/**
> > > + * struct xe_svm_block_meta - svm uses this data structure to manage each
> > > + * block allocated from drm buddy. This will be set to the drm_buddy_block's
> > > + * private field.
> > > + *
> > > + * @lru: used to link this block to drm's lru lists. This will be replaced
> > > + * with struct drm_lru_entity later.
> > > + * @tile: tile from which we allocated this block
> > > + * @bitmap: A bitmap of each page in this block. 1 means this page is used,
> > > + * 0 means this page is idle. When all bits of this block are 0, it is time
> > > + * to return this block to drm buddy subsystem.
> > > + */
> > > +struct xe_svm_block_meta {
> > > +	struct list_head lru;
> > > +	struct xe_tile *tile;
> > > +	unsigned long bitmap[];
> > > +};
> > 
> > This looks unnecessary to me, but admittedly I haven't looked at the LRU stuff.
> > 
> > I am thinking roughly...
> > 
> > - I think we drop all this special tracking (kill xe_svm_block_meta)
> > - Have functions to allocate / free the buddy blocks, store buddy blocks in
> >   userptr
> > - Blocks are allocated before migration to VRAM
> > - Blocks can be freed on either CPU fault after migration or on VMA
> >   destroy (probably depends on madvise hints for the VMA where we free
> >   blocks)
> > - Blocks allocated / freed at a chunk (xe_vma in this code) granularity
> >   (conceptually the same if we switch to 1 to N ratio between xe_vma &
> >   pt_state)
> > - block->private == memory region so we can get pfn from block
> > - When we need migrate_pfns we loop over buddy blocks populating
> >   migrate.dst
> 
> I thought about your scheme. The freeing of device memory is not completely controlled by the driver. Core mm can call back into the driver to free a device memory page. The xe_devm_page_free in this series is a callback function registered with core mm. This is why I have to have a bitmap in the above data structure. The bitmap is used to mark which pages are freed; when all pages in a buddy block are freed, it is time to free the whole buddy block.
>

Certainly in this scenario we'd also get an mmu notifier with
MMU_NOTIFY_UNMAP when pages are being freed too, right? The notifier
puts the VMA in the garbage collector and the blocks are freed when the
VMA is destroyed.
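
To make that concrete, here is a minimal, hypothetical sketch (not from this
series): an interval-notifier callback that only queues the VMA for the
garbage collector on MMU_NOTIFY_UNMAP; struct xe_svm_vma, gc_link and
xe_svm_gc_queue() are made-up names used purely for illustration.

#include <linux/mmu_notifier.h>
#include <linux/list.h>

/* Hypothetical SVM VMA wrapper; for illustration only. */
struct xe_svm_vma {
	struct mmu_interval_notifier notifier;
	struct list_head gc_link;	/* link into the garbage collector */
	struct list_head blocks;	/* drm buddy blocks backing this VMA */
};

void xe_svm_gc_queue(struct xe_svm_vma *svma);	/* hypothetical helper */

static bool xe_svm_notifier_invalidate(struct mmu_interval_notifier *mni,
				       const struct mmu_notifier_range *range,
				       unsigned long cur_seq)
{
	struct xe_svm_vma *svma = container_of(mni, struct xe_svm_vma, notifier);

	mmu_interval_set_seq(mni, cur_seq);

	/*
	 * On unmap, only queue the VMA; its buddy blocks are returned to
	 * drm buddy when the garbage collector destroys the VMA, not here.
	 */
	if (range->event == MMU_NOTIFY_UNMAP)
		xe_svm_gc_queue(svma);

	return true;
}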

The garbage collector only supports complete unmaps at the moment, but if
we need to support partial unmaps (we likely do, as users can munmap
partial buffers) we can, with the garbage collector transferring ownership
of blocks from one VMA to another as needed.

It is possible I don't fully understand the ref counting scheme for pages
either and we will need to implement dev_pagemap_ops.page_free
(seems likely now that I am typing) rather than the notifier scheme
described above...

If we need to do this, then roughly...

- page->zone_device_data is still the Xe chunk (xe_vma currently)
- Ref count device pages in the Xe chunk (or perhaps the individual block?
  need to think about this more, but certainly a bitmap is overkill)
- page_free decrements the ref count
- When ref count goes to zero, free blocks

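A minimal sketch of that ref-count-per-chunk idea, assuming a hypothetical
xe_svm_chunk that owns the buddy blocks; only dev_pagemap_ops.page_free,
page->zone_device_data and the drm_buddy calls are existing APIs, and the
locking around the buddy calls (discussed further below) is omitted:

#include <linux/memremap.h>
#include <linux/mm.h>
#include <linux/atomic.h>
#include <linux/slab.h>
#include <drm/drm_buddy.h>

/* Hypothetical per-chunk state; one instance per xe_vma-sized chunk. */
struct xe_svm_chunk {
	struct drm_buddy *mm;		/* tile's buddy allocator */
	struct list_head blocks;	/* buddy blocks backing this chunk */
	atomic_t page_ref;		/* live device pages in this chunk */
};

static void xe_svm_chunk_page_free(struct page *page)
{
	struct xe_svm_chunk *chunk = page->zone_device_data;
	struct drm_buddy_block *block, *tmp;

	/* Last device page gone: return the whole chunk to drm buddy. */
	if (atomic_dec_and_test(&chunk->page_ref)) {
		list_for_each_entry_safe(block, tmp, &chunk->blocks, link)
			drm_buddy_free_block(chunk->mm, block);
		kfree(chunk);
	}
}

static const struct dev_pagemap_ops xe_svm_chunk_pagemap_ops = {
	.page_free = xe_svm_chunk_page_free,
};

The allocation side would then set page->zone_device_data to the chunk and
bump page_ref once per page, mirroring what xe_devm_alloc_pages does with
the block today.
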
Again this seems to align with Nouveau and AMD (haven't checked Nvidia's
open source driver) and aligns with the design in Xe of everything being
at chunk granularity (e.g. no partial migrations within a chunk).

I guess we will need a test that does partial unmaps to figure out all
of these details...

> In your scheme, we allocate/free at xe_vma granularity. So I imagine you would have a list of buddy blocks in userptr, and free all blocks in the list when every page in all blocks is freed. My scheme frees memory at the buddy block granularity - I think it is natural because the buddy free interface is also block based.
> 
> You would eventually need to introduce an lru link to link each buddy block to an lru list when vram eviction comes into the picture.
> 
> So I just explained why the above xe_svm_block_meta was introduced, and why the bitmap and lru fields are necessary to me. If you drop this data structure, they will have to show up in another way.
>

LRU will likely be at chunk granularity too (i.e. at the xe_vma, not at
the block level).

Also, in most cases xe_vma == 1 block if the buddy allocator is doing its
job, so there is no reason to optimize at the block level here.
 
> > 
> > Also I noticed the drm_buddy_* calls in this file are not protected by a
> > lock; we will need that. Currently it is tile->mem.vram_mgr->lock in the
> > VRAM mgr code, so we either need to reach into there or move this lock to
> > a common place so the VRAM manager and block allocations for SVM don't
> > race with each other.
> > 
> 
> Yes, the lock has to be added. Thanks for pointing this out. Maybe move tile->mem.vram_mgr->lock to the xe_tile level so it can be shared between the BO driver and the system allocator?
>

Yea, tile->mem.vram_lock might be a better location.
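
Just to illustrate the idea (tile->mem.vram_lock does not exist yet, it is
only the suggestion above, and the helper name is made up): the VRAM manager
and SVM allocations would then take the same tile-level lock around every
drm_buddy_* call, e.g.:

#include <linux/mutex.h>
#include <linux/mm.h>
#include <drm/drm_buddy.h>
#include "xe_device_types.h"

static int xe_svm_alloc_blocks_locked(struct xe_tile *tile, u64 size,
				      struct list_head *blocks)
{
	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
	int err;

	/* Hypothetical tile-level lock shared with the VRAM manager. */
	mutex_lock(&tile->mem.vram_lock);
	err = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
				     blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);
	mutex_unlock(&tile->mem.vram_lock);

	return err;
}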

Matt
 
> Oak
> 
> > Matt
> > 
> > >
> > >  static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > >  {
> > >  	return 0;
> > >  }
> > >
> > > -static void xe_devm_page_free(struct page *page)
> > > +static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > > +{
> > > +	/** DRM buddy's block offset is 0-based*/
> > > +	offset += mr->hpa_base;
> > > +
> > > +	return PHYS_PFN(offset);
> > > +}
> > > +
> > > +/** FIXME: we locked page by calling zone_device_page_init
> > > + *  in xe_devm_alloc_pages. Should we unlock pages here?
> > > + */
> > > +static void free_block(struct drm_buddy_block *block)
> > > +{
> > > +	struct xe_svm_block_meta *meta =
> > > +		(struct xe_svm_block_meta *)block->private;
> > > +	struct xe_tile *tile  = meta->tile;
> > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > +
> > > +	kfree(block->private);
> > > +	drm_buddy_free_block(mm, block);
> > > +}
> > > +
> > > +void xe_devm_page_free(struct page *page)
> > > +{
> > > +	struct drm_buddy_block *block =
> > > +					(struct drm_buddy_block *)page->zone_device_data;
> > > +	struct xe_svm_block_meta *meta =
> > > +					(struct xe_svm_block_meta *)block->private;
> > > +	struct xe_tile *tile  = meta->tile;
> > > +	struct xe_mem_region *mr = &tile->mem.vram;
> > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > +	u64 size = drm_buddy_block_size(mm, block);
> > > +	u64 pages_per_block = size >> PAGE_SHIFT;
> > > +	u64 block_pfn_first =
> > > +					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> > > +	u64 page_pfn = page_to_pfn(page);
> > > +	u64 i = page_pfn - block_pfn_first;
> > > +
> > > +	xe_assert(tile->xe, i < pages_per_block);
> > > +	clear_bit(i, meta->bitmap);
> > > +	if (bitmap_empty(meta->bitmap, pages_per_block))
> > > +		free_block(block);
> > > +}
> > > +
> > > +/**
> > > + * xe_devm_alloc_pages() - allocate device pages from buddy allocator
> > > + *
> > > + * @xe_tile: which tile to allocate device memory from
> > > + * @npages: how many pages to allocate
> > > + * @blocks: used to return the allocated blocks
> > > + * @pfn: used to return the pfn of all allocated pages. Must be big enough
> > > + * to hold at least @npages entries.
> > > + *
> > > + * This function allocates blocks of memory from the drm buddy allocator, and
> > > + * performs initialization work: set struct page::zone_device_data to point
> > > + * to the memory block; set/initialize the drm_buddy_block::private field;
> > > + * lock_page for each page allocated; add the memory block to the lru manager's
> > > + * lru list - this is TBD.
> > > + *
> > > + * return: 0 on success
> > > + * error code otherwise
> > > + */
> > > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > > +						unsigned long npages,
> > > +						struct list_head *blocks,
> > > +						unsigned long *pfn)
> > > +{
> > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > +	struct drm_buddy_block *block, *tmp;
> > > +	u64 size = npages << PAGE_SHIFT;
> > > +	int ret = 0, i, j = 0;
> > > +
> > > +	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
> > > +						blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);
> > > +
> > > +	if (unlikely(ret))
> > > +		return ret;
> > > +
> > > +	list_for_each_entry_safe(block, tmp, blocks, link) {
> > > +		struct xe_mem_region *mr = &tile->mem.vram;
> > > +		u64 block_pfn_first, pages_per_block;
> > > +		struct xe_svm_block_meta *meta;
> > > +		u32 meta_size;
> > > +
> > > +		size = drm_buddy_block_size(mm, block);
> > > +		pages_per_block = size >> PAGE_SHIFT;
> > > +		meta_size = BITS_TO_BYTES(pages_per_block) +
> > > +					sizeof(struct xe_svm_block_meta);
> > > +		meta = kzalloc(meta_size, GFP_KERNEL);
> > > +		bitmap_fill(meta->bitmap, pages_per_block);
> > > +		meta->tile = tile;
> > > +		block->private = meta;
> > > +		block_pfn_first =
> > > +					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> > > +		for(i = 0; i < pages_per_block; i++) {
> > > +			struct page *page;
> > > +
> > > +			pfn[j++] = block_pfn_first + i;
> > > +			page = pfn_to_page(block_pfn_first + i);
> > > +			/**Lock page per hmm requirement, see hmm.rst.*/
> > > +			zone_device_page_init(page);
> > > +			page->zone_device_data = block;
> > > +		}
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +/**
> > > + * xe_devm_free_blocks() - free all memory blocks
> > > + *
> > > + * @blocks: memory blocks list head
> > > + */
> > > +void xe_devm_free_blocks(struct list_head *blocks)
> > >  {
> > > +	struct drm_buddy_block *block, *tmp;
> > > +
> > > +	list_for_each_entry_safe(block, tmp, blocks, link)
> > > +		free_block(block);
> > >  }
> > >
> > >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > --
> > > 2.26.3
> > >

