Implement svm without BO concept in xe driver

Mon Aug 21 19:41:19 UTC 2023

> -----Original Message-----
> From: dri-devel <dri-devel-bounces at lists.freedesktop.org> On Behalf Of Felix
> Kuehling
> Sent: August 21, 2023 3:18 PM
> To: Zeng, Oak <oak.zeng at intel.com>; Dave Airlie <airlied at gmail.com>
> Cc: Brost, Matthew <matthew.brost at intel.com>; Thomas Hellström
> <thomas.hellstrom at linux.intel.com>; Philip Yang <Philip.Yang at amd.com>;
> Welty, Brian <brian.welty at intel.com>; dri-devel at lists.freedesktop.org; Christian
> König <christian.koenig at amd.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura at intel.com>; intel-xe at lists.freedesktop.org
> Subject: Re: Implement svm without BO concept in xe driver
> 
> 
> On 2023-08-21 11:10, Zeng, Oak wrote:
> > Accidentally deleted Brian. Adding him back.
> >
> > Thanks,
> > Oak
> >
> >> -----Original Message-----
> >> From: Zeng, Oak
> >> Sent: August 21, 2023 11:07 AM
> >> To: Dave Airlie <airlied at gmail.com>
> >> Cc: Brost, Matthew <matthew.brost at intel.com>; Thomas Hellström
> >> <thomas.hellstrom at linux.intel.com>; Philip Yang <Philip.Yang at amd.com>;
> Felix
> >> Kuehling <felix.kuehling at amd.com>; dri-devel at lists.freedesktop.org; intel-
> >> xe at lists.freedesktop.org; Vishwanathapura, Niranjana
> >> <niranjana.vishwanathapura at intel.com>; Christian König
> >> <christian.koenig at amd.com>
> >> Subject: RE: Implement svm without BO concept in xe driver
> >>
> >>> -----Original Message-----
> >>> From: dri-devel <dri-devel-bounces at lists.freedesktop.org> On Behalf Of
> Dave
> >>> Airlie
> >>> Sent: August 20, 2023 6:21 PM
> >>> To: Zeng, Oak <oak.zeng at intel.com>
> >>> Cc: Brost, Matthew <matthew.brost at intel.com>; Thomas Hellström
> >>> <thomas.hellstrom at linux.intel.com>; Philip Yang <Philip.Yang at amd.com>;
> >> Felix
> >>> Kuehling <felix.kuehling at amd.com>; Welty, Brian <brian.welty at intel.com>;
> >> dri-
> >>> devel at lists.freedesktop.org; intel-xe at lists.freedesktop.org;
> Vishwanathapura,
> >>> Niranjana <niranjana.vishwanathapura at intel.com>; Christian König
> >>> <christian.koenig at amd.com>
> >>> Subject: Re: Implement svm without BO concept in xe driver
> >>>
> >>> On Thu, 17 Aug 2023 at 12:13, Zeng, Oak <oak.zeng at intel.com> wrote:
> >>>>> -----Original Message-----
> >>>>> From: Dave Airlie <airlied at gmail.com>
> >>>>> Sent: August 16, 2023 6:52 PM
> >>>>> To: Felix Kuehling <felix.kuehling at amd.com>
> >>>>> Cc: Zeng, Oak <oak.zeng at intel.com>; Christian König
> >>>>> <christian.koenig at amd.com>; Thomas Hellström
> >>>>> <thomas.hellstrom at linux.intel.com>; Brost, Matthew
> >>>>> <matthew.brost at intel.com>; maarten.lankhorst at linux.intel.com;
> >>>>> Vishwanathapura, Niranjana <niranjana.vishwanathapura at intel.com>;
> >> Welty,
> >>>>> Brian <brian.welty at intel.com>; Philip Yang <Philip.Yang at amd.com>;
> intel-
> >>>>> xe at lists.freedesktop.org; dri-devel at lists.freedesktop.org
> >>>>> Subject: Re: Implement svm without BO concept in xe driver
> >>>>>
> >>>>> On Thu, 17 Aug 2023 at 08:15, Felix Kuehling <felix.kuehling at amd.com>
> >>> wrote:
> >>>>>> On 2023-08-16 13:30, Zeng, Oak wrote:
> >>>>>>> I spoke with Thomas. We discussed two approaches:
> >>>>>>>
> >>>>>>> 1) make ttm_resource a central place for vram management functions such as
> >>>>>>> eviction and cgroup memory accounting. Both the BO-based driver and the BO-less
> >>>>>>> SVM code call into the ttm_resource_alloc/free functions for vram allocation/free.
> >>>>>>>       *This way the BO driver and the SVM driver share the eviction/cgroup logic;
> >>>>>>> there is no need to reimplement the LRU eviction list in the SVM driver. Cgroup
> >>>>>>> logic should be in the ttm_resource layer. +Maarten.
> >>>>>>>       *ttm_resource is not a perfect match for SVM to allocate vram. It is still
> >>>>>>> a big overhead. The *bo* member of ttm_resource is not needed for SVM; this
> >>>>>>> might end up with invasive changes to ttm... need to look into more details
> >>>>>> Overhead is a problem. We'd want to be able to allocate, free and evict
> >>>>>> memory at a similar granularity as our preferred migration and page
> >>>>>> fault granularity, which defaults to 2MB in our SVM implementation.
> >>>>>>
> >>>>>>
> >>>>>>> 2) svm code allocates memory directly from the drm-buddy allocator, and we expose
> >>>>>>> memory eviction functions from both ttm and svm so they can evict memory
> >>>>>>> from each other. For example, expose the ttm_mem_evict_first function from the
> >>>>>>> ttm side so hmm/svm code can call it; expose a similar function from the svm
> >>>>>>> side so ttm can evict hmm memory.
> >>>>>> I like this option. One thing that needs some thought with this is how
> >>>>>> to get some semblance of fairness between the two types of clients.
> >>>>>> Basically how to choose what to evict. And what share of the available
> >>>>>> memory does each side get to use on average. E.g. an idle client may get
> >>>>>> all its memory evicted while a busy client may get a bigger share of the
> >>>>>> available memory.
> >>>>> I'd also like to suggest we try to write any management/generic code
> >>>>> in driver agnostic way as much as possible here. I don't really see
> >>>>> much hw difference should be influencing it.
> >>>>>
> >>>>> I do worry about having effectively 2 LRUs here, you can't really have
> >>>>> two "leasts".
> >>>>>
> >>>>> Like if we hit the shrinker paths who goes first? do we shrink one
> >>>>> object from each side in turn?
> >>>> One way to solve this fairness problem is to create a driver-agnostic
> >>>> drm_vram_mgr. Maintain a single LRU in drm_vram_mgr. Move the memory
> >>>> eviction/cgroup memory accounting logic from the ttm_resource manager to
> >>>> drm_vram_mgr. Both the BO-based driver and the SVM driver call into
> >>>> drm_vram_mgr to allocate/free memory.
> >>>> I am not sure whether this meets the 2M allocate/free/evict granularity
> >>>> requirement Felix mentioned above. SVM can allocate 2M-sized blocks, but the BO
> >>>> driver should be able to allocate arbitrarily sized blocks, so eviction is
> >>>> also of arbitrary size.
> >>>>> Also will we have systems where we can expose system SVM but userspace
> >>>>> may choose to not use the fine grained SVM and use one of the older
> >>>>> modes, will that path get emulated on top of SVM or use the BO paths?
> >>>> If by "older modes" you meant gem_bo_create (such as xe_gem_create or
> >>>> amdgpu_gem_create), then today both amd and intel implement those
> >>>> interfaces using the BO path. We don't have a plan to emulate that old mode on
> >>>> top of SVM, afaict.
> >>>
> >>> I'm not sure how the older modes manifest in the kernel I assume as bo
> >>> creates (but they may use userptr), SVM isn't a specific thing, it's a
> >>> group of 3 things.
> >>>
> >>> 1) coarse-grained SVM which I think is BO
> >>> 2) fine-grained SVM which is page level
> >>> 3) fine-grained system SVM which is HMM
> >>>
> >>> I suppose I'm asking about the previous versions and how they would
> >>> operate in a system SVM capable system.
> >> I got your question now.
> >>
> >> As I understand it, system SVM provides functionality similar to BO-based
> >> SVM (i.e., sharing a virtual address space b/t cpu and gpu programs, no explicit
> >> memory placement for the gpu program), but they have different user interfaces
> >> (malloc, mmap vs bo create, vm bind).
> >>
> >>  From a functionality perspective, on a system SVM capable system, we don't need
> >> #1/#2. Once #3 is implemented and turns out to be as performant as #1/#2, we
> >> can ask user space to switch to #3.
> >>
> >> As far as I know, AMD doesn't have #1/#2 - their BO-based driver *requires* all
> >> valid GPU virtual addresses to be mapped in the GPU page table *before* GPU kernel
> >> submission, i.e., a GPU page fault is treated as fatal. Felix, please correct me, as my
> >> AMD knowledge is fading away...
> 
> This is correct. The AMD driver can only handle recoverable page faults
> for virtual address ranges managed by system SVM.
> 
> That said, if you want performance parity with #1/#2, you basically need
> to pre-fault things even for #3. Device page faults just add too much
> latency.
> 
> 
> >>
> >>  From an interface perspective, i.e., to let UMDs that use #1/#2 continue to run
> >> without modification, we need #1/#2 to continue to exist.
> 
> Agreed.
> 
> 
> >>
> >> Should we emulate #1/#2 on top of #3? I feel that BO-based memory
> >> management and struct page/hmm-based memory management follow quite
> >> different design philosophies. Trying to emulate one on top of the other can run into
> >> serious difficulty. For example, how do we emulate a vm_bind on top of #3?
> >> Remember that for #1/#2 the virtual address space is managed by user space, while for #3
> >> the virtual address space is managed by the kernel core mm (vma struct...). It is a hard
> >> conflict here...
> 
> I have thought about emulating BO allocation APIs on top of system SVM.
> This was in the context of KFD where memory management is not tied into
> command submissions APIs, which would add a whole other layer of
> complexity. The main unsolved (unsolvable?) problem I ran into was that
> there is no way to share SVM memory as DMABufs. So there is no good way
> to support applications that expect to share memory in that way.

Great point. I also discussed the dmabuf issue with Mike (cc'ed). dmabuf is a mechanism created specifically so that BO-based drivers (and other drivers) can share buffers between devices. HMM/system SVM doesn't need it: malloc'ed memory is by nature already shared between the CPU and all devices in a process. We can simply submit GPU kernels that use malloc'ed memory to any device and let the KMD decide the memory placement (for example, map in place or migrate). There is no need for buffer export/import in the HMM/system SVM world.

So yes, from a buffer-sharing perspective, the design philosophies are also very different.
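
To make the contrast concrete, here is a minimal user-space sketch in C. It is only a model: struct fake_dev, fake_svm_submit() and svm_share_demo() are hypothetical names for illustration, not xe, amdgpu or KFD interfaces, and the "kernel" is just a CPU loop. What it shows is that in the system SVM model every device is handed the same malloc'ed pointer, with no export/import step in between.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of a device; not a real xe or amdgpu structure. */
struct fake_dev {
    int id;
    uint64_t sum;   /* result of the last "kernel" run on this device */
};

/*
 * Hypothetical stand-in for a GPU kernel submission in the system SVM
 * model: the device receives the same CPU virtual address the
 * application uses. No buffer handle is exported or imported.
 */
static void fake_svm_submit(struct fake_dev *dev, const uint32_t *ptr, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += ptr[i];
    dev->sum = sum;
}

/*
 * Fill one malloc'ed buffer with 0..n-1 and run it on two devices.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int svm_share_demo(struct fake_dev *a, struct fake_dev *b, size_t n)
{
    uint32_t *buf = malloc(n * sizeof(*buf));

    if (!buf)
        return -1;
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint32_t)i;

    /* Same pointer goes to both devices; placement is the KMD's job. */
    fake_svm_submit(a, buf, n);
    fake_svm_submit(b, buf, n);

    free(buf);
    return 0;
}
```

With n = 4 the buffer holds 0, 1, 2, 3, so both devices end up with sum == 6. In a BO-based flow the second device would first need an imported handle (e.g. a dmabuf fd) before it could touch the buffer.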

Thanks,
Oak

> 
> You may be able to emulate that with file-backed memory (e.g. POSIX
> IPC), but it adds overhead at allocation time and HMM migration
> currently doesn't work with file-backed memory.
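
For reference, the file-backed approach could look roughly like the sketch below. It is an illustration under assumptions, not a proposal: it uses a plain temporary file with mkstemp() and mmap(MAP_SHARED) rather than POSIX shm, file_backed_share_demo() is a made-up name, and both mappings live in one process, whereas a real user would pass the fd to the importing side. As noted above, HMM migration currently doesn't work with such memory.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map one temporary file twice and check that a write through the first
 * mapping is visible through the second. Returns 0 on success, -1 on error.
 */
static int file_backed_share_demo(void)
{
    char path[] = "/tmp/svm-share-XXXXXX";
    const size_t len = 4096;
    char *producer, *consumer;
    int ret = -1;
    int fd;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);               /* the open fd keeps the file alive */

    if (ftruncate(fd, (off_t)len) < 0)
        goto out_close;

    /* "Exporter" view of the file-backed memory. */
    producer = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (producer == MAP_FAILED)
        goto out_close;

    /* "Importer" view: a second, read-only mapping of the same file. */
    consumer = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (consumer == MAP_FAILED)
        goto out_unmap_producer;

    strcpy(producer, "hello");
    if (strcmp(consumer, "hello") == 0)
        ret = 0;

    munmap(consumer, len);
out_unmap_producer:
    munmap(producer, len);
out_close:
    close(fd);
    return ret;
}
```

The extra mkstemp/ftruncate/mmap calls per buffer are the kind of allocation-time overhead mentioned above.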
> 
> Regards,
>    Felix
> 
> 
> >>
> >> Thanks again for the great question!
> >> Oak
> >>
> >>> Dave.
> >>>> Thanks,
> >>>> Oak
> >>>>
> >>>>> Dave.