[RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory

Fri Aug 30 09:57:33 UTC 2024

Hi, Matthew,

Agreed the below might not be important just now, but some ideas:

On Thu, 2024-08-29 at 20:56 +0000, Matthew Brost wrote:
> Issues with removing a SVM range:
> 
> - Xe bind code stores invalidation / present state in VMA, this would
>   need to be moved to the radix tree. I have Jira open for that work
>   which I believe other developers are going to own.

Yeah, although we shouldn't *design* around xe bind-code and page-table
code shortcomings.

> - Where would the dma mapping / device pages be stored?
> 	- In the radix tree? What if ATS is enabled? We don't have a
> 	  driver owned radix tree. How do we reasonably connect a driver
> 	  owned radix to a common GPUSVM layer?

With ATS you mean IOMMU SVA, right? I think we could assume that any
user of this code also has a gpu page-table since otherwise they
couldn't be using VRAM and a simpler solution would be in place. 

But to that specific question, drm_gpusvm state would live in a
drm_gpusvm radix tree and driver-specific state in the driver tree. A
helper-based approach would then call drm_gpusvm_unmap_dma(range),
whereas a middle layer would just traverse the tree and unmap.

> 	- In the notifier? What is the notifier is sparsely populated?
> 	  We would be wasting huge amounts of memory. What is the
> 	  notifier is configured to span the entire virtual address
> 	  space?

If you use a fake page-table like the one in xe_pt_walk.c as your
"radix tree", adapted to the relevant page-sizes, then sparsity is not
a problem.
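
To illustrate why sparsity stops being an issue, here is a rough
userspace sketch of a multi-level table that only allocates interior
nodes on demand. All names (sparse_node, sparse_insert, sparse_lookup)
and the 3-level / 9-bit layout are made up for illustration; this is
not the xe_pt_walk.c interface.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: 3 levels of 9 bits on top of 4K pages (39-bit VA). */
#define LEVEL_BITS	9
#define LEVEL_ENTRIES	(1u << LEVEL_BITS)
#define BASE_SHIFT	12
#define NUM_LEVELS	3

struct sparse_node {
	/* Child node pointer, or the per-page payload at level 0. */
	void *slot[LEVEL_ENTRIES];
};

/* Counts lazily allocated nodes so the memory cost is visible. */
static size_t nodes_allocated;

static unsigned int level_index(uint64_t addr, int level)
{
	return (addr >> (BASE_SHIFT + level * LEVEL_BITS)) &
	       (LEVEL_ENTRIES - 1);
}

/* Store payload for the page containing addr. Returns 0 or -1 on OOM. */
static int sparse_insert(struct sparse_node *root, uint64_t addr,
			 void *payload)
{
	struct sparse_node *node = root;
	int level;

	for (level = NUM_LEVELS - 1; level > 0; level--) {
		unsigned int idx = level_index(addr, level);

		if (!node->slot[idx]) {
			/* Only populated parts of the VA space cost memory. */
			node->slot[idx] = calloc(1, sizeof(*node));
			if (!node->slot[idx])
				return -1;
			nodes_allocated++;
		}
		node = node->slot[idx];
	}
	node->slot[level_index(addr, 0)] = payload;
	return 0;
}

/* Returns the payload for addr, or NULL if nothing is stored there. */
static void *sparse_lookup(const struct sparse_node *root, uint64_t addr)
{
	const struct sparse_node *node = root;
	int level;

	for (level = NUM_LEVELS - 1; level > 0; level--) {
		node = node->slot[level_index(addr, level)];
		if (!node)
			return NULL;
	}
	return node->slot[level_index(addr, 0)];
}
```

Two pages at opposite ends of the address space cost four interior
nodes here, regardless of how wide the covered span is.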

> - How does the garbage collector work? We can't allocate memory in the
>   notifier so we don't anything to add to the garbage collector. We
>   can't directly modify page tables given you need lock in the path of
>   reclaim.

The garbage collector would operate on the whole invalidated range. In
the case of xe, upon zapping under reclaim you mark the individual
page-table bos that are to be removed as "invalid", and the garbage
collector walks the range removing the "invalid" entries. Subsequent
(re-)binding avoids the "invalid" entries (perhaps even helps remove
them) and can thus race with the garbage collector. Hence, any ranges
implied by the page-table code are eliminated.
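
Roughly what I have in mind, as a single-threaded userspace model. The
names (fake_pt, zap_range, gc_range, bind_entry) are invented for this
sketch, a flat array stands in for the page-table tree, and real code
would of course need proper locking or atomics around the state
changes.

```c
#include <stddef.h>

/* Hypothetical per-entry state; one entry stands in for a page-table bo. */
enum pt_state { PT_EMPTY, PT_VALID, PT_INVALID };

#define NUM_PT 16

struct fake_pt {
	enum pt_state state[NUM_PT];
};

/*
 * Invalidation path (may run under reclaim): no allocation and no
 * teardown, it only flags the entries that must go away.
 */
static void zap_range(struct fake_pt *pt, size_t start, size_t end)
{
	size_t i;

	for (i = start; i < end && i < NUM_PT; i++)
		if (pt->state[i] == PT_VALID)
			pt->state[i] = PT_INVALID;
}

/*
 * Garbage collector: walks the invalidated range outside of reclaim
 * and removes whatever is still flagged. Returns the number removed.
 */
static size_t gc_range(struct fake_pt *pt, size_t start, size_t end)
{
	size_t i, removed = 0;

	for (i = start; i < end && i < NUM_PT; i++) {
		if (pt->state[i] == PT_INVALID) {
			pt->state[i] = PT_EMPTY;
			removed++;
		}
	}
	return removed;
}

/*
 * (Re-)bind: never reuses a flagged entry. In this model it simply
 * installs a fresh valid entry in its place, which also takes that
 * entry off the garbage collector's hands.
 */
static void bind_entry(struct fake_pt *pt, size_t i)
{
	if (i < NUM_PT)
		pt->state[i] = PT_VALID;
}
```

Note that nothing here tracks a range object: the flagged entries in
the table are the only record of what the collector has left to do.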

> - How do we deal with fault storms (e.g. tons of faults hitting the same
>   SVM range in a row)? Without a SVM range no every to know if mapping
>   is valid and GPU page handler can be short circuited.

Perhaps look at the page-table tree and check whether the gpu_pte
causing the fault is valid.
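
As a minimal sketch of that check (userspace C, hypothetical names and
a made-up PTE_VALID bit, with a flat array standing in for the
page-table tree):

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_VALID	(1ull << 0)	/* hypothetical valid bit */
#define NUM_PTES	8

struct fake_gpu_pt {
	uint64_t pte[NUM_PTES];
	unsigned int slow_path_runs;	/* how often we did a full bind */
};

/* Stand-in for the expensive path: get pages, dma-map, write the pte. */
static void slow_path_bind(struct fake_gpu_pt *pt, unsigned int idx)
{
	pt->slow_path_runs++;
	pt->pte[idx] |= PTE_VALID;
}

/*
 * Returns true if the fault was short-circuited because the gpu_pte
 * was already valid (e.g. a duplicate fault from a fault storm).
 * Out-of-range indices are rejected and also report false.
 */
static bool handle_fault(struct fake_gpu_pt *pt, unsigned int idx)
{
	if (idx >= NUM_PTES)
		return false;
	if (pt->pte[idx] & PTE_VALID)
		return true;

	slow_path_bind(pt, idx);
	return false;
}
```

Only the first fault on an entry pays for the bind; the rest of the
storm returns after a single lookup.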

> - Do we have notifier seqno for every PTE?

I'd say no. With this approach it makes sense to have a wide notifier.
The seqno now only affects binding of new gpu_ptes, so the problem with
a wide notifier becomes that if an invalidation hits *any* part of the
notifier while we're in the read section during binding, we need to
rerun the binding. Adding more notifiers to mitigate that would
optimize faulting performance over core invalidation performance,
which Jason asked us to avoid.
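
The rerun cost can be pictured with a small userspace model of the
read section. It is loosely patterned on the begin / retry style of
the mmu interval notifier, but every name below is hypothetical and
the seqno is a plain counter, which is simpler than what the core
code does.

```c
#include <stdbool.h>
#include <stddef.h>

/* One wide notifier covering everything, with a single seqno. */
struct wide_notifier {
	unsigned long seq;
};

static unsigned long read_begin(const struct wide_notifier *n)
{
	return n->seq;
}

static bool read_retry(const struct wide_notifier *n, unsigned long seq)
{
	return n->seq != seq;
}

/* An invalidation anywhere inside the notifier bumps the one seqno. */
static void invalidate(struct wide_notifier *n)
{
	n->seq++;
}

/* Test hook used to simulate an invalidation racing with the bind. */
typedef void (*race_hook)(struct wide_notifier *n, int attempt);

/* Returns how many times the binding had to be run. */
static int bind_with_retry(struct wide_notifier *n, race_hook hook)
{
	unsigned long seq;
	int attempts = 0;

	do {
		seq = read_begin(n);
		attempts++;
		/* ... collect pages and prepare the new gpu_ptes here ... */
		if (hook)
			hook(n, attempts);
	} while (read_retry(n, seq));

	return attempts;
}

/* Simulated race: an unrelated invalidation lands during attempt 1. */
static void invalidate_on_first_attempt(struct wide_notifier *n, int attempt)
{
	if (attempt == 1)
		invalidate(n);
}
```

The invalidation side stays a single increment however wide the
notifier is; the price is paid only by a bind that happens to race
with it.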

/Thomas