[PATCH drm-next 13/14] drm/nouveau: implement new VM_BIND UAPI

Mon Jan 23 10:01:55 UTC 2023

On Sun, 22 Jan 2023 17:48:37 +0000
Matthew Brost <matthew.brost at intel.com> wrote:

> On Fri, Jan 20, 2023 at 11:22:45AM +0100, Boris Brezillon wrote:
> > On Thu, 19 Jan 2023 16:38:06 +0000
> > Matthew Brost <matthew.brost at intel.com> wrote:
> >   
> > > On Thu, Jan 19, 2023 at 04:36:43PM +0100, Danilo Krummrich wrote:  
> > > > On 1/19/23 05:58, Matthew Brost wrote:    
> > > > > On Thu, Jan 19, 2023 at 04:44:23AM +0100, Danilo Krummrich wrote:    
> > > > > > On 1/18/23 21:37, Thomas Hellström (Intel) wrote:    
> > > > > > > 
> > > > > > > On 1/18/23 07:12, Danilo Krummrich wrote:    
> > > > > > > > This commit provides the implementation for the new uapi motivated by the
> > > > > > > > Vulkan API. It allows user mode drivers (UMDs) to:
> > > > > > > > 
> > > > > > > > 1) Initialize a GPU virtual address (VA) space via the new
> > > > > > > >      DRM_IOCTL_NOUVEAU_VM_INIT ioctl for UMDs to specify the portion of VA
> > > > > > > >      space managed by the kernel and userspace, respectively.
> > > > > > > > 
> > > > > > > > 2) Allocate and free a VA space region as well as bind and unbind memory
> > > > > > > >      to the GPUs VA space via the new DRM_IOCTL_NOUVEAU_VM_BIND ioctl.
> > > > > > > >      UMDs can request the named operations to be processed either
> > > > > > > >      synchronously or asynchronously. It supports DRM syncobjs
> > > > > > > >      (incl. timelines) as synchronization mechanism. The management of the
> > > > > > > >      GPU VA mappings is implemented with the DRM GPU VA manager.
> > > > > > > > 
> > > > > > > > 3) Execute push buffers with the new DRM_IOCTL_NOUVEAU_EXEC ioctl. The
> > > > > > > >      execution happens asynchronously. It supports DRM syncobj (incl.
> > > > > > > >      timelines) as synchronization mechanism. DRM GEM object locking is
> > > > > > > >      handled with drm_exec.
> > > > > > > > 
> > > > > > > > Both, DRM_IOCTL_NOUVEAU_VM_BIND and DRM_IOCTL_NOUVEAU_EXEC, use the DRM
> > > > > > > > GPU scheduler for the asynchronous paths.
> > > > > > > > 
> > > > > > > > Signed-off-by: Danilo Krummrich <dakr at redhat.com>
> > > > > > > > ---
> > > > > > > >    Documentation/gpu/driver-uapi.rst       |   3 +
> > > > > > > >    drivers/gpu/drm/nouveau/Kbuild          |   2 +
> > > > > > > >    drivers/gpu/drm/nouveau/Kconfig         |   2 +
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_abi16.c |  16 +
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_abi16.h |   1 +
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_drm.c   |  23 +-
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_drv.h   |   9 +-
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_exec.c  | 310 ++++++++++
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_exec.h  |  55 ++
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_sched.c | 780 ++++++++++++++++++++++++
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_sched.h |  98 +++
> > > > > > > >    11 files changed, 1295 insertions(+), 4 deletions(-)
> > > > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.c
> > > > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.h
> > > > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.c
> > > > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.h    
> > > > > > > ...    
> > > > > > > > 
> > > > > > > > +static struct dma_fence *
> > > > > > > > +nouveau_bind_job_run(struct nouveau_job *job)
> > > > > > > > +{
> > > > > > > > +    struct nouveau_bind_job *bind_job = to_nouveau_bind_job(job);
> > > > > > > > +    struct nouveau_uvmm *uvmm = nouveau_cli_uvmm(job->cli);
> > > > > > > > +    struct bind_job_op *op;
> > > > > > > > +    int ret = 0;
> > > > > > > > +    
> > > > > > > 
> > > > > > > I was looking at how nouveau does the async binding compared to how xe
> > > > > > > does it.
> > > > > > > It looks to me that this function being a scheduler run_job callback is
> > > > > > > the main part of the VM_BIND dma-fence signalling critical section for
> > > > > > > the job's done_fence and if so, needs to be annotated as such?    
> > > > > > 
> > > > > > Yes, that's the case.
> > > > > >     
> > > > > > > 
> > > > > > > For example nouveau_uvma_region_new allocates memory, which is not
> > > > > > > allowed if in a dma_fence signalling critical section and the locking
> > > > > > > also looks suspicious?    
> > > > > > 
> > > > > > Thanks for pointing this out, I missed that somehow.
> > > > > > 
> > > > > > I will change it to pre-allocate new regions, mappings and page tables
> > > > > > within the job's submit() function.
> > > > > >     
> > > > > 
> > > > > Yea that what we basically do in Xe, in the IOCTL step allocate all the
> > > > > backing store for new page tables, populate new page tables (these are
> > > > > not yet visible in the page table structure), and in last step which is
> > > > > executed after all the dependencies are satified program all the leaf
> > > > > entires making the new binding visible.
> > > > > 
> > > > > We screwed have this up by defering most of the IOCTL to a worker but
> > > > > will fix this fix this one way or another soon - get rid of worker or
> > > > > introduce a type of sync that is signaled after the worker + publish the
> > > > > dma-fence in the worker. I'd like to close on this one soon.    
> > > > > > For the ops structures the drm_gpuva_manager allocates for reporting the
> > > > > > split/merge steps back to the driver I have ideas to entirely avoid
> > > > > > allocations, which also is a good thing in respect of Christians feedback
> > > > > > regarding the huge amount of mapping requests some applications seem to
> > > > > > generate.
> > > > > >     
> > > > > 
> > > > > It should be fine to have allocations to report the split/merge step as
> > > > > this step should be before a dma-fence is published, but yea if possible
> > > > > to avoid extra allocs as that is always better.    
> > > > 
> > > > I think we can't really ask for the split/merge steps before actually
> > > > running the job, since it requires the particular VA space not to change
> > > > while performing those operations.
> > > > 
> > > > E.g. if we'd run the split/merge steps at job submit() time the underlying
> > > > VA space could be changed by other bind jobs executing before this one,
> > > > which would make the calculated split/merge steps obsolete and wrong.
> > > >     
> > > 
> > > Hmm, maybe I'm not understanding this implementation, admittedly I
> > > haven't studied the gpuva manager code in detail.
> > > 
> > > Let me explain what we are doing in Xe.
> > > 
> > > Map 0x0000 - 0x3000 -> this resolves into 1 bind operation and 1 VMA
> > > Unmap 0x1000-0x2000 -> this resolves into 1 unbind and 2 rebind operations
> > > 
> > > 1. unbind 0x0000-0x3000 -> destroy old VMA
> > > 2. rebind 0x0000-0x1000 -> new VMA
> > > 3. rebind 0x2000-0x3000 -> new VMA
> > > 
> > > All of the above steps resolving the operations can be done in the IOCTL
> > > phase and VM's VMA structure is also updated. When the dependencies
> > > are resolved the actual bindings are done on the GPU. We use the BO's
> > > dma-resv slots to ensure there is never a window 0x0000-0x1000 and
> > > 0x2000-0x3000 are never mapped with respect to execs (I forget the exact
> > > details of how we do this but if you want to know I'll explain further).
> > >   
> > 
> > Ok, so I've been contemplating the idea of pre-reserving memory for any
> > future page-table updates, so I can guarantee the bind/unbind op in  
> > ->run_job() never fails (that's actually made more complicated in my  
> > case, because we don't directly control the page table updates, but
> > defer that to the iommu/iopagetbl framework which does the allocation,
> > so I didn't really go as far as you did). But with bind ops happening in
> > a queue with dependencies to wait on, guessing what the page-table will
> > look like is a bit challenging. Sure, we can pre-allocate pages for all
> > levels needed to reach the leaf node(s) we're trying to insert or plan
> > for the worst case scenario in case of 2MB -> 4K block splits for
> > partial unmaps, but it sounds like a lot of memory reservation,
> > especially if we get lot of bind requests queued. Just curious to hear
> > how you solved that.  
> 
> A few thing here.
> 
> First the reason we always do an unbind and possibly 2 rebinds is
> because we might crossing the 1GB / 2MB page boundary size. In the
> example I gave we could just unbind 0x1000-0x2000 and call it day but
> we'd have to build smarts into to our code to know that is ok which we
> don't have at this time.
> 
> e.g. This wouldn't ok just to do an unbind as the page size goes from
> 2MB to 4k:
> 
> Map   0x000000-0x200000 -> 2MB page
> Unmap 0x100000-0x200000 -> need 256 4k pages now
> 
> For simplity we always do a unbind + 2 possible rebinds. Maybe we will
> be a little smarter in the future if needed.

Yeah, I see why you're doing that, but it's not really an option in
pancsf, because this is hidden behind the io_pgtable abstract, which
deals with splits/merges of PTEs on its own. So, when we call
io_pgtable_ops->unmap(0x1000-0x2000), what happens in practice is:

1/ allocate an L3 page table
2/ populate the 0x000000-0x001000 and 0x002000-0x200000 range in this
   new L3 page table
3/ update the L2 page table entry to point to the L3 one

In that regards, the io_pgtable abstraction makes it simpler for the
driver, but that also means we don't control memory allocation, and
both map and unmap operations can fail.

> 
> WRT to the memory allocations, we haven't really thought about
> consequences of reserving memory for pending binds. My initial reaction
> is I don't think this an issue. Page table memory should be relatively
> small compared to user memory so having a bunch queue'd up shouldn't
> really be an issue, right? Going from 2MB -> 4k splits on either side is
> just another 2 4k allocs on either side.

Well, yes, for unmap ops, you'd probably need a maximum of 2 pages, but
for a map op, if you need to allocate L1 and L2 tables (they might not
exist when the map request reaches time of execution, and it's pretty
hard to predict the state of the page-table), that's actually more than
just 2 pages. It depends on the size of the mapping actually. And yes,
I agree it's not much compared to the amount of user memory, but if you
have plenty of map/unmap requests queued, pre-reserving those pages can
still consume a non-negligible amount of memory. But maybe we'll have
bigger problems if the amount of queued VM_BIND requests grows to the
point where the amount of memory reserved for future page-table updates
is a problem, dunno.