[PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

Wed Feb 26 16:32:36 UTC 2020

On Tue, Feb 25, 2020 at 6:16 PM Daniel Vetter <daniel at ffwll.ch> wrote:
>
> On Mon, Feb 24, 2020 at 07:46:59PM +0100, Christian König wrote:
> > Am 23.02.20 um 17:54 schrieb Thomas Hellström (VMware):
> > > On 2/23/20 4:45 PM, Christian König wrote:
> > > > Am 21.02.20 um 18:12 schrieb Daniel Vetter:
> > > > > [SNIP]
> > > > > Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly
> > > > > degenerating
> > > > > into essentially a global lock. But only when there's actual contention
> > > > > and thrashing.
> > > >
> > > > Yes exactly. A really big problem in TTM is currently that we drop
> > > > the lock after evicting BOs because they tend to move in again
> > > > directly after that.
> > > >
> > > > From practice I can also confirm that there is exactly zero benefit
> > > > from dropping locks early and reacquire them for example for the VM
> > > > page tables. That's just makes it more likely that somebody needs to
> > > > roll back and this is what we need to avoid in the first place.
> > >
> > > If you have a benchmarking setup available it would be very interesting
> > > for future reference to see how changing from WD to WW mutexes affects
> > > the roll back frequency. WW is known to cause rollbacks much less
> > > frequently but there is more work associated with each rollback.
> >
> > Not of hand. To be honest I still have a hard time to get a grip on the
> > difference between WD and WW from the algorithm point of view. So I can't
> > judge that difference at all.
> >
> > > > Contention on BO locks during command submission is perfectly fine
> > > > as long as this is as lightweight as possible while we don't have
> > > > trashing. When we have trashing multi submission performance is best
> > > > archived to just favor a single process to finish its business and
> > > > block everybody else.
> > >
> > > Hmm. Sounds like we need a per-manager ww_rwsem protecting manager
> > > allocation, taken in write-mode then there's thrashing. In read-mode
> > > otherwise. That would limit the amount of "unnecessary" locks we'd have
> > > to keep and reduce unwanted side-effects, (see below):
> >
> > Well per-manager (you mean per domain here don't you?) doesn't sound like
> > that useful because we rarely use only one domain, but I'm actually
> > questioning for quite a while if the per BO lock scheme was the right
> > approach.
> >
> > See from the performance aspect the closest to ideal solution I can think of
> > would be a ww_rwsem per user of a resource.
> >
> > In other words we don't lock BOs, but instead a list of all their users and
> > when you want to evict a BO you need to walk that list and inform all users
> > that the BO will be moving.
> >
> > During command submission you then have the fast path which rather just
> > grabs the read side of the user lock and check if all BOs are still in the
> > expected place.
> >
> > If some BOs were evicted you back off and start the slow path, e.g. maybe
> > even copy additional data from userspace then grab the write side of the
> > lock etc.. etc...
> >
> > That approach is similar to what we use in amdgpu with the per-VM BOs, but
> > goes a step further. Problem is that we are so used to per BO locks in the
> > kernel that this is probably not doable any more.
>
> Yeah I think it'd be nice to have the same approach for shared bo too. I
> guess what we could do is something like this (spinning your ww_rwmutex
> idea a bit further):
>
> dma_buf_read_lock(buf, vm)
> {
>         if (enabled(CONFIG_DEBUG_WW_MUTEX_SLOWPATH))
>         {
>                 check that vm is indeed listed in buf and splat if not
>         }
>
>         /* for a buf that's not shared in multiple vm we'd have buf->resv
>          * == vm->resv here */
>         return ww_mutex_lock(vm->resv);
> }
>
> dma_buf_write_lock(buf)
> {
>         for_each_vm_in_buf(buf, vm) {
>                 ww_mutex_lock(vm->resv);
>         }
> }
>
> Ideally we'd track all these vms with something slightly less shoddy than
> a linked list :-) Resizeable array is probably pretty good, I think we
> only ever need to go from buf -> vm list, not the other way round. At
> least in dma_resv/dma_buf code, driver code ofc needs to keep a list of
> all bo bound to a vm somewhere. But that's probably a much bigger
> datastructure for tracking vma offsets and mappings and other things on
> top.
>
> Ofc to even just get there we'd need something like the sublock list to
> keep track of all the additional locks we'd need for the writer lock. And
> we'd need the release callback for backoff, so that we could also go
> through the slowpath on a vm object that we're not holding a full
> reference on. That also means vm need to be refcounted.
>
> And the list of vms on a buffer need to be protected with some lock and
> the usual kref_get_unless_zero trickery.
>
> But with all this I think we can make the dma_buf_write_lock lock 100%
> like the old per-buffer lock for everyone. And execbuf could switch over
> to dma_buf_read_lock for shared buffers. Bonus points when the gpu context
> just keeps track of a list of shared vm used by buffers in that context
> ...
>
> That way we could make vm fastpath locking a la amdgpu opt-in, while
> keeping everyone else on the per-object locking juices.
>
> Thoughts?

One thing I just realized, which is nasty: The full (write) lock needs
ww_acquire_ctx with this, because it needs to take a bunch of locks.
Rolling that out everywhere is going to be nasty.

I guess though we could do a fallback and have a locally created
ww_acquire_ctx if there's none passed in, with backoff entirely
implemented within dma_resv_lock.
-Daniel

>
> Cheers, Daniel
>
> PS: Of course the write lock for these buffers is going to be terrible, so
> every time you need to update fences for implicit sync on shared buffer
> (well write fence at least) it's going to suck. We probably also want a
> read_to_write_upgrade function, which also can be done easily with
> ww_mutex magic.
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch