[Intel-gfx] [PATCH 7/7] drm/i915/gem: Acquire all vma/objects under reservation_ww_class

Chris Wilson chris at chris-wilson.co.uk
Wed Jun 24 10:48:05 UTC 2020


Quoting Thomas Hellström (Intel) (2020-06-24 10:50:08)
> 
> On 6/24/20 10:08 AM, Chris Wilson wrote:
> > Quoting Thomas Hellström (Intel) (2020-06-24 06:42:33)
> >> On 6/23/20 11:15 PM, Chris Wilson wrote:
> >>> Quoting Thomas Hellström (Intel) (2020-06-23 21:31:38)
> >>>> On 6/23/20 8:41 PM, Chris Wilson wrote:
> >>>>> Quoting Thomas Hellström (Intel) (2020-06-23 19:21:28)
> >>>>>> On 6/23/20 6:36 PM, Chris Wilson wrote:
> >>>>>>> Quoting Thomas Hellström (Intel) (2020-06-23 12:22:11)
> >>>>>>>> Hi, Chris,
> >>>>>>>>
> >>>>>>>> On 6/22/20 11:59 AM, Chris Wilson wrote:
> >>>>>>>>> In order to actually handle eviction and what not, we need to process
> >>>>>>>>> all the objects together under a common lock, reservation_ww_class. As
> >>>>>>>>> such, do a memory reservation pass after looking up the object/vma,
> >>>>>>>>> which then feeds into the rest of execbuf [relocation, cmdparsing,
> >>>>>>>>> flushing and ofc execution].
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>>>>>> ---
> >>>>>>>>>       .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 91 ++++++++++++++-----
> >>>>>>>>>       1 file changed, 70 insertions(+), 21 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> >>>>>>>>> index 46fcbdf8161c..8db2e013465f 100644
> >>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> >>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> >>>>>>>>> @@ -53,10 +53,9 @@ struct eb_vma_array {
> >>>>>>>>>       
> >>>>>>>>>       #define __EXEC_OBJECT_HAS_PIN               BIT(31)
> >>>>>>>>>       #define __EXEC_OBJECT_HAS_FENCE             BIT(30)
> >>>>>>>>> -#define __EXEC_OBJECT_HAS_PAGES              BIT(29)
> >>>>>>>>> -#define __EXEC_OBJECT_NEEDS_MAP              BIT(28)
> >>>>>>>>> -#define __EXEC_OBJECT_NEEDS_BIAS     BIT(27)
> >>>>>>>>> -#define __EXEC_OBJECT_INTERNAL_FLAGS (~0u << 27) /* all of the above */
> >>>>>>>>> +#define __EXEC_OBJECT_NEEDS_MAP              BIT(29)
> >>>>>>>>> +#define __EXEC_OBJECT_NEEDS_BIAS     BIT(28)
> >>>>>>>>> +#define __EXEC_OBJECT_INTERNAL_FLAGS (~0u << 28) /* all of the above */
> >>>>>>>>>       
> >>>>>>>>>       #define __EXEC_HAS_RELOC    BIT(31)
> >>>>>>>>>       #define __EXEC_INTERNAL_FLAGS       (~0u << 31)
> >>>>>>>>> @@ -241,6 +240,8 @@ struct i915_execbuffer {
> >>>>>>>>>           struct intel_context *context; /* logical state for the request */
> >>>>>>>>>           struct i915_gem_context *gem_context; /** caller's context */
> >>>>>>>>>       
> >>>>>>>>> +     struct dma_fence *mm_fence;
> >>>>>>>>> +
> >>>>>>>>>           struct i915_request *request; /** our request to build */
> >>>>>>>>>           struct eb_vma *batch; /** identity of the batch obj/vma */
> >>>>>>>>>           struct i915_vma *trampoline; /** trampoline used for chaining */
> >>>>>>>>> @@ -331,12 +332,7 @@ static inline void eb_unreserve_vma(struct eb_vma *ev)
> >>>>>>>>>           if (ev->flags & __EXEC_OBJECT_HAS_PIN)
> >>>>>>>>>                   __i915_vma_unpin(vma);
> >>>>>>>>>       
> >>>>>>>>> -     if (ev->flags & __EXEC_OBJECT_HAS_PAGES)
> >>>>>>>>> -             i915_gem_object_unpin_pages(vma->obj);
> >>>>>>>>> -
> >>>>>>>>> -     ev->flags &= ~(__EXEC_OBJECT_HAS_PIN |
> >>>>>>>>> -                    __EXEC_OBJECT_HAS_FENCE |
> >>>>>>>>> -                    __EXEC_OBJECT_HAS_PAGES);
> >>>>>>>>> +     ev->flags &= ~(__EXEC_OBJECT_HAS_PIN | __EXEC_OBJECT_HAS_FENCE);
> >>>>>>>>>       }
> >>>>>>>>>       
> >>>>>>>>>       static void eb_vma_array_destroy(struct kref *kref)
> >>>>>>>>> @@ -667,6 +663,55 @@ eb_add_vma(struct i915_execbuffer *eb,
> >>>>>>>>>           list_add_tail(&ev->lock_link, &eb->lock);
> >>>>>>>>>       }
> >>>>>>>>>       
> >>>>>>>>> +static int eb_vma_get_pages(struct i915_execbuffer *eb,
> >>>>>>>>> +                         struct eb_vma *ev,
> >>>>>>>>> +                         u64 idx)
> >>>>>>>>> +{
> >>>>>>>>> +     struct i915_vma *vma = ev->vma;
> >>>>>>>>> +     int err;
> >>>>>>>>> +
> >>>>>>>>> +     /* XXX also preallocate PD for vma */
> >>>>>>>>> +
> >>>>>>>>> +     err = ____i915_gem_object_get_pages_async(vma->obj);
> >>>>>>>>> +     if (err)
> >>>>>>>>> +             return err;
> >>>>>>>>> +
> >>>>>>>>> +     return i915_active_ref(&vma->obj->mm.active, idx, eb->mm_fence);
> >>>>>>>>> +}
> >>>>>>>>> +
> >>>>>>>>> +static int eb_reserve_mm(struct i915_execbuffer *eb)
> >>>>>>>>> +{
> >>>>>>>>> +     const u64 idx = eb->context->timeline->fence_context;
> >>>>>>>>> +     struct ww_acquire_ctx acquire;
> >>>>>>>>> +     struct eb_vma *ev;
> >>>>>>>>> +     int err;
> >>>>>>>>> +
> >>>>>>>>> +     eb->mm_fence = __dma_fence_create_proxy(0, 0);
> >>>>>>>>> +     if (!eb->mm_fence)
> >>>>>>>>> +             return -ENOMEM;
> >>>>>>>> Question: eb is local to this thread, right, so eb->mm_fence is not
> >>>>>>>> considered "published" yet?
> >>>>>>>>
> >>>>>>>>> +
> >>>>>>>>> +     ww_acquire_init(&acquire, &reservation_ww_class);
> >>>>>>>>> +
> >>>>>>>>> +     err = eb_lock_vma(eb, &acquire);
> >>>>>>>>> +     if (err)
> >>>>>>>>> +             goto out;
> >>>>>>>>> +
> >>>>>>>>> +     ww_acquire_done(&acquire);
> >>>>>>>>> +
> >>>>>>>>> +     list_for_each_entry(ev, &eb->lock, lock_link) {
> >>>>>>>>> +             struct i915_vma *vma = ev->vma;
> >>>>>>>>> +
> >>>>>>>>> +             if (err == 0)
> >>>>>>>>> +                     err = eb_vma_get_pages(eb, ev, idx);
> >>>>>>>> I figure this is where you publish the proxy fence? If so, the fence
> >>>>>>>> signaling critical path starts with this loop,
> >>>>>>> Hmm, actually at this moment, the fence is still very much internal
> >>>>>>> being only used as a reference token,
> >>>>>> I think as long as another thread, running in this driver or another gpu
> >>>>>> driver can theoretically reference the fence pointer from the
> >>>>>> reservation object and wait for the fence it's considered published.
> >>>>> It's not in the reservation object.
> >>>>>     
> >>>>>> Also the ww_mutexes in this context are really all about grabbing a
> >>>>>> random set of resources and associate them with a point in a timeline,
> >>>>>> as the ww_mutexes are released, the fence pointer(s) need to point to
> >>>>>> published fence(s).
> >>>>> That's not the purpose of these fences, though. They exist to provide
> >>>>> reference counting on the backing store, along side the migration fence.
> >>>>> It's extra detail tacked on the equivalent of bo->moving.
> >>>>>
> >>>>> That is not to say that one could build up an async migration chain which
> >>>>> form a graph back to these, that chain could only be formed once the
> >>>>> operation itself has been published in the dma_resv though.
> >>>> Hmm. So let's say another thread grabs one of the just released
> >>>> ww_mutexes and wants to schedule a blit from one of the buffers in the
> >>>> current operation with high priority. How would that thread know how to
> >>>> order that blit operation w r t the current operation?
> >>> Why would it order?
> >> So let's say it's an eviction blit, needing to incorporate the data from
> >> the current operation. Or, for that matter a ttm-style cpu copy eviction:
> >>
> >> ww_mutex_lock
> >> wait_for_idle
> >> copy
> >> ww_mutex_unlock
> > We have a scheduler. Eviction does not block. Submission never blocks.
> So regardless if we block or not, how does the scheduler know how to 
> order the eviction blit after the current operation? Wouldn't it need to 
> look at the proxy fence to determine that? Basically I'm trying to get 
> an understanding where the fence signaling critical section starts.

Yes, via the eviction logic, but that is only applicable to evictions
within the critical section, and the easiest way to circumvent that is
not to allow evictions within that region; that is, evictions can only
be scheduled en masse and not piecemeal. [All the same rules as pinning
apply, since it is the same...]
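
Concretely, something along these lines: the whole payload is locked
under a single acquire context, any eviction it needs is scheduled
while all those locks are still held, and once ww_acquire_done() is
called nothing in the set is evicted piecemeal.
evict_for_payload() below is made up, the rest follows the patch:

	ww_acquire_init(&acquire, &reservation_ww_class);

	err = eb_lock_vma(eb, &acquire);	/* lock every object in the payload */
	if (err == 0)
		err = evict_for_payload(eb);	/* hypothetical: evict for the set as a whole, or fail */

	ww_acquire_done(&acquire);	/* after this, no piecemeal eviction within the set */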

There is no disagreement that ideally all reservations should be
performed up front. The issue is quite simply that we do not know all
the reservations we will need up front -- there are many sequences
which we can offload to the GPU but which require arbitrary allocations
to do so. For that, what I was expecting was to try to create requests
without eviction (akin to GFP_ATOMIC); on running out of space, commit
what has been completed so far, then roll back to reacquire all objects
plus a reserved mempool and continue on, as many times as required to
hand the whole payload to the GPU. (Experience may tell us to start off
with a reservation for a mempool in addition to the user and PD
payload.)
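
In other words, a loop of roughly this shape (eb_reserve_mm() is from
this patch; the other helpers are made up to illustrate the idea):

	do {
		err = eb_reserve_mm(eb);	/* reservation pass, no eviction (GFP_ATOMIC-like) */
		if (err != -ENOSPC)
			break;

		eb_commit_completed(eb);	/* hypothetical: hand over what has been built so far */
		eb_unreserve_all(eb);		/* hypothetical: drop all the ww_mutexes */
		err = eb_acquire_with_mempool(eb); /* hypothetical: reacquire objects + reserved mempool */
	} while (err == 0);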

The other issue is that some objects are not trivially evictable: those
that are in active use by the HW and have an implicit write after the
fence. [And when that write occurs is unknown, as we are expected to
treat it as part of a black box, so we only know for certain after the
fact, once another context's request is signaled.] They also have a
placeholder for a fence that is inserted by what is essentially a GC
sweep. [That fence can be inserted on demand, hence the special
perma-pinned kernel context to ensure that we can always force a
context switch, which is also used for power management of the
engines.] It's certainly a lot simpler if we can avoid including those
in the eviction pass, only removing them when idle, and avoid ever
exposing them to a wide lock. It certainly needs to be treated with
care to avoid regressing the driver.
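
That is, the eviction pass would only ever look at idle objects,
roughly (both helpers below are made up):

	list_for_each_entry_safe(obj, next, &eviction_list, evict_link) {
		/*
		 * Still in use by the HW: it may yet receive an implicit
		 * write after its fences, so skip it here and only reap it
		 * once idle, via the GC sweep.
		 */
		if (object_is_busy(obj))	/* hypothetical check */
			continue;

		evict_object(obj);		/* hypothetical */
	}
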
-Chris

