[Intel-gfx] [PATCH 7/7] drm/i915/gem: Acquire all vma/objects under reservation_ww_class

Tue Jun 23 15:09:08 UTC 2020

On 6/23/20 4:01 PM, Chris Wilson wrote:
> Quoting Thomas Hellström (Intel) (2020-06-23 13:57:06)
>> On 6/23/20 1:22 PM, Thomas Hellström (Intel) wrote:
>>> Hi, Chris,
>>>
>>> On 6/22/20 11:59 AM, Chris Wilson wrote:
>>>> In order to actually handle eviction and what not, we need to process
>>>> all the objects together under a common lock, reservation_ww_class. As
>>>> such, do a memory reservation pass after looking up the object/vma,
>>>> which then feeds into the rest of execbuf [relocation, cmdparsing,
>>>> flushing and ofc execution].
>>>>
>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>> ---
>>>>    .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 91 ++++++++++++++-----
>>>>    1 file changed, 70 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>>>> b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>>>> index 46fcbdf8161c..8db2e013465f 100644
>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>>>> @@ -53,10 +53,9 @@ struct eb_vma_array {
>>>>      #define __EXEC_OBJECT_HAS_PIN        BIT(31)
>>>>    #define __EXEC_OBJECT_HAS_FENCE        BIT(30)
>>>> -#define __EXEC_OBJECT_HAS_PAGES        BIT(29)
>>>> -#define __EXEC_OBJECT_NEEDS_MAP        BIT(28)
>>>> -#define __EXEC_OBJECT_NEEDS_BIAS    BIT(27)
>>>> -#define __EXEC_OBJECT_INTERNAL_FLAGS    (~0u << 27) /* all of the
>>>> above */
>>>> +#define __EXEC_OBJECT_NEEDS_MAP        BIT(29)
>>>> +#define __EXEC_OBJECT_NEEDS_BIAS    BIT(28)
>>>> +#define __EXEC_OBJECT_INTERNAL_FLAGS    (~0u << 28) /* all of the
>>>> above */
>>>>      #define __EXEC_HAS_RELOC    BIT(31)
>>>>    #define __EXEC_INTERNAL_FLAGS    (~0u << 31)
>>>> @@ -241,6 +240,8 @@ struct i915_execbuffer {
>>>>        struct intel_context *context; /* logical state for the request */
>>>>        struct i915_gem_context *gem_context; /** caller's context */
>>>>    +    struct dma_fence *mm_fence;
>>>> +
>>>>        struct i915_request *request; /** our request to build */
>>>>        struct eb_vma *batch; /** identity of the batch obj/vma */
>>>>        struct i915_vma *trampoline; /** trampoline used for chaining */
>>>> @@ -331,12 +332,7 @@ static inline void eb_unreserve_vma(struct
>>>> eb_vma *ev)
>>>>        if (ev->flags & __EXEC_OBJECT_HAS_PIN)
>>>>            __i915_vma_unpin(vma);
>>>>    -    if (ev->flags & __EXEC_OBJECT_HAS_PAGES)
>>>> -        i915_gem_object_unpin_pages(vma->obj);
>>>> -
>>>> -    ev->flags &= ~(__EXEC_OBJECT_HAS_PIN |
>>>> -               __EXEC_OBJECT_HAS_FENCE |
>>>> -               __EXEC_OBJECT_HAS_PAGES);
>>>> +    ev->flags &= ~(__EXEC_OBJECT_HAS_PIN | __EXEC_OBJECT_HAS_FENCE);
>>>>    }
>>>>      static void eb_vma_array_destroy(struct kref *kref)
>>>> @@ -667,6 +663,55 @@ eb_add_vma(struct i915_execbuffer *eb,
>>>>        list_add_tail(&ev->lock_link, &eb->lock);
>>>>    }
>>>>    +static int eb_vma_get_pages(struct i915_execbuffer *eb,
>>>> +                struct eb_vma *ev,
>>>> +                u64 idx)
>>>> +{
>>>> +    struct i915_vma *vma = ev->vma;
>>>> +    int err;
>>>> +
>>>> +    /* XXX also preallocate PD for vma */
>>>> +
>>>> +    err = ____i915_gem_object_get_pages_async(vma->obj);
>>>> +    if (err)
>>>> +        return err;
>>>> +
>>>> +    return i915_active_ref(&vma->obj->mm.active, idx, eb->mm_fence);
>>>> +}
>>>> +
>>>> +static int eb_reserve_mm(struct i915_execbuffer *eb)
>>>> +{
>>>> +    const u64 idx = eb->context->timeline->fence_context;
>>>> +    struct ww_acquire_ctx acquire;
>>>> +    struct eb_vma *ev;
>>>> +    int err;
>>>> +
>>>> +    eb->mm_fence = __dma_fence_create_proxy(0, 0);
>>>> +    if (!eb->mm_fence)
>>>> +        return -ENOMEM;
>>> Question: eb is local to this thread, right, so eb->mm_fence is not
>>> considered "published" yet?
>>>
>>>> +
>>>> +    ww_acquire_init(&acquire, &reservation_ww_class);
>>>> +
>>>> +    err = eb_lock_vma(eb, &acquire);
>>>> +    if (err)
>>>> +        goto out;
>>>> +
>>>> +    ww_acquire_done(&acquire);
>>>> +
>>>> +    list_for_each_entry(ev, &eb->lock, lock_link) {
>>>> +        struct i915_vma *vma = ev->vma;
>>>> +
>>>> +        if (err == 0)
>>>> +            err = eb_vma_get_pages(eb, ev, idx);
>>> I figure this is where you publish the proxy fence? If so, the fence
>>> signaling critical path starts with this loop, and that means any code
>>> we call between here and submission complete (including spawned work
>>> we need to wait for before submission) may not lock the
>>> reservation_ww_class nor (still being discussed) allocate memory.
> Yes, at this point we have reserved the memory for the execbuf.
>
>>> It
>>> looks like i915_pin_vma takes a reservation_ww_class. And all memory
>>> pinning seems to be in the fence critical path as well?
> Correct, it's not meant to be waiting inside i915_vma_pin(); the
> intention was to pass in memory, and then we would not need to
> do the acquire ourselves. As we have just reserved the memory in the
> above loop, this should not be an issue. I was trying to keep the
> change minimal and allow incremental conversions. It does however need
> to add a reference to the object for the work it spawns -- equally
> though there is an async eviction pass later in execbuf. The challenge
> here is that the greedy grab of bound vma is faster than doing the
> unbound eviction handling (even when eviction is not required).

So for the i915_vma_pin, it looks like

fence_critical_start(eb_reserve_mm) -> 
dma_resv_lock_interruptible(i915_vma_pin) -> lockdep issue.

You can't take the dma_resv_lock inside a fence critical section.

And for the memory allocation, it looks like the fence is published in 
the first loop iteration, starting the critical section, meaning that 
any memory allocation that follows will cause a lockdep issue. That 
includes worker threads. (with the proposed dma_fence annotations).

>
>> And I think even if we at some point end up with the allocation
>> annotation the other way around, allowing memory allocations in fence
>> signalling critical paths, both relocations and userpointer would cause
>> lockdep problems because of
>>
>> mmap_sem->reservation_object->fence_wait (fault handlers, lockdep priming)
> We don't wait inside mmap_sem. One cannot, you do not know the locking
> context, so you can only try to reclaim idle space. So you end up with
> the issue of a multitude of threads each trying to claim the last slice
> of the aperture/backing storage, not being able to directly reclaim and
> so have to hit the equivalent of kswapd.

I don't think I follow you here. There are a number of drivers that wait 
for dma_fences inside the fault handlers with mmap_sem held for data to 
be migrated before the pte is set up.

>
>> vs
>> fence_critical->gup/copy_from_user->mmap_sem
> Which exists today, even the busy wait loop is implicit linkage; you only
> need userspace to be holding a resource on the gpu to create the deadlock.
> I've been using the userfault handler to develop test cases where we can
> arbitrarily block the userptr.

Yes but in a case where we don't publish the fence early, the above 
would be reduced to the well known reservation_ww_class vs mmap_sem 
lockdep issue, which other drivers seem to have solved and we could copy 
what they've done.

/Thomas

> -Chris