[PATCH 1/5] drm/amdgpu: allow direct submission in the VM backends

Wed Jul 17 20:31:38 UTC 2019

On 2019-07-17 5:10, Christian König wrote:
> Am 16.07.19 um 18:40 schrieb Kuehling, Felix:
>> On 2019-07-16 9:36 a.m., Christian König wrote:
>>> Am 02.07.19 um 21:35 schrieb Kuehling, Felix:
>>>> This assumes that page tables are resident when a page fault is 
>>>> handled.
>>> Yeah, that is correct. I also haven't completely figured out how we
>>> can prevent the VM from being destroyed while handling the fault.
>> There are other cases I had in mind: Page tables can be evicted. For KFD
>> processes which can be preempted with CWSR, it's possible that a wave
>> that caused a page fault is preempted due to a page table eviction. That
>> means, by the time the page fault is handled, the page table is no
>> longer resident.
>
> This is a corner case we can handle later on. As long as the VM is 
> still alive just allocating page tables again should be sufficient for 
> this.

Do you mean, instead of migrating page tables back, throwing them away 
and allocating a new one?

Also, this may be a corner case. But I feel you're limiting yourself to 
a small range of current use cases. I'm not convinced that the design 
you're building here will scale to future use cases for HMM updating 
page tables for random virtual addresses. I'm looking for a general 
solution that will work for those future use cases. Otherwise we'll end 
up having to rewrite this page-table-update-in-fault-handler code from 
scratch in a month or two.

>
>>> I mean it's perfectly possible that the process is killed while faults
>>> are still in the pipeline.
>>>
>>>> I think it's possible that a page table gets evicted while a page 
>>>> fault
>>>> is pending. Maybe not with graphics, but definitely with compute where
>>>> waves can be preempted while waiting for a page fault. In that case 
>>>> the
>>>> direct access would break.
>>>>
>>>> Even with graphics I think it's still possible that new page tables 
>>>> need
>>>> to be allocated to handle a page fault. When that happens, you need to
>>>> wait for fences to let new page tables be validated and initialized.
>>> Yeah, the problem here is that when you wait on fences which in turn
>>> depend on your submission your end up in a deadlock.
>>>
>> I think this implies that you have amdgpu_cs fences attached to page
>> tables. I believe this is the fundamental issue that needs to be fixed.
>
> We still need this cause even with page faults the root PD can't be 
> evicted.
>
> What we can probably do is to split up the PDs/PTs into the root PD 
> and everything else.

Yeah, the root PD always exists as long as the VM exists. Everything 
else can be created/destroyed/moved dynamically.

>
>> If you want to manage page tables in page fault interrupt handlers, you
>> can't have command submission fences attached to your page tables. You
>> can allow page tables to be evicted while the command submission is in
>> progress. A page fault will fault it back in if it's needed. If you
>> eliminate command submission fences on the page tables, you remove the
>> potential for deadlocks.
>
> No, there is still a huge potential for deadlocks here.
>
> Additional to the root PDs you can have a MM submission which needs to 
> wait for a compute submission to be finished.

I assume by MM you mean "memory manger", not "multi-media". What kind of 
MM submission specifically? It can't be a VM page table update, with my 
proposal to never add user CS fences to VM page table reservations. Are 
you talking about buffer moves? They could go on a separate scheduler 
entity from VM submissions, so they don't hold up VM updates. VM updates 
only need to wait for MM fences that affect the page table BOs themselves.

>
> If you then make your new allocation depend on the MM submission to be 
> finished you have a classical circle dependency and a deadlock at hand.

I don't see it. Allocate page table, wait for fence associated with that 
page table initialization, update PTEs. At no point do we depend on the 
user CS being stalled by the page fault. There is not user submission on 
the paging ring. Anything that has been scheduled on the paging ring has 
its dependencies satisfied. We may need separate scheduler entities 
(queues) for regular MM submissions that can depend on user fences and 
VM submissions that must not.

>
> The only way around that is to allocate the new page tables with the 
> no_wait_gpu flag set and so avoid having any dependencies on ongoing 
> operations.

We discussed this before. I suggested an emergency pool for page tables. 
That pool can have a limited size. If page tables don't have user fences 
on them, they can always be evicted, so we can always make room in this 
emergency pool.

>
>> But you do need fences on page tables related to the allocation and
>> migration of page tables themselves. And your page table updates must
>> wait for those fences. Therefore I think the whole approach of direct
>> submission for page table updates is fundamentally broken.
>
> For the reasons noted above you can't have any fences related to the 
> allocation and migration on page tables.
>
> What can happen later on is that you need to wait for a BO move to 
> finish before we can update the page tables.

A page table updated coming from a page fault handler should never have 
to wait for any BO move. The fact that there was a page fault means, 
someone is trying to access this memory right now.

>
>
> But I think that this is a completely different operation which 
> shouldn't be handled in the fault handler.

Right. If you have page table updates done to prepare for a CS, they can 
depend on use fences. Page table updates done as part of the page fault 
handler must not. Again, I think this could be handled by using separate 
scheduler entities to avoid false dependencies.

>
> In those cases the fault handler would just silence the retry fault 
> and continue crunching on other faults.

I don't think that's the right approach. If you have a retry fault for a 
virtual address, it means you already have something running on the GPU 
accessing it. It can't be something that depends on an in-flight page 
table update, because the scheduler would not have emitted that to the 
ring. You either need to fill in a valid address, or if there is nothing 
mapped at that address (yet), treat it as an application error and 
convert it into a no-retry-fault which kills the application.

Regards,
   Felix

>
> As soon as the BO is moved in place we should update the page tables 
> again using the normal SDMA scheduler entity.
>
>>>> Patch #2 deals with updating page directories. That pretty much 
>>>> implies
>>>> that page tables have moved or been reallocated. Wouldn't that be a
>>>> counter-indication for using direct page table updates? In other 
>>>> words,
>>>> if you need to update page directories, a direct page table updates is
>>>> unsafe by definition.
>>> No, page tables are allocated because this is for handling invalid
>>> faults.
>>>
>>> E.g. we get a fault for an address where nothing is mapped and just
>>> want to silence it.
>> That's the scenario you have in mind now. But in the future we'll get
>> page faults for addresses that have a valid VMA, and we want to use HMM
>> to map that into the GPU page table.
>
> Yeah, but we will still need to use the same infrastructure.
>
> Avoiding waiting on ongoing operations is mandatory or otherwise we 
> will immediately run into deadlocks.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>     Felix
>>
>>
>>> I will try to implement something to at least avoid accessing a
>>> destructed VM.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Regards,
>>>>      Felix
>>>>
>>>> On 2019-06-28 8:18 a.m., Christian König wrote:
>>>>> This allows us to update page tables directly while in a page fault.
>>>>>
>>>>> Signed-off-by: Christian König<christian.koenig at amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h      |  5 ++++
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  |  4 +++
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 29
>>>>> +++++++++++++--------
>>>>>     3 files changed, 27 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>> index 489a162ca620..5941accea061 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>> @@ -197,6 +197,11 @@ struct amdgpu_vm_update_params {
>>>>>          */
>>>>>         struct amdgpu_vm *vm;
>>>>>     +    /**
>>>>> +     * @direct: if changes should be made directly
>>>>> +     */
>>>>> +    bool direct;
>>>>> +
>>>>>         /**
>>>>>          * @pages_addr:
>>>>>          *
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
>>>>> index 5222d165abfc..f94e4896079c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
>>>>> @@ -49,6 +49,10 @@ static int amdgpu_vm_cpu_prepare(struct
>>>>> amdgpu_vm_update_params *p, void *owner,
>>>>>     {
>>>>>         int r;
>>>>>     +    /* Don't wait for anything during page fault */
>>>>> +    if (p->direct)
>>>>> +        return 0;
>>>>> +
>>>>>         /* Wait for PT BOs to be idle. PTs share the same resv. 
>>>>> object
>>>>>          * as the root PD BO
>>>>>          */
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>>>> index ddd181f5ed37..891d597063cb 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>>>> @@ -68,17 +68,17 @@ static int amdgpu_vm_sdma_prepare(struct
>>>>> amdgpu_vm_update_params *p,
>>>>>         if (r)
>>>>>             return r;
>>>>>     -    r = amdgpu_sync_fence(p->adev, &p->job->sync, exclusive,
>>>>> false);
>>>>> -    if (r)
>>>>> -        return r;
>>>>> +    p->num_dw_left = ndw;
>>>>> +
>>>>> +    if (p->direct)
>>>>> +        return 0;
>>>>>     -    r = amdgpu_sync_resv(p->adev, &p->job->sync, root->tbo.resv,
>>>>> -                 owner, false);
>>>>> +    r = amdgpu_sync_fence(p->adev, &p->job->sync, exclusive, false);
>>>>>         if (r)
>>>>>             return r;
>>>>>     -    p->num_dw_left = ndw;
>>>>> -    return 0;
>>>>> +    return amdgpu_sync_resv(p->adev, &p->job->sync, root->tbo.resv,
>>>>> +                owner, false);
>>>>>     }
>>>>>        /**
>>>>> @@ -99,13 +99,21 @@ static int amdgpu_vm_sdma_commit(struct
>>>>> amdgpu_vm_update_params *p,
>>>>>         struct dma_fence *f;
>>>>>         int r;
>>>>>     -    ring = container_of(p->vm->entity.rq->sched, struct
>>>>> amdgpu_ring, sched);
>>>>> +    if (p->direct)
>>>>> +        ring = p->adev->vm_manager.page_fault;
>>>>> +    else
>>>>> +        ring = container_of(p->vm->entity.rq->sched,
>>>>> +                    struct amdgpu_ring, sched);
>>>>>            WARN_ON(ib->length_dw == 0);
>>>>>         amdgpu_ring_pad_ib(ring, ib);
>>>>>         WARN_ON(ib->length_dw > p->num_dw_left);
>>>>> -    r = amdgpu_job_submit(p->job, &p->vm->entity,
>>>>> -                  AMDGPU_FENCE_OWNER_VM, &f);
>>>>> +
>>>>> +    if (p->direct)
>>>>> +        r = amdgpu_job_submit_direct(p->job, ring, &f);
>>>>> +    else
>>>>> +        r = amdgpu_job_submit(p->job, &p->vm->entity,
>>>>> +                      AMDGPU_FENCE_OWNER_VM, &f);
>>>>>         if (r)
>>>>>             goto error;
>>>>>     @@ -120,7 +128,6 @@ static int amdgpu_vm_sdma_commit(struct
>>>>> amdgpu_vm_update_params *p,
>>>>>         return r;
>>>>>     }
>>>>>     -
>>>>>     /**
>>>>>      * amdgpu_vm_sdma_copy_ptes - copy the PTEs from mapping
>>>>>      *
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>