[PATCH 1/5] drm/amdgpu: allow direct submission in the VM backends

Thu Jul 18 22:34:18 UTC 2019

On 2019-07-18 4:47 a.m., Christian König wrote:

[snip]

>>> This is a corner case we can handle later on. As long as the VM is
>>> still alive just allocating page tables again should be sufficient for
>>> this.
>> Do you mean, instead of migrating page tables back, throwing them away
>> and allocating a new one?
>
> Yes, exactly that's the idea here. 

OK, I think that would work. The thing with direct submission and not 
waiting for fences is that you only have implicit synchronization with 
anything else that was also submitted directly to the same SDMA ring. So 
page table allocation and initialization would work fine. Migration 
would not, unless you have special cases for migration of page table BOs.

There is also a more general issue with direct submission that I found 
while you were on vacation. There is no locking of the ring buffer. So 
direct and non-direct submission to the same ring is broken at the moment.
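
To illustrate the locking problem, here is a minimal user-space sketch, 
not the amdgpu code. It assumes a hypothetical model_ring with a single 
write pointer and uses a C11 atomic_flag as a stand-in for a kernel 
spinlock. All names and the ring size are made up for illustration, and 
free-space tracking is left out. The point is only that both the direct 
path and the scheduler path would have to copy a whole packet and 
advance the write pointer under the same lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_DWORDS 64u /* hypothetical ring size, must be a power of two */

struct model_ring {
    atomic_flag lock;          /* stand-in for a kernel spinlock */
    uint32_t buf[RING_DWORDS];
    uint32_t wptr;             /* next free slot, in dwords */
};

static void model_ring_init(struct model_ring *ring)
{
    atomic_flag_clear(&ring->lock);
    ring->wptr = 0;
}

/*
 * Copy one complete packet into the ring and advance the write pointer
 * while holding the lock. Direct and scheduler-driven submitters would
 * both call this, so their packets can never interleave.
 * Returns the dword offset at which the packet starts.
 */
static uint32_t model_ring_submit(struct model_ring *ring,
                                  const uint32_t *pkt, uint32_t ndw)
{
    uint32_t start, i;

    assert(ndw <= RING_DWORDS);

    while (atomic_flag_test_and_set_explicit(&ring->lock,
                                             memory_order_acquire))
        ; /* spin until the other submitter is done */

    start = ring->wptr;
    for (i = 0; i < ndw; i++)
        ring->buf[(start + i) & (RING_DWORDS - 1)] = pkt[i];
    ring->wptr = (start + ndw) & (RING_DWORDS - 1);

    atomic_flag_clear_explicit(&ring->lock, memory_order_release);
    return start;
}
```

Without the lock, two submitters could read the same wptr and overwrite 
each other's packets, which is the breakage described above.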

>>>>> I mean it's perfectly possible that the process is killed while 
>>>>> faults
>>>>> are still in the pipeline.
>>>>>
>>>>>> I think it's possible that a page table gets evicted while a page
>>>>>> fault
>>>>>> is pending. Maybe not with graphics, but definitely with compute 
>>>>>> where
>>>>>> waves can be preempted while waiting for a page fault. In that case
>>>>>> the
>>>>>> direct access would break.
>>>>>>
>>>>>> Even with graphics I think it's still possible that new page tables
>>>>>> need
>>>>>> to be allocated to handle a page fault. When that happens, you 
>>>>>> need to
>>>>>> wait for fences to let new page tables be validated and initialized.
>>>>> Yeah, the problem here is that when you wait on fences which in turn
>>>>> depend on your submission you end up in a deadlock.
>>>>>
>>>> I think this implies that you have amdgpu_cs fences attached to page
>>>> tables. I believe this is the fundamental issue that needs to be 
>>>> fixed.
>>> We still need this cause even with page faults the root PD can't be
>>> evicted.
>>>
>>> What we can probably do is to split up the PDs/PTs into the root PD
>>> and everything else.
>> Yeah, the root PD always exists as long as the VM exists. Everything
>> else can be created/destroyed/moved dynamically.
>
> Yeah, exactly. The question is how do we want to keep the root PD in 
> place?
>
> We could still add the fence or we could pin it permanently.

Right. I was thinking permanent pinning can lead to fragmentation. It 
would be good if those small root PDs could be moved around to make room 
for bigger contiguous allocations when needed.

>
>>>> If you want to manage page tables in page fault interrupt handlers, 
>>>> you
>>>> can't have command submission fences attached to your page tables. You
>>>> can allow page tables to be evicted while the command submission is in
>>>> progress. A page fault will fault it back in if it's needed. If you
>>>> eliminate command submission fences on the page tables, you remove the
>>>> potential for deadlocks.
>>> No, there is still a huge potential for deadlocks here.
>>>
>>> In addition to the root PDs you can have a MM submission which needs to
>>> wait for a compute submission to be finished.
>> I assume by MM you mean "memory manager", not "multi-media". [SNIP]
>
> Sorry I meant "multi-media", so just snipped your response.
>
> What I want to say here is that I don't believe we can keep user CS 
> fences out of memory management.
>
> See there can be submission from engines which don't support or don't 
> want to enable recoverable page faults which depend on submissions 
> which do use recoverable page faults.
>
> I mean it was your requirement that we have a mix of page fault and 
> pre-filled page tables in the same process.

Right. There are a few different requirements:

 1. Disable retry faults and instruction replay for a VM completely
    (better performance for ML shaders)
 2. Pre-fill page tables even when retry faults are enabled

In case #2 we could deal with page tables being evicted (not fenced). 
But MM engines that don't support retry faults would throw a wrench in 
this idea.

>
>>> If you then make your new allocation depend on the MM submission to be
>>> finished you have a classic circular dependency and a deadlock at hand.
>> I don't see it. Allocate page table, wait for fence associated with that
>> page table initialization, update PTEs. At no point do we depend on the
>> user CS being stalled by the page fault. There is no user submission on
>> the paging ring. Anything that has been scheduled on the paging ring has
>> its dependencies satisfied.
>
> Allocation is the main problem here. We need to make sure that we 
> never ever depend on user CS when making memory allocation in the page 
> fault handler.
>> We may need separate scheduler entities
>> (queues) for regular MM submissions that can depend on user fences and
>> VM submissions that must not.
>
> Yeah, thought about that as well but even then you need a way to note 
> that you want to use this separate entity.
>
>>> The only way around that is to allocate the new page tables with the
>>> no_wait_gpu flag set and so avoid having any dependencies on ongoing
>>> operations.
>> We discussed this before. I suggested an emergency pool for page tables.
>> That pool can have a limited size. If page tables don't have user fences
>> on them, they can always be evicted, so we can always make room in this
>> emergency pool.
>
> You underestimate the problem. For page tables I can make sure rather 
> easily that we can always allocate something, but ALL allocations made 
> during page fault can't depend on user CS.
>
> This means we need to use this for pages which are used for HMM based 
> migration and for this you can't have a fixed pool.

That is if you do migration in the page fault handler. We could do 
migration outside of the page fault handler. See below.

>
>>>> But you do need fences on page tables related to the allocation and
>>>> migration of page tables themselves. And your page table updates must
>>>> wait for those fences. Therefore I think the whole approach of direct
>>>> submission for page table updates is fundamentally broken.
>>> For the reasons noted above you can't have any fences related to the
>>> allocation and migration on page tables.
>>>
>>> What can happen later on is that you need to wait for a BO move to
>>> finish before we can update the page tables.
>> A page table update coming from a page fault handler should never
>> have to wait for any BO move. The fact that there was a page fault means
>> someone is trying to access this memory right now.
>
> Well essentially with HMM we want to migrate memory to VRAM during the 
> page fault handler, don't we?

The page fault handler could inform migration decisions. But the 
migration itself doesn't need to be in the page fault handler. A 
migration can be on an application thread (e.g. triggered by an hbind or 
similar call) or on a worker thread that gets triggered by asynchronous 
events such as page faults, polling of performance counters, etc. A 
migration would trigger an MMU notifier that would invalidate the 
respective page table entries. Updating the page table entries with the 
new physical addresses would likely happen in a page fault handler after 
the migration is complete.
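
The ordering described above can be modeled roughly as follows. This is 
only a sketch under assumed names (mig_page, mig_pte and the three 
helpers are hypothetical, not existing kernel or HMM interfaces). The 
migration runs outside the fault handler, its notifier-style callback 
invalidates the PTE, and the fault handler only fills in the new address 
once the migration has finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified state for one page and one GPU PTE. */
struct mig_page {
    uint64_t phys;      /* current physical address of the page */
    bool migrating;     /* set while a migration is in flight */
};

struct mig_pte {
    bool valid;
    uint64_t phys;
};

/* Called from the migration path (application or worker thread).
 * Plays the role of the MMU notifier: the GPU mapping is invalidated
 * before the page moves. */
static void mig_begin(struct mig_page *pg, struct mig_pte *pte)
{
    pg->migrating = true;
    pte->valid = false;
}

/* Called from the migration path once the copy is complete. */
static void mig_end(struct mig_page *pg, uint64_t new_phys)
{
    pg->phys = new_phys;
    pg->migrating = false;
}

/* Called from the GPU page fault handler. It never migrates anything;
 * it only re-establishes the mapping if the page is stable.
 * Returns true if the fault was resolved, false if it must be retried. */
static bool mig_handle_fault(const struct mig_page *pg, struct mig_pte *pte)
{
    if (pg->migrating)
        return false;

    pte->phys = pg->phys;
    pte->valid = true;
    return true;
}
```

In this model the fault handler has no dependency on the migration's 
allocations or fences. It either sees a finished migration or leaves the 
fault to be retried.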

To minimize the number of page faults while the migration is in progress 
we could also put PTEs in silent retry mode first. After the migration 
is complete we could update the PTEs with non-silent retry or with the 
new physical addresses.
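
A rough model of the silent retry idea, as a sketch only. The three PTE 
modes and their behavior here are assumptions made for illustration 
(silent retry: the access stalls without raising a fault interrupt; 
non-silent retry: the access stalls and raises one). The names are 
hypothetical and do not correspond to real PTE bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum retry_mode {
    RETRY_PTE_MAPPED,   /* valid translation */
    RETRY_PTE_NOISY,    /* non-silent retry: access raises a fault IRQ */
    RETRY_PTE_SILENT,   /* silent retry: access stalls, no IRQ */
};

struct retry_pte {
    enum retry_mode mode;
    uint64_t phys;
};

/* Model one GPU access. Returns true if the access completes and bumps
 * *irq_count when the access would raise a fault interrupt. */
static bool retry_pte_access(const struct retry_pte *pte, unsigned *irq_count)
{
    switch (pte->mode) {
    case RETRY_PTE_MAPPED:
        return true;
    case RETRY_PTE_NOISY:
        (*irq_count)++;
        return false;
    case RETRY_PTE_SILENT:
        return false;
    }
    return false;
}

/* Migration start: keep the GPU quiet while the page is moving. */
static void retry_pte_begin_migration(struct retry_pte *pte)
{
    pte->mode = RETRY_PTE_SILENT;
}

/* Migration end: either map the new address right away, or switch to
 * non-silent retry so the next access faults and the handler maps it. */
static void retry_pte_finish_migration(struct retry_pte *pte,
                                       uint64_t new_phys, bool map_now)
{
    if (map_now) {
        pte->phys = new_phys;
        pte->mode = RETRY_PTE_MAPPED;
    } else {
        pte->mode = RETRY_PTE_NOISY;
    }
}
```

With this model, any number of accesses during the migration produce no 
interrupts, and at most one fault is needed afterwards if the PTE is 
left in non-silent retry.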

>
>>> But I think that this is a completely different operation which
>>> shouldn't be handled in the fault handler.
>> Right. If you have page table updates done to prepare for a CS, they can
>> depend on user fences. Page table updates done as part of the page fault
>> handler must not. Again, I think this could be handled by using separate
>> scheduler entities to avoid false dependencies.
>
> Agreed, but using a separate entity means that we are sending the 
> updates to a separate kernel thread first which then commits them to 
> the ring buffer.
>
> I was already a step further and thought that we can avoid this extra 
> overhead and write directly to the ring buffer.

OK, if we can justify the assumptions made in the direct submission 
code. Basically we can only rely on implicit synchronization with other 
operations that use direct submission. That means all page table 
migration would have to use direct submission. Or we can't migrate page 
tables and instead reallocate them every time.

>
>
>>> In those cases the fault handler would just silence the retry fault
>>> and continue crunching on other faults.
>> I don't think that's the right approach. If you have a retry fault for a
>> virtual address, it means you already have something running on the GPU
>> accessing it. It can't be something that depends on an in-flight page
>> table update, because the scheduler would not have emitted that to the
>> ring. You either need to fill in a valid address, or if there is nothing
>> mapped at that address (yet), treat it as an application error and
>> convert it into a no-retry-fault which kills the application.
>
> Mhm, and what do we do if we want to migrate a page to VRAM in a fault 
> handler?
>
> I mean that's what HMM is mostly all about, isn't it?

Not really. I don't think we need or want to migrate in a page fault 
handler. It's the other way around. Page faults may be the result of a 
migration, because it would trigger an MMU notifier that invalidates 
PTEs. Page faults are GPU-specific. Migrations can affect page tables on 
multiple GPUs that access the same pages.

We'll need to deal with some interesting cases when multiple GPUs fault 
on the same page nearly at the same time. In that case it may be better 
to leave that page in system memory where it can be accessed by multiple 
GPUs, rather than bouncing it around between GPUs. With XGMI we'd have 
even more options to consider.
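
Just to make the multi-GPU case concrete, a purely illustrative sketch 
of such a placement decision. Nothing like this exists yet. The time 
window, the GPU limit and all names are hypothetical, and XGMI is 
ignored. If more than one GPU faulted on the page within the window, the 
page stays in system memory, otherwise it is a candidate for the 
faulting GPU's VRAM.

```c
#include <assert.h>
#include <stdint.h>

#define PLACE_MAX_GPUS 8u
#define PLACE_WINDOW_NS 1000000ull /* hypothetical 1 ms window */

enum placement { PLACE_SYSTEM, PLACE_VRAM };

/* Per-page fault history. Zero-initialize before first use. */
struct place_history {
    uint64_t last_ns[PLACE_MAX_GPUS]; /* time of last fault per GPU */
    uint32_t seen;                    /* bitmask of GPUs that faulted */
};

/* Record a fault from @gpu at @now_ns (assumed monotonic) and decide
 * where the page should live. */
static enum placement place_on_fault(struct place_history *h,
                                     unsigned gpu, uint64_t now_ns)
{
    unsigned i, recent = 0;

    assert(gpu < PLACE_MAX_GPUS);
    h->last_ns[gpu] = now_ns;
    h->seen |= 1u << gpu;

    /* Count GPUs that touched the page within the window. */
    for (i = 0; i < PLACE_MAX_GPUS; i++) {
        if ((h->seen & (1u << i)) &&
            now_ns - h->last_ns[i] <= PLACE_WINDOW_NS)
            recent++;
    }

    /* Shared by several GPUs: avoid bouncing the page between them. */
    return recent > 1 ? PLACE_SYSTEM : PLACE_VRAM;
}
```

A result like this would only be a hint to the migration code. The 
decision itself stays outside the page fault handler.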

Initially I'm planning not to put too much intelligence into any 
automatic migration heuristics and rely more on hints from applications. 
Get the mechanics right first, then add policy and heuristics on top.

Regards,
   Felix

>
> Regards,
> Christian.
>
>>
>> Regards,
>>     Felix
>>
>