[Intel-gfx] [PATCH 7/7] drm/i915/gem: Acquire all vma/objects under reservation_ww_class
Christian König
ckoenig.leichtzumerken at gmail.com
Fri Jun 26 17:44:26 UTC 2020
Am 26.06.20 um 15:08 schrieb Chris Wilson:
> Quoting Christian König (2020-06-26 12:35:30)
>> Am 26.06.20 um 13:10 schrieb Chris Wilson:
>>> Quoting Christian König (2020-06-26 09:54:19)
>>> [SNIP]
>>> What about checkpoint/restore, suspend/resume? Where we need to suspend
>>> all execution, move all the resources to one side, then put everything
>>> back, without cancelling the fences. Same halting problem, no?
>> What are you talking about? Of course we either wait for all fences to
>> complete or cancel them on suspend.
> I do not want to have to cancel incomplete fences as we do today.
But this is a necessity. Putting aside half-executed fences and starting
them again later is not possible and most likely never will be.
> So why wait in the middle of submission, rather than defer the submission
> to the fence callback if the HW wasn't ready? You then have your
> uninterruptible continuation.
Because you don't wait in the middle of the submission, but rather before
the submission is made and before any resources or locks are acquired.
That's also the reason why it is illegal to wait for a fence to appear
while holding a reservation lock, and that is also something lockdep
should be able to point out.
See amdgpu_cs_ioctl() for an example of why this is necessary:
    r = amdgpu_cs_dependencies(adev, &parser);
    ...
    r = amdgpu_cs_parser_bos(&parser, data);
amdgpu_cs_dependencies() waits for the wait-before-signal fences to
appear, and amdgpu_cs_parser_bos() grabs the reservation locks.
Do it the other way around and lockdep should at least splat that this
has deadlock potential.
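To make the ordering concrete, here is a rough sketch of that part of the
ioctl (just the shape of it, not the actual amdgpu_cs_ioctl() code;
parser setup and error handling trimmed):

static int cs_ioctl_sketch(struct amdgpu_device *adev, void *data)
{
        struct amdgpu_cs_parser parser = {};
        int r;

        /* Wait for the wait-before-signal fences to appear; no
         * reservation (ww_mutex) locks are held at this point. */
        r = amdgpu_cs_dependencies(adev, &parser);
        if (r)
                return r;

        /* Only now grab the reservation locks of all BOs involved. */
        r = amdgpu_cs_parser_bos(&parser, data);
        if (r)
                return r;

        /* ... actual submission and fence publication happen here ... */
        return 0;
}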
And you are running into exactly the same case here as well, just a bit
more complicated because userspace is involved.
>>> But if you have chosen to cancel the fences, there is nothing to stop
>>> the signaling.
>> And just to repeat myself: You can't cancel the fence!
>>
>> For example assume that canceling the proxy fence would mean that you
>> send a SIGKILL to the process which issued it. But then you need to wait
>> for the SIGKILL to be processed.
> What? Where does SIGKILL come from for fence handling?
Sorry, that was just an example of how to handle it. A lock or an event is
also possible.
> The proxy fence is force signaled in an error state (e.g. -ETIMEDOUT),
> every waiter then inherits the error state and all of their waiters down
> the chain. Those waiters are now presumably ready to finish their own
> signaling.
That alone is illegal. See, currently fences are only allowed to signal
once all their previous dependencies have signaled, even in an error case.
This is because we replace all the fences in a dma_resv object when we
add a new exclusive one.
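As a minimal illustration of that rule (simplified, not a complete driver
path): the moment a driver publishes a new exclusive fence, that fence
stands in for everything the object tracked before, so later waiters only
ever see the new fence:

/* Caller is expected to hold the reservation lock. */
static void publish_job_fence(struct dma_resv *resv,
                              struct dma_fence *job_fence)
{
        dma_resv_assert_held(resv);

        /* All previously tracked fences are dropped from the object;
         * job_fence now stands in for them, so it may only signal -
         * even with an error - after they have signaled. */
        dma_resv_add_excl_fence(resv, job_fence);
}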
> The proxy fence is constructed to always complete if it does not get
> resolved; after resolution, the onus is on the real fence to complete.
But then it is not useful at all. See, in this case you can't wait on the
proxy fence at all.
In other words, when you try to wait and the underlying real submission
has not yet appeared, you must return with an error immediately.
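Spelled out as code, a waiter would have to look roughly like this
(purely hypothetical; proxy_fence_resolved() is a made-up predicate
standing in for "the real submission behind this fence has been
published"):

static long wait_published_fence(struct dma_fence *fence, long timeout)
{
        /* Real submission not published yet: must not block waiting
         * for it to appear, bail out with an error instead. */
        if (!proxy_fence_resolved(fence))
                return -EBUSY;

        return dma_fence_wait_timeout(fence, true, timeout);
}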
>>> However, I say that is under our control. We know what fences are in an
>>> execution context, just as easily as we know that we are inside an
>>> execution context. And yes, the easiest, the most restrictive way to
>>> control it is to say don't bother.
>> No, that is absolutely not under our control.
>>
>> dma_fences need to be waited on under a lot of different context,
>> including the reclaim path as well as the MMU notifiers, memory pressure
>> callbacks, OOM killer....
> Oh yes, they are under our control. That list boils down to reclaim,
> since mmu notifiers outside of reclaim are outside of a nested context.
Nested context is irrelevant here. Let's look at the following example:
We use dma_fence_proxy because userspace wants to do a delayed submission.
This dma_fence_proxy is attached to a dma_resv object because we need
the implicit dependency for DRI2/DRI3 handling.
Now the process calls fork() and an MMU notifier is triggered. This MMU
notifier then waits for the dma_resv object fences to complete.
But to complete the fences the fork() call needs to complete first ->
deadlock.
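Roughly what the notifier side of that chain looks like (modelled loosely
on what drivers do for userptr BOs, not copied from any of them;
lookup_bo_resv() is a made-up helper):

static int my_invalidate_range_start(struct mmu_notifier *mn,
                                     const struct mmu_notifier_range *range)
{
        /* made-up helper: find the reservation object of the BO
         * backing this range */
        struct dma_resv *resv = lookup_bo_resv(mn, range);
        long ret;

        /* Blocks until *all* fences on the object have signaled,
         * including a proxy fence whose real submission has not
         * appeared yet - while userspace is stuck inside fork(). */
        ret = dma_resv_wait_timeout_rcu(resv, true, false,
                                        MAX_SCHEDULE_TIMEOUT);
        return ret < 0 ? ret : 0;
}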
> That in particular is the same old question as whether GFP_IO should be
> a gfp_t or in the task_struct. If we are inside an execution context, we
> can track that and the fences on the task_struct if we wanted to,
> avoiding reclaim of fences being used by the outer context and their
> descendants...
Oh yes, that is correct and an absolutely brilliant example of why this
doesn't work :D
See, the difference is that in this case userspace is involved.
In other words, in your example you would set the GFP_IO flag in the
task_struct, return from your IOCTL, and then rely on the next IOCTL to
clear it again.
And that in turn is not something we can do.
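For comparison, the existing task-flag helpers only work because the flag
is scoped entirely inside one kernel section and restored before returning
to userspace. A minimal sketch, using the memalloc_noio helpers as the
analogue:

static void scoped_noio_section(void)
{
        unsigned int flags = memalloc_noio_save();

        /* Allocations in here implicitly lose GFP_IO... */

        /* ...and the flag is restored before we ever return to
         * userspace - it never lives across two IOCTLs. */
        memalloc_noio_restore(flags);
}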
>> No, as far as I can see you don't seem to either understand the problem
>> or the implications of it.
>>
>> The only way to solve this would be to audit the whole Linux kernel and
>> remove all uninterruptible waits and that is not feasible.
>>
>> As long as you don't provide me with a working solution to the problem
>> I've outlined here the whole approach is a clear NAK since it will allow
>> to create really bad kernel deadlocks.
> You are confusing multiple things here. The VkEvents example is real.
> How do you avoid that deadlock? We avoid it by not waiting in direct
> reclaim.
I'm perfectly aware of what you are trying to do here, because the AMD
engineers have suggested and tried the exact same thing. And yes, we have
already rejected that as well.
> It has also shown up any waits in our submit ioctl [prior to fence
> publication, I might add] for their potential deadlock with userspace.
No, that approach is provably deadlock-free.
See, as I explained with the amdgpu_cs example above, it is as simple as
waiting for the fences to appear without holding any memory-management
relevant locks.
As soon as you leak waiting for fences to appear into other parts of the
kernel, you make those parts depend on the welfare of the userspace
process, and that's what doesn't work.
Sorry for insisting on this so much, but we have already tried this
approach and discussed it more than once, and it really does not work
correctly.
Regards,
Christian.
> -Chris