[Intel-gfx] [PATCH 7/7] drm/i915/gem: Acquire all vma/objects under reservation_ww_class

Christian König christian.koenig at amd.com
Fri Jun 26 11:35:30 UTC 2020


On 26.06.20 at 13:10, Chris Wilson wrote:
> Quoting Christian König (2020-06-26 09:54:19)
> [SNIP]
>> In other words "fence -> userspace -> page fault -> fence" or "fence ->
>> userspace -> system call -> fence" can easily cause the same problem and
>> that is not avoidable.
>>
>>> An example
>>>
>>> Thread A                              Thread B
>>>
>>>        submit(VkCmdWaitEvents)
>>>        recvfrom(ThreadB)       ...     sendto(ThreadB)
>>>                                        \- alloc_page
>>>                                         \- direct reclaim
>>>                                          \- dma_fence_wait(A)
>>>        VkSetEvent()
>>>
>>> Regardless of that actual deadlock, waiting on an arbitrary fence incurs
>>> an unbounded latency which is unacceptable for direct reclaim.
>>>
>>> Online debugging can indefinitely suspend fence signaling, and the only
>>> guarantee we make of forward progress, in some cases, is process
>>> termination.
>> And exactly that is what doesn't work. You don't have any forward
>> progress any more because you ran into a software deadlock.
> Only one side is halted. Everything on that side comes to a grinding
> halt.
>
> What about checkpoint/restore, suspend/resume? Where we need to suspend
> all execution, move all the resources to one side, then put everything
> back, without cancelling the fences. Same halting problem, no?

What are you talking about? Of course we either wait for all fences to 
complete or cancel them on suspend.
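
To make that concrete, here is a rough sketch of the first option (the
helper name and timeout are made up for illustration;
dma_resv_wait_timeout_rcu() is the real interface):

    #include <linux/dma-resv.h>
    #include <linux/jiffies.h>

    /* Sketch: drain all fences on a BO's reservation object before
     * suspend. Wait for both shared and exclusive fences (wait_all),
     * uninterruptibly (!intr), but with a timeout so a stuck fence
     * fails suspend with an error instead of hanging it forever.
     */
    static int drain_bo_fences(struct dma_resv *resv)
    {
            long ret;

            ret = dma_resv_wait_timeout_rcu(resv, true, false,
                                            msecs_to_jiffies(2000));
            if (ret == 0)
                    return -ETIMEDOUT; /* fence never signaled */

            return ret < 0 ? ret : 0;
    }

And if the fences cannot complete, the remaining option is to cancel
them, e.g. dma_fence_set_error() followed by dma_fence_signal(), so
consumers see the error instead of waiting forever.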

> We also do similar for resets. Suspend the hanging context, move it and
> all dependent execution off to one side; record what we can, clean up
> what we have to, then move what remains of the execution back to finish
> signaling.

Yes, but this is not possible in this situation. In the bad case you 
have a kernel deadlock, and that can't be cleaned up in any way.

The only solution left in that situation is to reset the system, or at 
least reload the kernel, and that is not acceptable.

>> In other words the signaling of a fence depends on the welfare of
>> userspace. You can try to kill userspace, but this can wait for the
>> fence you try to signal in the first place.
> The only scenario that fits what you are describing here [userspace
> ignoring a signal] is if you used an uninterruptible wait. Under what
> circumstances during normal execution would you do that? If it's
> someone else's wait, a bug outside of our control.

Uninterruptible waits are a necessity.

Just take a look at the dma_fence_wait() interface. Why do you think we 
have the ability to wait uninterruptibly there?

We need this when there is no other way of recovering. For example, when 
operations have already been partially flushed to the hardware and can't 
be aborted any more.
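
For reference, this is the interface I mean (the signatures are from 
include/linux/dma-fence.h; the surrounding helper is a made-up sketch):

    #include <linux/dma-fence.h>

    /* Hypothetical helper showing both flavours of the wait. */
    static int example_waits(struct dma_fence *fence)
    {
            long ret;

            /* Interruptible: a signal aborts the wait; the caller
             * must be able to back out and restart, typically from
             * a syscall.
             */
            ret = dma_fence_wait(fence, true);
            if (ret == -ERESTARTSYS)
                    return ret;

            /* Uninterruptible: nothing aborts this, not even
             * SIGKILL. This is what we are forced to use once the
             * operation is partially flushed to the hardware and
             * cannot be undone.
             */
            dma_fence_wait(fence, false);
            return 0;
    }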

> But if you have chosen to cancel the fences, there is nothing to stop
> the signaling.

And just to repeat myself: You can't cancel the fence!

For example, assume that canceling the proxy fence means sending a 
SIGKILL to the process which issued it. But then you need to wait for 
the SIGKILL to be processed.

Now what can happen is that the process is waiting uninterruptibly for 
something which in turn needs the SIGKILL to be delivered -> kernel deadlock.
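
The mechanics behind that, as a generic sketch (the wait queue and flag 
are hypothetical, not tied to any driver): a task blocked in 
TASK_UNINTERRUPTIBLE never checks its pending signals, so the SIGKILL 
is simply never delivered:

    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(wq);
    static bool done;

    /* TASK_UNINTERRUPTIBLE: a pending SIGKILL is ignored; the task
     * only wakes when @done becomes true. If @done can only become
     * true once the killed process makes progress, that is the
     * deadlock.
     */
    static void wait_uninterruptible(void)
    {
            wait_event(wq, done);
    }

    /* TASK_KILLABLE: at least fatal signals break the wait, but
     * nobody can audit the whole kernel to use this everywhere.
     */
    static int wait_killable(void)
    {
            return wait_event_killable(wq, done);
    }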

>> See, the difference to a deadlock on the GPU is that you can always
>> kill a running job or process even if it is stuck on something else.
>> But if the kernel is deadlocked with itself you can't kill the process
>> any more; the only option left to get cleanly out of this is to reboot
>> the kernel.
> However, I say that is under our control. We know what fences are in an
> execution context, just as easily as we know that we are inside an
> execution context. And yes, the easiest, the most restrictive way to
> control it is to say don't bother.

No, that is absolutely not under our control.

dma_fences need to be waited on in a lot of different contexts, 
including the reclaim path as well as MMU notifiers, memory pressure 
callbacks, the OOM killer....

Just see Daniel's patches on the lockdep fence signaling annotations and 
the problems that work has bubbled up.
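
(For anyone who hasn't followed that series: it wraps the code that must 
run between publishing a fence and signaling it in a lockdep-visible 
critical section. A minimal sketch of the proposed annotations; the 
job-submission step is made up:)

    /* Everything between begin and end must be able to complete
     * without blocking on memory allocation or on locks that are
     * also held while waiting on this fence; lockdep can now
     * report such inversions.
     */
    static void submit_and_signal(struct dma_fence *fence)
    {
            bool cookie;

            cookie = dma_fence_begin_signalling();
            /* ... push the job to the hardware ... */
            dma_fence_signal(fence);
            dma_fence_end_signalling(cookie);
    }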

>> The only way to avoid this would be to never ever wait for the fence in
>> the kernel and then your whole construct is not useful any more.
> I advocate for moving as much as is feasible (some waits are required
> by userspace as a necessary evil) into the parallelised pipeline.
>
>> I'm running out of ideas how to explain what the problem is here....
> Oh, we agree on the problem; we appear to disagree that the implicit waits
> themselves are a serious existing problem, and that they are worth the
> effort to avoid or, at least, mitigate.

No, as far as I can see you don't seem to understand either the problem 
or its implications.

The only way to solve this would be to audit the whole Linux kernel and 
remove all uninterruptible waits, and that is not feasible.

As long as you don't provide me with a working solution to the problem 
I've outlined here, the whole approach is a clear NAK, since it makes it 
possible to create really bad kernel deadlocks.

Sorry to say it, but this whole thing doesn't look like it has been 
thought through to the end. You should probably take a step back and 
talk to Daniel here.

Regards,
Christian.

> -Chris


