[Intel-gfx] [PATCH 7/7] drm/i915/gem: Acquire all vma/objects under reservation_ww_class

Thu Jun 25 17:42:41 UTC 2020

Quoting Christian König (2020-06-25 16:47:09)
> Am 25.06.20 um 17:10 schrieb Chris Wilson:
> > We have the DAG of fences, we can use that information to avoid adding
> > an implicit coupling between execution contexts.
> 
> No, we can't. And it sounds like you still have not understood the 
> underlying problem.
> 
> See this has nothing to do with the fences itself or their DAG.
> 
> When you depend on userspace to do another submission so your fence can 
> start processing you end up depending on whatever userspace does.

The HW dependency on userspace is explicit in the ABI, in the client APIs,
and in the direct control userspace has over the HW.

> This in turn means when userspace calls a system call (or does page 
> fault) it is possible that this ends up in the reclaim code path.

We have both said the very same thing.

> And while we want to avoid it both Daniel and I already discussed this 
> multiple times and we agree it is still a must have to be able to do 
> fence waits in the reclaim code path.

But we came to the opposite conclusion, for doing that wait harms the
unrelated caller and the reclaim is opportunistic. There is no need for
that caller to reclaim that page, when it can have any other. Why did you
even choose that page to reclaim? Inducing latency in the caller is a bug,
has been reported previously as a bug, and is still considered a bug. [But
at the end of the day, if the system is out of memory, then you have to
pick a victim.]

> So what happens is that you have a dependency between fence submission 
> -> userspace -> reclaim path -> fence submission. And that is a circle 
> dependency, no matter what your DAG looks like.

Sigh. We have both said the very same thing.

> In other words this whole approach does not work, is a clear NAK and I 
> can only advise Dave to *not* merge it.

If you are talking about the proxy, then it looks like this [if you
insist on having that wait in the reclaim]
1. userspace submits request, waiting for the future
2. other thread that is due to signal, enters kernel, hits direct reclaim,
waits for the future fence [because you insist on this when it is not
necessary and is an unbounded latency issue for general cases],
3. times out

vs

1. userspace submits wait-for-submit; blocks
2. other thread enters kernel and waits for reclaim on another arbitrary
fence, or anything, could even be waiting for a signal from 1.
3. times out
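
As a rough sketch (this is not i915 code, and every name in it is made up
for illustration), both sequences above can be modelled as a wait-for
graph in which each party blocks on at most one other. The timeout in
step 3 then corresponds to a cycle in that graph:

```python
def find_cycle(waits_for, start):
    """Follow wait-for edges from `start`.

    `waits_for` maps each (hypothetical) party to the single thing it is
    blocked on. Returns the cycle as a list of names, or None if the
    chain of waits terminates.
    """
    path = []   # parties visited so far, in order
    seen = {}   # party -> index in `path`
    node = start
    while node is not None:
        if node in seen:
            # We came back to something already on the path: a cycle.
            return path[seen[node]:] + [node]
        seen[node] = len(path)
        path.append(node)
        node = waits_for.get(node)
    return None


# Proxy case, with the fence wait in direct reclaim.
proxy_with_reclaim_wait = {
    "request": "future_fence",    # 1. request waits for the future
    "future_fence": "signaller",  # only the other thread can signal it
    "signaller": "future_fence",  # 2. that thread waits on it in reclaim
}

# Same proxy case, but reclaim does not wait on the fence.
proxy_without_reclaim_wait = {
    "request": "future_fence",
    "future_fence": "signaller",
}

# Wait-for-submit case, no proxy involved.
wait_for_submit = {
    "waiter": "signaller",         # 1. blocks until the other submits
    "signaller": "reclaim_fence",  # 2. other thread waits in reclaim
    "reclaim_fence": "waiter",     # that fence needs a signal from 1.
}

print(find_cycle(proxy_with_reclaim_wait, "request"))
print(find_cycle(proxy_without_reclaim_wait, "request"))
print(find_cycle(wait_for_submit, "waiter"))
```

Under these assumptions the cycle appears in both the first and the last
graph, and it goes away in the proxy case only when the reclaim wait is
dropped.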

Userspace directly controls fence signaling. Any wait whatsoever could
be a deadlock on a resource that is outside of our [immediate] control.
Further, if that wait is underneath a mutex or other semaphore that it
can cause another client to contend on, it is now able to inject its
deadlock into an unwitting partner.
-Chris