[Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

Fri Jun 19 08:25:11 UTC 2020

Quoting Daniel Stone (2020-06-11 10:01:46)
> Hi,
> 
> On Thu, 11 Jun 2020 at 09:44, Dave Airlie <airlied at gmail.com> wrote:
> > On Thu, 11 Jun 2020 at 18:01, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> > > Introducing a global lockmap that cannot capture the rules correctly,
> >
> > Can you document the rules all drivers should be following then,
> > because from here it looks to get refactored every version of i915,
> > and it would be nice if we could all aim for the same set of things
> > roughly. We've already had enough problems with amdgpu vs i915 vs
> > everyone else with fences, if this stops that in the future then I'd
> > rather we have that than just some unwritten rules per driver and
> > untestable.
> 
> As someone who has sunk a bunch of work into explicit-fencing
> awareness in my compositor so I can never be blocked, I'd be
> disappointed if the infrastructure was ultimately pointless because
> the documented fencing rules were \_o_/ or thereabouts. Lockdep
> definitely isn't my area of expertise so I can't comment on the patch
> per se, but having something to ensure we don't hit deadlocks sure
> seems a lot better than nothing.

This is doing dependency analysis on execution contexts which is a far
cry from doing the fence dependency analysis, and so has to actively
ignore the cycles that must exist on the dma side, and also the cycles
that prevent entering execution contexts on the CPU. It has to actively
ignore scheduler execution contexts, for lockdep cries, and so we do not
get analysis of the locking contexts along that path. This would be
solvable along the lines of extending lockdep ala lockdep_dma_enter().

Had i915's execution flow been marked up, it should have found the
dubious wait for external fences inside the dead GPU recovery, and
probably found a few more things to complain about with the reset locking.
[Note we already do the same annotations for wait-vs-reset, but not
reset-vs-execution.]

Determination of which waits are legal and which are not is entirely ad
hoc, for there is no status change tracking in the dependency analysis
[that is once an execution context is linked to a published fence, again
integral to lockdep.] Consider if the completion chain in atomic is
swapped out for the morally equivalent fences along intertwined timelines,
and so it does a bunch of dma_fence_wait() instead. Why are those waits
legal despite them being after we have committed to fulfilling the out
fence? [Why are the waits on and for the GPU legal, since they equally
block execution flow?]

Forcing a generic primitive to always be part of the same global map is
horrible. You forgo being able to use the primitive for unrelated tasks,
lose the ability to name particular contexts to gain more informative
dependency cycle reports from having the explicit linkage. You can add
wait_map tracking without loss of generality [in less than 10 lines],
and you can still enforce that all fences used for a common purpose
follow the same rules [the simplest way being to default to the singular
wait_map]. But it's the explicitly named execution contexts that are the
biggest boon to reading the code and reading the lockdep warns.

This is a bunch of ad hoc tracking for a very narrow purpose applied
globally, with loss of information.
-Chris