[PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler is torn down

Mon Mar 13 20:11:14 UTC 2023

On Fri, 2023-03-10 at 18:58 +0900, Asahi Lina wrote:
> On 10/03/2023 04.59, Faith Ekstrand wrote:
> > On Thu, 2023-03-09 at 18:43 +0900, Asahi Lina wrote:
> > > On 09/03/2023 17.42, Christian König wrote:
> > > > Long story short: Don't do this! This is what the Windows
> > > > drivers
> > > > have 
> > > > been doing and it creates tons of problems.
> > 
> > Yeah, we tried to do a bit of that in the GL days.  It was a bad
> > idea.
> 
> I think I should clarify: I was proposing re-queueing innocent jobs
> from
> innocent queues/VMs that were impacted by a fault. The reason is that
> we
> may be able to tweak firmware state to force it to do that safely,
> during the firmware recovery cycle, such that an aborted job restarts
> and then subsequent jobs/commands continue as normal. We can't leave
> it
> to userspace because if we do nothing, the affected job ends up
> incomplete but then everything after it that is already queued still
> runs, and that is definitely a recipe for a bigger mess if userspace
> wants to seamlessly recover. The firmware recovery cycle is a
> "stop-the-world" situation for the GPU (the firmware literally
> busy-loops waiting for the driver to set a continue flag in
> memory...),
> so that's the only real chance that the driver gets to make decisions
> about what is going to happen next.

Ok, that makes sense.  Yes, if you have other jobs on other queues and
are able to recover everything that isn't in the faulting VM, that's a
good thing.  I wasn't sure how hang/fault recovery worked on AGX.  In
tat case, I don't think there's a dma_fence problem.  As long as you
keep recovering and killing off any faulting contexts, eventually the
good contexts should make progress and those fences should signal.

Of course, the firmware recovery cycle may be complex and need (or at
least appear to) memory allocation or similar and that's where
everything gets hairy.  Hopefully, though, if you've already got the
resources from the old context, you can re-use them after a bit of
clean-up work and still get deterministic and reliable recovery cycles.

> Of course, that only works if individual possibly concurrently
> running
> commands are idempotent, but I think a lot of typical GPU work is?

No, that's not a valid assumption.  For a single 3D render pass which
doesn't do any image or SSBO access, it may be possible to re-run it. 
However, that won't be true of compute work and isn't necessarily true
of back-to-back passes. Lots of modern apps do temporal stuff where one
frame depends on the previous and a re-run might screw that up. Also,
with Vulkan's memory aliasing, it's hard to tell just from which
resources are accessed whether or not a command buffer leaves its input
memory undamaged.

> (E.g.
> any render pass without side effects other than the render targets
> and
> where the background shader does no loads, or even render passes that
> do
> loads but where all draws are opaque, which are all things the
> current
> Gallium driver is intimately familiar with since Crazy Tiler
> Optimizations™ need that info to be provided anyway). So I was
> wondering
> whether it'd make sense to have such an idempotency/restartable flag
> on
> job submission, and then the driver would do its best to recover and
> rerun it if it gets killed by an unrelated concurrent bad job.
> 
> Then again this all depends on an investigation into what we *can* do
> during firmware recovery that hasn't happened at all yet. It might be
> that it isn't safe to do anything really, or that doing things
> depends
> on touching even deeper firmware state structs that we treat as
> opaque
> right now and we really don't want to have to touch...
> 
> But maybe none of this is worth it in practice, it just sounded like
> it
> could be useful maybe?

Maybe? It's not clear to me that such a flag would be useful or even
practical to provide from the Mesa side.  Ideally, you'd be able to
figure out when a fault happens, what VM it happened in and exactly
what work was in-flight when it happened and only kill the one guilty
VM.  However, it sounds like your understanding of the firmware is
currently rough enough that doing so may not be practical.  In that
case, the best thing to do is to kill any VMs which were on the GPU at
the time and hope the individual apps are able to recover.

> Now that I look at it, we have a lovely "what is this flag doing
> anyway"
> bit already passed from Mesa through to the firmware we called
> ASAHI_RENDER_SET_WHEN_RELOADING_Z_OR_S which, now that I look at it,
> is
> actually getting set when any attachment (any color, Z, S) is not
> being
> cleared for that pass (so it's loaded). That could very well be an
> "is
> not idempotent" flag... and maybe that means the firmware does this
> for
> us already? Sounds like something to test... I might have some
> 16Kx16K
> GLmark runs to do concurrent with an evil faulting job now ^^ (and
> then
> that also means we need to set it when shaders have side effects and
> stuff, which right now we don't).
> 
> > > > Just signal the problem back to userspace and let the user
> > > > space
> > > > driver 
> > > > decide what to do.
> > > > 
> > > > The background is that most graphics applications (games etc..)
> > > > then 
> > > > rather start on the next frame instead of submitting the
> > > > current
> > > > one 
> > > > again while compute applications make sure that the abort and
> > > > tell
> > > > the 
> > > > user that the calculations might be corrupted and need to be
> > > > redone.
> > 
> > The guarantee that Vulkan makes is that, if you idle the GPU and
> > you
> > haven't gotten a DEVICE_LOST yet, your data is good.  If you get a
> > DEVICE_LOST, all bets are off.  The problem is that, no matter how
> > fast
> > the error propagation may be in the kernel or userspace driver,
> > errors
> > can still show up in strange ways.  An OOB buffer access could end
> > up
> > modifying a shader binary which gets run 3 frames later and causes
> > a
> > corruption.  Once you've faulted, you really have no idea how far
> > back
> > is good or what memory is corrupted.  You have to assume that
> > everything mapped to the GPU VA space is potentially toast.
> 
> Yes of course, for the actually faulting VM all bets are off after a
> fault (though we can try a bit harder at least... I have a READ_ONLY
> BO
> flag now, I should set it on the shader pools!).
> 
> > > Actually I wanted to ask about error notifications. Right now we
> > > have
> > > an
> > > out-of-band mechanism to provide detailed fault info to userspace
> > > which
> > > works fine, but in principle it's optional.
> > 
> > This is fine, in principal.  Because of the nature of errors, async
> > is
> > fine as long as the error shows up eventually.  Faster is better,
> > for
> > sure, but error latency doesn't really matter in practice.
> > 
> > > However, I also mark the hw
> > >  fences as errored when a fault happens (with an errno that
> > > describes
> > > the overall situation), but that never makes it into the
> > > drm_sched
> > > job
> > > complete fence. I looked at the drm_sched code and I didn't see
> > > any
> > > error propagation. Is that supposed to work, or am I supposed to
> > > directly mark the drm_sched side fence as complete, or did I
> > > misunderstand all this? I get the feeling maybe existing drivers
> > > just
> > > rely on the recovery/timeout/etc paths to mark jobs as errored
> > > (since
> > > those do it explicitly) and never need error forwarding from the
> > > hw
> > > fence?
> > 
> > The end behavior needs to be that all fences for all jobs submitted
> > to
> > the queue get signaled.  That's needed to satisfy the finite time
> > guarantees of dma_fence.  Exactly how that happens (let the job
> > run,
> > abort all the jobs, etc.) is an implementation detail for the
> > driver to
> > decide.  If you want, you can also set a bit on the context (or
> > queue)
> > to mark it as dead and start returning EIO or similar from any
> > ioctls
> > trying to submit more work if you wanted.  Not required but you
> > can.
> 
> Fences have an error flag though, does that get reported to userspace
> somehow? I thought it did, but maybe not, or maybe only drm_sched not
> propagating it is the issue?
> 
> In other words, absent my fancy stats reporting BO system, what is
> the
> normal way that an explicit sync driver signals to userspace that the
> job associated with a syncobj has failed?

One is via the return value from exec/submit.  Often there's also a
query mechanism for more detailed information.  It's not particularly
standard at the moment, I'm afraid.  I could point you at i915 but I
wouldn't call that uAPI something to be emulated, in general.

> (If there is no way, then I'll probably want to change the stats BO
> system to be configurable, so if you ask for no stats/time info, you
> only get overall job status and faults, which has less overhead.)

There is an error but it doesn't automatically get propagated to
userspace.  So, for instance, a SYNCOBJ_WAIT ioctl won't return an
error if it sees a fence error.  It needs to get caught by the driver
and returned through a driver ioctl somehow.

~Faith