[PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler is torn down

Thu Mar 9 19:59:14 UTC 2023

On Thu, 2023-03-09 at 18:43 +0900, Asahi Lina wrote:
> On 09/03/2023 17.42, Christian König wrote:
> > Am 08.03.23 um 20:37 schrieb Asahi Lina:
> > > On 09/03/2023 03.12, Christian König wrote:
> > > > Am 08.03.23 um 18:32 schrieb Asahi Lina:
> > > > > [SNIP]
> > > > > Yes but... none of this cleans up jobs that are already
> > > > > submitted by the
> > > > > scheduler and in its pending list, with registered completion
> > > > > callbacks,
> > > > > which were already popped off of the entities.
> > > > > 
> > > > > *That* is the problem this patch fixes!
> > > > Ah! Yes that makes more sense now.
> > > > 
> > > > > > > We could add a warning when users of this API don't do
> > > > > > this
> > > > > > correctly, but cleaning up incorrect API use is clearly
> > > > > > something we
> > > > > > don't want here.
> > > > > It is the job of the Rust abstractions to make incorrect API
> > > > > use that
> > > > > leads to memory unsafety impossible. So even if you don't
> > > > > want that in
> > > > > C, it's my job to do that for Rust... and right now, I just
> > > > > can't
> > > > > because drm_sched doesn't provide an API that can be safely
> > > > > wrapped
> > > > > without weird bits of babysitting functionality on top (like
> > > > > tracking
> > > > > jobs outside or awkwardly making jobs hold a reference to the
> > > > > scheduler
> > > > > and defer dropping it to another thread).
> > > > Yeah, that was discussed before but rejected.
> > > > 
> > > > The argument was that upper layer needs to wait for the hw to
> > > > become
> > > > idle before the scheduler can be destroyed anyway.
> > > Unfortunately, that's not a requirement you can encode in the
> > > Rust type
> > > system easily as far as I know, and Rust safety rules mean we
> > > need to
> > > make it safe even if the upper layer doesn't do this... (or else
> > > we have
> > > to mark the entire drm_sched abstraction unsafe, but that would
> > > be a pity).
> > 
> > Yeah, that should really not be something we should do.
> > 
> > But you could make the scheduler depend on your fw context object,
> > don't 
> > you?
> 
> Yes, and that would fix the problem for this driver, but it wouldn't
> make the abstraction safe. The thing is we have to make it
> *impossible*
> to misuse drm_sched in such a way that it crashes, at the Rust
> abstraction level. If we start depending on the driver following
> rules
> like that, that means the drm_sched abstraction has to be marked
> unsafe.
> 
> > Detaching the scheduler from the underlying hw fences is certainly 
> > possible, but we removed that functionality because some people
> > tried to force push some Windows recovery module into Linux. We are
> > in 
> > the process of reverting that and cleaning things up once more, but
> > that 
> > will take a while.
> 
> Okay, but I don't see why that should block the Rust abstractions...
> I
> don't even need a new API to do that, all I need is to know that
> drm_sched_fini() will do it so it won't crash when the hw fences
> complete later, as this patch does.
> 
> > Instead of detaching you could also block for the hw to become
> > idle, but 
> > if you do that synchronous on process termination you run into
> > trouble 
> > as well.
> 
> Yes, but again this is something that can only be done at the driver
> level
> so it doesn't solve the safe abstraction problem...
> 
> > > The firmware queue is itself reference counted and any firmware
> > > queue
> > > that has acquired an event notification resource (that is, which
> > > is busy
> > > with running or upcoming jobs) hands off a reference to itself
> > > into the
> > > event subsystem, so it can get notified of job completions by the
> > > firmware. Then once it becomes idle it unregisters itself, and at
> > > that
> > > point if it has no owning userspace queue, that would be the last
> > > reference and it gets dropped. So we don't tear down firmware
> > > queues
> > > until they are idle.
> > 
> > And could those fw queues not reference the scheduler?
> 
> Yes but again, that rule can't be encoded in the abstraction... so
> that
> makes it unsafe. The goal is to have a safe abstraction, which means
> that all the rules that you need to follow to avoid memory safety
> issues
> are checked by the Rust compiler.
> 
> > > I actually don't know of any way to actively abort jobs on the
> > > firmware,
> > > so this is pretty much the only option I have. I've even seen
> > > long-running compute jobs on macOS run to completion even if you
> > > kill
> > > the submitting process, so there might be no way to do this at
> > > all.
> > > Though in practice since we unmap everything from the VM anyway
> > > when the
> > > userspace stuff gets torn down, almost any normal GPU work is
> > > going to
> > > immediately fault at that point (macOS doesn't do this because
> > > macOS
> > > effectively does implicit sync with BO tracking at the kernel
> > > level...).
> > 
> > Oh, that is interesting information. How does macOS do explicit
> > sync 
> > then or isn't that supported at all?
> 
> They have the equivalent of sync objects at the UAPI level, but they
> also have the implicit stuff and their UAPI seems to always pass a BO
> list to the kernel as far as we could tell, even though it still
> works
> without it. I think it's a weird hybrid of explicit+implicit sync.
> From
> the Metal docs:
> 
> > By default, Metal tracks the write hazards and synchronizes the
> > resources
> > (see Resource Fundamentals) you create from an MTLDevice and
> > directly bind
> > to a pipeline. However, Metal doesn’t, by default, track resources
> > you
> > allocate from an MTLHeap (see Memory Heaps).
> 
> So it's both, and you can override it...
> 
> At the firmware level, I've never seen Metal use queue barriers yet
> like
> I do (other than the vertex->fragment ones), so either they always do
> CPU round trips for cross-subqueue sync (render<->compute) or we just
> haven't figured out the magic combination to get it to do that yet.
> Honestly, I suspect they just always do it on the CPU. macOS is
> pretty
> ugly behind the scenes and it's pretty obvious a lot of their own
> driver
> was rushed (the firmware seems to support quite a few features the
> driver doesn't... maybe it even has a job abort mechanism, we just
> haven't found it yet).
> 
> Of course, our goal is to do things better than macOS (and we already
> do
> some things better!) but getting confident enough about firmware/HW
> details to diverge from what macOS does is tricky and a slow
> process...
> 
> > > By the way, I don't really use the hardware recovery stuff right
> > > now.
> > > I'm not even sure if there is a sensible way I could use it,
> > > since as I
> > > said we can't exactly abort jobs. I know there are ways to lock
> > > up the
> > > firmware/GPU, but so far those have all been things the kernel
> > > driver
> > > can prevent, and I'm not even sure if there is any way to recover
> > > from
> > > that anyway. The firmware itself has its own timeouts and
> > > recovery for
> > > "normal" problems. From the point of view of the driver and
> > > everything
> > > above it, in-flight commands during a GPU fault or timeout are
> > > just
> > > marked complete by the firmware, after a firmware recovery cycle
> > > where
> > > the driver gets notified of the problem (that's when we mark the
> > > commands failed so we can propagate the error).
> > 
> > Yeah, that's exactly what we have been telling our fw people for
> > years: that we 
> > need this as well.
> 
> Yeah, the ugly bit is that the firmware does a full GPU recovery even
> on
> simple page faults (which could be handled more gracefully) so even
> stuff like that can possibly break concurrent GPU work.
> 
> On the other hand, macOS configures things so page faults are ignored
> and silently return all-00 on reads for shader accesses, which is how
> they implement sparse buffers/textures... and we'll probably have to
> do
> that to improve reliability against app faults if nothing else. But
> right now the driver enables explicit page faults for everything so
> we
> can debug Mesa (it's a kernel module param, GPU global and I haven't
> found a way to change it after initial load unfortunately, but it
> might
> be possible).
> 
> I think there's also a way to do actual page fault handling (like
> swap
> in pages and resume the GPU), but that's one of those firmware
> features
> Apple's driver just never uses as far as I can tell. There's so much
> unexplored territory...
> 
> > 
> > > There is no re-submission or anything, userspace just gets told
> > > of the problem but
> > > the queue survives.
> > 
> > > In the future it might be possible to re-submit innocent commands
> > 
> > Long story short: Don't do this! This is what the Windows drivers
> > have 
> > been doing and it creates tons of problems.

Yeah, we tried to do a bit of that in the GL days.  It was a bad idea.

> > Just signal the problem back to userspace and let the user space
> > driver 
> > decide what to do.
> > 
> > The background is that most graphics applications (games etc..)
> > then 
> > rather start on the next frame instead of submitting the current
> > one 
> > again while compute applications make sure that they abort and tell
> > the 
> > user that the calculations might be corrupted and need to be
> > redone.

The guarantee that Vulkan makes is that, if you idle the GPU and you
haven't gotten a DEVICE_LOST yet, your data is good.  If you get a
DEVICE_LOST, all bets are off.  The problem is that, no matter how fast
the error propagation may be in the kernel or userspace driver, errors
can still show up in strange ways.  An OOB buffer access could end up
modifying a shader binary which gets run 3 frames later and causes a
corruption.  Once you've faulted, you really have no idea how far back
is good or what memory is corrupted.  You have to assume that
everything mapped to the GPU VA space is potentially toast.

> Then we're good with what we're currently doing, since we already
> notify
> userspace like that!
> 
> Actually I wanted to ask about error notifications. Right now we have
> an
> out-of-band mechanism to provide detailed fault info to userspace
> which
> works fine, but in principle it's optional.

This is fine, in principle.  Because of the nature of errors, async is
fine as long as the error shows up eventually.  Faster is better, for
sure, but error latency doesn't really matter in practice.

> However, I also mark the hw
> fences as errored when a fault happens (with an errno that describes
> the overall situation), but that never makes it into the drm_sched
> job
> complete fence. I looked at the drm_sched code and I didn't see any
> error propagation. Is that supposed to work, or am I supposed to
> directly mark the drm_sched side fence as complete, or did I
> misunderstand all this? I get the feeling maybe existing drivers just
> rely on the recovery/timeout/etc paths to mark jobs as errored (since
> those do it explicitly) and never need error forwarding from the hw
> fence?

The end behavior needs to be that all fences for all jobs submitted to
the queue get signaled.  That's needed to satisfy the finite time
guarantees of dma_fence.  Exactly how that happens (let the job run,
abort all the jobs, etc.) is an implementation detail for the driver to
decide.  If you want, you can also set a bit on the context (or queue)
to mark it as dead and start returning EIO or similar from any ioctls
trying to submit more work.  Not required but you can.
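
As a rough sketch of that last point, here is one way the "signal
everything, then refuse new work" behavior could look.  This is plain
userspace C, not real drm_sched or dma_fence code: every demo_* name
is made up for illustration, and aborting all pending jobs on a fault
is only one of the options mentioned above (letting them run to
completion would be just as valid).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dma_fence: a signaled flag plus an error. */
struct demo_fence {
    bool signaled;
    int error;                      /* 0, or a negative errno */
};

#define DEMO_MAX_JOBS 8

/* Hypothetical queue/context state; not a real drm_sched structure. */
struct demo_queue {
    bool dead;                      /* the "bit on the context (or queue)" */
    size_t njobs;
    struct demo_fence *pending[DEMO_MAX_JOBS];
};

/* Submit path: refuse new work once the queue has been marked dead. */
static int demo_queue_submit(struct demo_queue *q, struct demo_fence *f)
{
    if (q->dead)
        return -EIO;
    if (q->njobs == DEMO_MAX_JOBS)
        return -ENOSPC;
    f->signaled = false;
    f->error = 0;
    q->pending[q->njobs++] = f;
    return 0;
}

/*
 * Fault path: every fence that was handed out gets signaled (with an
 * error attached), so no waiter can block forever, and the queue is
 * marked dead so later submissions fail.
 */
static void demo_queue_fault(struct demo_queue *q, int error)
{
    for (size_t i = 0; i < q->njobs; i++) {
        q->pending[i]->error = error;
        q->pending[i]->signaled = true;
    }
    q->njobs = 0;
    q->dead = true;
}

/* Small walk-through of the two paths; returns 0 if all checks pass. */
int demo_example(void)
{
    struct demo_queue q = {0};
    struct demo_fence a, b, c;

    assert(demo_queue_submit(&q, &a) == 0);
    assert(demo_queue_submit(&q, &b) == 0);

    demo_queue_fault(&q, -EIO);

    /* Both in-flight fences were signaled and carry the error. */
    assert(a.signaled && a.error == -EIO);
    assert(b.signaled && b.error == -EIO);

    /* New work is rejected from now on. */
    assert(demo_queue_submit(&q, &c) == -EIO);
    return 0;
}
```

The only property that really matters here is that nothing handed to
demo_queue_submit() is ever left unsignaled; the dead flag and the
-EIO return are the optional part.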

~Faith