[Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Fri Apr 30 08:58:22 UTC 2021

On Thu, Apr 29, 2021 at 1:12 PM Daniel Vetter <daniel at ffwll.ch> wrote:
>
> On Wed, Apr 28, 2021 at 04:39:24PM -0400, Alex Deucher wrote:
> > On Wed, Apr 28, 2021 at 10:35 AM Daniel Vetter <daniel at ffwll.ch> wrote:
> > >
> > > On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:
> > > > Am 28.04.21 um 15:34 schrieb Daniel Vetter:
> > > > > On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> > > > > > Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > > > > > > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > > > > > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > > > > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > > > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > > > > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact at emersion.fr> wrote:
> > > > > > > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach at pengutronix.de> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > > Ok. So that would only make the following use cases broken for now:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - amd render -> external gpu
> > > > > > > > > > > > > > - amd video encode -> network device
> > > > > > > > > > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > > > > > > > > > unhappy
> > > > > > > > > > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > > > > > > > > > AMD hardware.
> > > > > > > > > > > >
> > > > > > > > > > > > Note, if this brokenness can't be avoided, I'd prefer to get a clear
> > > > > > > > > > > > error, and not bad results on screen because nothing is synchronized
> > > > > > > > > > > > anymore.
> > > > > > > > > > > It's an upcoming requirement for windows[1], so you are likely to
> > > > > > > > > > > start seeing this across all GPU vendors that support windows.  I
> > > > > > > > > > > think the timing depends on how long the legacy hardware support
> > > > > > > > > > > sticks around for each vendor.
> > > > > > > > > > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > > > > > > > > > support isolating the ringbuffer at all.
> > > > > > > > > >
> > > > > > > > > > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > > > > > > > > > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > > > > > > > > > pte flags. Otherwise the entire "share address space with cpu side,
> > > > > > > > > > seamlessly" thing is out of the window.
> > > > > > > > > >
> > > > > > > > > > And with that r/o bit on the ringbuffer you can once more force submit
> > > > > > > > > > through kernel space, and all the legacy dma_fence based stuff keeps
> > > > > > > > > > working. And we don't have to invent some horrendous userspace fence based
> > > > > > > > > > implicit sync mechanism in the kernel, but can instead do this transition
> > > > > > > > > > properly with drm_syncobj timeline explicit sync and protocol reving.
> > > > > > > > > >
> > > > > > > > > > At least I think you'd have to work extra hard to create a gpu which
> > > > > > > > > > cannot possibly be intercepted by the kernel, even when it's designed to
> > > > > > > > > > support userspace direct submit only.
> > > > > > > > > >
> > > > > > > > > > Or are your hw engineers more creative here and we're screwed?
> > > > > > > > > The upcoming hardware generation will have this hardware scheduler as a
> > > > > > > > > must have, but there are certain ways we can still stick to the old
> > > > > > > > > approach:
> > > > > > > > >
> > > > > > > > > 1. The new hardware scheduler currently still supports kernel queues which
> > > > > > > > > essentially is the same as the old hardware ring buffer.
> > > > > > > > >
> > > > > > > > > 2. Mapping the top level ring buffer into the VM at least partially solves
> > > > > > > > > the problem. This way you can't manipulate the ring buffer content, but the
> > > > > > > > > location for the fence must still be writeable.
> > > > > > > > Yeah allowing userspace to lie about completion fences in this model is
> > > > > > > > ok. Though I haven't thought through full consequences of that, but I
> > > > > > > > think it's not any worse than userspace lying about which buffers/address
> > > > > > > > it uses in the current model - we rely on hw vm ptes to catch that stuff.
> > > > > > > >
> > > > > > > > Also it might be good to switch to a non-recoverable ctx model for these.
> > > > > > > > That's already what we do in i915 (opt-in, but all current umd use that
> > > > > > > > mode). So any hang/watchdog just kills the entire ctx and you don't have
> > > > > > > > to worry about userspace doing something funny with its ringbuffer.
> > > > > > > > Simplifies everything.
> > > > > > > >
> > > > > > > > Also ofc userspace fencing still disallowed, but since userspace would
> > > > > > > > queue up all writes to its ringbuffer through the drm/scheduler, we'd
> > > > > > > > handle dependencies through that still. Not great, but workable.
> > > > > > > >
> > > > > > > > Thinking about this, not even mapping the ringbuffer r/o is required, it's
> > > > > > > > just that we must queue things through the kernel to resolve dependencies
> > > > > > > > and everything without breaking dma_fence. If userspace lies, tdr will
> > > > > > > > shoot it and the kernel stops running that context entirely.
> > > > > > Thinking more about that approach I don't think that it will work correctly.
> > > > > >
> > > > > > See we not only need to write the fence as signal that an IB is submitted,
> > > > > > but also adjust a bunch of privileged hardware registers.
> > > > > >
> > > > > > When userspace could do that from its IBs as well then there is nothing
> > > > > > blocking it from reprogramming the page table base address for example.
> > > > > >
> > > > > > We could do those writes with the CPU as well, but that would be a huge
> > > > > > performance drop because of the additional latency.
> > > > > That's not what I'm suggesting. I'm suggesting you have the queue and
> > > > > everything in userspace, like in windows. Fences are exactly handled like
> > > > > on windows too. The difference is:
> > > > >
> > > > > - All new additions to the ringbuffer are done through a kernel ioctl
> > > > >    call, using the drm/scheduler to resolve dependencies.
> > > > >
> > > > > - Memory management is also done like today in that ioctl.
> > > > >
> > > > > - TDR makes sure that if userspace abuses the contract (which it can, but
> > > > >    it can do that already today because there's also no command parser to
> > > > >    e.g. stop gpu semaphores) the entire context is shot and terminally
> > > > >    killed. Userspace has to then set up a new one. This isn't how amdgpu
> > > > >    recovery works right now, but i915 supports it and I think it's also the
> > > > >    better model for userspace error recovery anyway.
> > > > >
> > > > > So from hw pov this will look _exactly_ like windows, except we never page
> > > > > fault.
> > > > >
> > > > >  From sw pov this will look _exactly_ like current kernel ringbuf model,
> > > > > with exactly same dma_fence semantics. If userspace lies, does something
> > > > > stupid or otherwise breaks the uapi contract, vm ptes stop invalid access
> > > > > and tdr kills it if it takes too long.
> > > > >
> > > > > Where do you need privileged IB writes or anything like that?
> > > >
> > > > For writing the fence value and setting up the priority and VM registers.
> > >
> > > I'm confused. How does this work on windows then with pure userspace
> > > submit? Windows userspace sets its priorities and vm registers itself from
> > > userspace?
> >
> > When the user allocates usermode queues, the kernel driver sets up a
> > queue descriptor in the kernel which defines the location of the queue
> > in memory, what priority it has, what page tables it should use, etc.
> > User mode can then start writing commands to its queues.  When they
> > are ready for the hardware to start executing them, they ring a
> > doorbell which signals the scheduler and it maps the queue descriptors
> > to HW queue slots and they start executing.  The user only has access
> > to its queues and any buffers it has mapped in its GPU virtual
> > address space.  While the queues are scheduled, the user can keep
> > submitting work to them and they will keep executing unless they get
> > preempted by the scheduler due to oversubscription or a priority call
> > or a request from the kernel driver to preempt, etc.
>
> Yeah, works like with our stuff.
>
> I don't see a problem tbh. It's slightly silly going the detour with the
> kernel ioctl, and it's annoying that you still have to use drm/scheduler
> to resolve dependencies instead of gpu semaphores and all that. But this
> only applies to legacy winsys mode, compute (e.g. vk without winsys) can
> use the full power. Just needs a flag or something when setting up the
> context.
>
> And best part is that from hw pov this really is indistinguishable from
> the full on userspace submit model.
>
> The thing where it gets annoying is when you use one of these new cpu
> instructions which do direct submit to hw and pass along the pasid id
> behind the scenes. That's truly something you can't intercept anymore in
> the kernel and fake the legacy dma_fence world.
>
> But what you're describing here sounds like bog standard stuff, and also
> pretty easy to keep working with exactly the current model.
>
> Ofc we'll want to push forward a more modern model that better suits
> modern gpus, but I don't see any hard requirement here from the hw side.

Adding a bit more detail on what I have in mind:

- memory management works like amdgpu does today, so all buffers are
pre-bound to the gpu vm, we keep the entire bo set marked as busy with
the bulk lru trick for every command submission.

- for the ringbuffer, userspace allocates a suitably sized bo for
ringbuffer, ring/tail/seqno and whatever else it needs

- userspace then asks the kernel to make that into a hw context, with
all the privileges set up. Doorbell will only be mapped into kernel
(hw can't tell the difference anyway), but if it happens to also be
visible to userspace that's no problem. We assume userspace can ring
the doorbell anytime it wants to.

- we do double memory management: One dma_fence works similar to the
amdkfd preempt fence, except it doesn't preempt but does anything
required to make the hw context unrunnable and take it out of the hw
scheduler entirely. This might involve unmapping the doorbell if
userspace has access to it.

- but we also do classic end-of-batch fences, so that implicit fencing
and all that keeps working. The "make hw ctx unrunnable" fence must
also wait for all of these pending submissions to complete.

- for the actual end-of-batchbuffer dma_fence it's almost all faked,
but with some checks in the kernel to keep up the guarantees. cs flow
is roughly

1. userspace directly writes into the userspace ringbuffer. It needs
to follow the kernel's rule for this if it wants things to work
correctly, but we assume evil userspace is allowed to write whatever
it wants to the ring, and change that whenever it wants. Userspace
does not update ring head/tail pointers.

2. cs ioctl just contains: a) head (the thing userspace advances, tail
is where the gpu consumes) pointer value to write to kick off this new
batch b) in-fences c) out-fence.

3. kernel drm/scheduler handles this like any other request and first
waits for the in-fences to all signal, then it executes the CS. For
execution it simply writes the provided head value into the ring's
metadata, and rings the doorbell. No checks. We assume userspace can
update the tail whenever it feels like, so checking the head value is
pointless anyway.

4. correctness depends entirely on the dma_fences
working as they should. For that we need some very strict rules on
when the end-of-batchbuffer dma_fence signals:
- the drm/scheduler must have marked the request as runnable already,
i.e. all dependencies are fulfilled. This is to prevent the fences
from signalling in the wrong order.
- the fence from the previous batch must have signalled already, again
to guarantee in-order signalling (even if userspace does something
stupid and reorders how things complete)
- the fence must never jump back to unsignalled, so the lockless
fastpath that just checks the seqno is a no-go

5. if drm/scheduler tdr decides it's taking too long we throw the
entire context away, forbid further command submission on it (through
the ioctl, userspace can keep writing to the ring whatever it wants)
and fail all in-flight buffers with an error. Non-evil userspace can
then recover by re-creating a new ringbuffer with everything.
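
To make the signalling rules from step 4 a bit more concrete, here is a
toy userspace model in Python. It is only a sketch: BatchFence, Timeline
and process() are made-up names for illustration, not kernel API, and
hw_seqno stands for whatever value the kernel reads back from memory that
userspace is free to scribble on.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BatchFence:
    """Hypothetical stand-in for an end-of-batch dma_fence."""
    seqno: int
    runnable: bool = False   # set once the scheduler resolved all in-fences
    signalled: bool = False  # latched, never goes back to False


@dataclass
class Timeline:
    """Per-context list of fences, kept in submission order."""
    fences: List[BatchFence] = field(default_factory=list)

    def add(self) -> BatchFence:
        fence = BatchFence(seqno=len(self.fences) + 1)
        self.fences.append(fence)
        return fence

    def process(self, hw_seqno: int) -> None:
        """Signal fences strictly in order, given an untrusted seqno."""
        for fence in self.fences:
            if fence.signalled:
                continue
            # Stop at the first fence that may not signal yet, so nothing
            # behind it can complete early.
            if not fence.runnable or hw_seqno < fence.seqno:
                break
            fence.signalled = True


tl = Timeline()
a, b = tl.add(), tl.add()
a.runnable = True
tl.process(hw_seqno=2)   # userspace claims both batches are done
print(a.signalled, b.signalled)
tl.process(hw_seqno=0)   # seqno jumps backwards, latched state is kept
print(a.signalled, b.signalled)
```

Waiters only ever look at the latched signalled flag, never at the raw
seqno, which is the reason the lockless seqno fastpath is ruled out above.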

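The submission side of the flow (steps 2, 3 and 5) can be sketched the
same way. Again this is a toy Python model under loud assumptions:
HwContext, Job, Fence, cs_ioctl(), run_scheduler() and tdr_timeout() are
hypothetical names that only mirror the description, they are not the
drm/scheduler or any driver interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Fence:
    signalled: bool = False
    error: Optional[str] = None

    def signal(self, error: Optional[str] = None) -> None:
        if not self.signalled:  # latch: only the first signal counts
            self.signalled = True
            self.error = error


@dataclass
class Job:
    head: int
    in_fences: List[Fence]
    out_fence: Fence
    kicked: bool = False


@dataclass
class HwContext:
    ring_head: int = 0       # ring metadata, written by the kernel side only
    doorbell_rings: int = 0
    banned: bool = False
    jobs: List[Job] = field(default_factory=list)

    def cs_ioctl(self, head: int, in_fences: List[Fence]) -> Fence:
        """Step 2: only a head value plus in-fences come in, an out-fence goes out."""
        if self.banned:
            raise PermissionError("context was killed by tdr")
        job = Job(head, list(in_fences), Fence())
        self.jobs.append(job)
        return job.out_fence

    def run_scheduler(self) -> None:
        """Step 3: wait for in-fences, then write head and ring the doorbell."""
        for job in self.jobs:
            if job.kicked:
                continue
            if not all(f.signalled for f in job.in_fences):
                break                 # keep submissions in order
            self.ring_head = job.head # no validation of the value
            self.doorbell_rings += 1
            job.kicked = True

    def tdr_timeout(self) -> None:
        """Step 5: ban the context and fail everything still in flight."""
        self.banned = True
        for job in self.jobs:
            job.out_fence.signal(error="context lost")


ctx = HwContext()
dep = Fence()
out = ctx.cs_ioctl(head=64, in_fences=[dep])
ctx.run_scheduler()          # dependency not signalled yet, nothing happens
dep.signal()
ctx.run_scheduler()          # now the head is written and the doorbell rung
print(ctx.ring_head, ctx.doorbell_rings)
```

Note that the model never inspects the ring contents or the head value,
matching the "No checks" part: all guarantees come from the fences and
from tdr killing the context.
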
I've pondered this now for a bit and I really can't spot the holes.
And I think it should all work, both for hw and kernel/legacy
dma_fence use-case.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch