[Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Fri Apr 30 09:35:02 UTC 2021

On Fri, Apr 30, 2021 at 11:08 AM Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
>
> Am 30.04.21 um 10:58 schrieb Daniel Vetter:
> > [SNIP]
> >>> When the user allocates usermode queues, the kernel driver sets up a
> >>> queue descriptor in the kernel which defines the location of the queue
> >>> in memory, what priority it has, what page tables it should use, etc.
> >>> User mode can then start writing commands to its queues.  When they
> >>> are ready for the hardware to start executing them, they ring a
> >>> doorbell which signals the scheduler and it maps the queue descriptors
> >>> to HW queue slots and they start executing.  The user only has access
> >>> to its queues and any buffers it has mapped in its GPU virtual
> >>> address space.  While the queues are scheduled, the user can keep
> >>> submitting work to them and they will keep executing unless they get
> >>> preempted by the scheduler due to oversubscription or a priority call
> >>> or a request from the kernel driver to preempt, etc.
> >> Yeah, works like with our stuff.
> >>
> >> I don't see a problem tbh. It's slightly silly going the detour with the
> >> kernel ioctl, and it's annoying that you still have to use drm/scheduler
> >> to resolve dependencies instead of gpu semaphores and all that. But this
> >> only applies to legacy winsys mode, compute (e.g. vk without winsys) can
> >> use the full power. Just needs a flag or something when setting up the
> >> context.
> >>
> >> And best part is that from hw pov this really is indistinguishable from
> >> the full on userspace submit model.
> >>
> >> The thing where it gets annoying is when you use one of these new cpu
> >> instructions which do direct submit to hw and pass along the pasid id
> >> behind the scenes. That's truly something you can't intercept anymore in
> >> the kernel and fake the legacy dma_fence world.
> >>
> >> But what you're describing here sounds like bog standard stuff, and also
> >> pretty easy to keep working with exactly the current model.
> >>
> >> Ofc we'll want to push forward a more modern model that better suits
> >> modern gpus, but I don't see any hard requirement here from the hw side.
> > Adding a bit more detail on what I have in mind:
> >
> > - memory management works like amdgpu does today, so all buffers are
> > pre-bound to the gpu vm, we keep the entire bo set marked as busy with
> > the bulk lru trick for every command submission.
> >
> > - for the ringbuffer, userspace allocates a suitably sized bo for
> > ringbuffer, ring/tail/seqno and whatever else it needs
> >
> > - userspace then asks the kernel to make that into a hw context, with
> > all the privileges set up. Doorbell will only be mapped into kernel
> > (hw can't tell the difference anyway), but if it happens to also be
> > visible to userspace that's no problem. We assume userspace can ring
> > the doorbell anytime it wants to.
>
> This doesn't work in hardware. We at least need to set up a few registers
> and memory locations from inside the VM which userspace shouldn't have
> access to when we want the end of batch fence and ring buffer start to
> be reliable.

The thing is, we don't care whether it's reliable or not. Userspace is
allowed to lie, not signal, signal the wrong thing, out of order,
everything.

The design assumes all this is possible.

So unless you can't signal at all from userspace, this works. And for
the "can't signal at all" it just means something needs to do a cpu
busy wait and burn down lots of cpu time. I hope that's not your hw
design :-)

> > - we do double memory management: One dma_fence works similar to the
> > amdkfd preempt fence, except it doesn't preempt but does anything
> > required to make the hw context unrunnable and take it out of the hw
> > scheduler entirely. This might involve unmapping the doorbell if
> > userspace has access to it.
> >
> > - but we also do classic end-of-batch fences, so that implicit fencing
> > and all that keeps working. The "make hw ctx unrunnable" fence must
> > also wait for all of these pending submissions to complete.
>
> This together doesn't work from the software side, e.g. you can either
> have preemption fences or end of batch fences but never both or your end
> of batch fences would have another dependency on the preemption fences
> which we currently can't express in the dma_fence framework.

It's _not_ a preempt fence. It's a ctx unload fence. Not the same
thing. Normal preempt fence would indeed fail.

> In addition to that, it can't work from the hardware side because we have
> a separation between engine and scheduler on the hardware side. So we
> can't reliably get a signal inside the kernel that a batch has completed.
>
> What we could do is to get this signal in userspace, e.g. userspace
> inserts the packets into the ring buffer and then the kernel can read
> the fence value and get the IV.
>
> But this has the same problem as user fences because it requires the
> cooperation of userspace.

Nope. Read the thing again, I'm assuming that userspace lies. The
kernel's dma_fence code compensates for that.

Also note that userspace can already lie to its heart's content with
the current IB stuff. You are already allowed to hang the gpu, submit
utter garbage, render to the wrong buffer or just scribble all over
your own IB. This isn't a new problem.

> We just yesterday had a meeting with the firmware developers to discuss
> the possible options and I now have even stronger doubts that this is
> doable.
>
> We either have user queues where userspace writes the necessary commands
> directly to the ring buffer or we have kernel queues. A mixture of both
> is supported in neither the hardware nor the firmware.

Yup. Please read my thing again carefully, I'm stating that userspace
writes all the necessary commands directly into the ringbuffer.

The kernel writes _nothing_ into the ringbuffer. The only thing it
does is update the head pointer to unblock that next section of the
ring, when drm/scheduler thinks that's ok to do.
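
As a rough illustration of that split (a toy model only, not driver code;
UserRing, Submission and try_run are made-up names, and Fence merely stands
in for a dma_fence), the kernel side could be sketched like this:

```python
from dataclasses import dataclass, field


@dataclass
class Fence:
    """Stand-in for a dma_fence: a one-way signalled flag."""
    signalled: bool = False


@dataclass
class UserRing:
    """Hypothetical ring: userspace owns `commands`, the kernel owns `head`."""
    commands: list = field(default_factory=list)
    head: int = 0        # published by the kernel only
    doorbells: int = 0   # number of doorbell rings issued by the kernel


@dataclass
class Submission:
    head: int            # head value passed in by the CS ioctl, unchecked
    in_fences: list
    out_fence: Fence = field(default_factory=Fence)


def try_run(ring: UserRing, job: Submission) -> bool:
    """Publish the head once every in-fence has signalled.

    The head value is deliberately not validated, matching the
    "no checks" assumption in this thread.
    """
    if not all(f.signalled for f in job.in_fences):
        return False
    ring.head = job.head
    ring.doorbells += 1
    return True


dep = Fence()
ring = UserRing()
ring.commands += ["draw", "write_seqno"]  # userspace writes the ring directly
job = Submission(head=2, in_fences=[dep])

blocked = try_run(ring, job)  # dependency still pending, head stays at 0
dep.signalled = True
ran = try_run(ring, job)      # dependency met, head published, doorbell rung
```

The kernel never touches `commands` here; its only levers are the head value
and the doorbell.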

This works, you're just thinking of something completely different than
what I wrote down :-)

Cheers, Daniel

>
> Regards,
> Christian.
>
> >
> > - for the actual end-of-batchbuffer dma_fence it's almost all faked,
> > but with some checks in the kernel to keep up the guarantees. cs flow
> > is roughly
> >
> > 1. userspace directly writes into the userspace ringbuffer. It needs
> > to follow the kernel's rule for this if it wants things to work
> > correctly, but we assume evil userspace is allowed to write whatever
> > it wants to the ring, and change that whenever it wants. Userspace
> > does not update ring head/tail pointers.
> >
> > 2. cs ioctl just contains: a) head (the thing userspace advances, tail
> > is where the gpu consumes) pointer value to write to kick off this new
> > batch b) in-fences c) out-fence.
> >
> > 3. kernel drm/scheduler handles this like any other request and first
> > waits for the in-fences to all signal, then it executes the CS. For
> > execution it simply writes the provided head value into the ring's
> > metadata, and rings the doorbells. No checks. We assume userspace can
> > update the tail whenever it feels like, so checking the head value is
> > pointless anyway.
> >
> > 4. the entire correctness is only depending upon the dma_fences
> > working as they should. For that we need some very strict rules on
> > when the end-of-batchbuffer dma_fence signals:
> > - the drm/scheduler must have marked the request as runnable already,
> > i.e. all dependencies are fulfilled. This is to prevent the fences
> > from signalling in the wrong order.
> > - the fence from the previous batch must have signalled already, again
> > to guarantee in-order signalling (even if userspace does something
> > stupid and reorders how things complete)
> > - the fence must never jump back to unsignalled, so the lockless
> > fastpath that just checks the seqno is a no-go
> >
> > 5. if drm/scheduler tdr decides it's taking too long we throw the
> > entire context away, forbid further command submission on it (through
> > the ioctl, userspace can keep writing to the ring whatever it wants)
> > and fail all in-flight buffers with an error. Non-evil userspace can
> > then recover by re-creating a new ringbuffer with everything.
> >
> > I've pondered this now for a bit and I really can't spot the holes.
> > And I think it should all work, both for hw and kernel/legacy
> > dma_fence use-case.
> > -Daniel
>
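
The signalling rules in step 4 and the tdr path in step 5 quoted above can
be modelled in a few lines. This is a sketch under assumptions, not kernel
code: BatchFence, poll and fail are hypothetical names, and ring_seqno
stands for whatever value userspace happened to write, which may be a lie.

```python
class BatchFence:
    """Hypothetical end-of-batch fence with a sticky signalled state."""

    def __init__(self, seqno, prev=None):
        self.seqno = seqno
        self.prev = prev        # fence of the previous batch, if any
        self.runnable = False   # set once the scheduler resolved all deps
        self.signalled = False  # sticky: never reset to False
        self.error = None

    def poll(self, ring_seqno):
        """Re-evaluate the fence against the userspace-written seqno."""
        if self.signalled:
            return True         # never jump back, even if the seqno does
        if not self.runnable:
            return False        # request not yet marked runnable
        if self.prev is not None and not self.prev.signalled:
            return False        # previous batch must signal first
        if ring_seqno >= self.seqno:
            self.signalled = True
        return self.signalled

    def fail(self, error):
        """tdr path: complete the fence with an error."""
        self.error = error
        self.signalled = True


a = BatchFence(1)
b = BatchFence(2, prev=a)
a.runnable = True

early = b.poll(2)         # userspace wrote seqno 2 before b was runnable
b.runnable = True
out_of_order = b.poll(2)  # a has not signalled yet, so b must wait
a.poll(2)
b.poll(2)
sticky = a.poll(0)        # seqno scribbled back to 0, a stays signalled

c = BatchFence(3, prev=b)
c.runnable = True
stuck = c.poll(2)         # batch never completes
c.fail("timeout")         # tdr gives up and fails the fence
```

The sticky flag is what rules out the lockless fastpath that only compares
seqnos: once a fence has signalled, nothing userspace writes can undo it.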

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch