[Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Fri Apr 30 09:07:58 UTC 2021

On 30.04.21 at 10:58, Daniel Vetter wrote:
> [SNIP]
>>> When the user allocates usermode queues, the kernel driver sets up a
>>> queue descriptor in the kernel which defines the location of the queue
>>> in memory, what priority it has, what page tables it should use, etc.
>>> User mode can then start writing commands to its queues.  When they
>>> are ready for the hardware to start executing them, they ring a
>>> doorbell which signals the scheduler and it maps the queue descriptors
>>> to HW queue slots and they start executing.  The user only has access
>>> to its queues and any buffers it has mapped in its GPU virtual
>>> address space.  While the queues are scheduled, the user can keep
>>> submitting work to them and they will keep executing unless they get
>>> preempted by the scheduler due to oversubscription or a priority call
>>> or a request from the kernel driver to preempt, etc.
>> Yeah, works like with our stuff.
>>
>> I don't see a problem tbh. It's slightly silly going the detour with the
>> kernel ioctl, and it's annoying that you still have to use drm/scheduler
>> to resolve dependencies instead of gpu semaphores and all that. But this
>> only applies to legacy winsys mode, compute (e.g. vk without winsys) can
>> use the full power. Just needs a flag or something when setting up the
>> context.
>>
>> And best part is that from hw pov this really is indistinguishable from
>> the full on userspace submit model.
>>
>> The thing where it gets annoying is when you use one of these new cpu
>> instructions which do direct submit to hw and pass along the pasid id
>> behind the scenes. That's truly something you can't intercept anymore in
>> the kernel and fake the legacy dma_fence world.
>>
>> But what you're describing here sounds like bog standard stuff, and also
>> pretty easy to keep working with exactly the current model.
>>
>> Ofc we'll want to push forward a more modern model that better suits
>> modern gpus, but I don't see any hard requirement here from the hw side.
> Adding a bit more detail on what I have in mind:
>
> - memory management works like amdgpu does today, so all buffers are
> pre-bound to the gpu vm, we keep the entire bo set marked as busy with
> the bulk lru trick for every command submission.
>
> - for the ringbuffer, userspace allocates a suitably sized bo for
> ringbuffer, ring/tail/seqno and whatever else it needs
>
> - userspace then asks the kernel to make that into a hw context, with
> all the privileges set up. Doorbell will only be mapped into kernel
> (hw can't tell the difference anyway), but if it happens to also be
> visible to userspace that's no problem. We assume userspace can ring
> the doorbell anytime it wants to.

This doesn't work in hardware. At the very least we need to set up a few
registers and memory locations from inside the VM which userspace
shouldn't have access to if we want the end-of-batch fence and the ring
buffer start to be reliable.

> - we do double memory management: One dma_fence works similar to the
> amdkfd preempt fence, except it doesn't preempt but does anything
> required to make the hw context unrunnable and take it out of the hw
> scheduler entirely. This might involve unmapping the doorbell if
> userspace has access to it.
>
> - but we also do classic end-of-batch fences, so that implicit fencing
> and all that keeps working. The "make hw ctx unrunnable" fence must
> also wait for all of these pending submissions to complete.

These two together don't work from the software side: you can have either
preemption fences or end-of-batch fences, but never both. Otherwise your
end-of-batch fences would have another dependency on the preemption
fences, which we currently can't express in the dma_fence framework.

In addition to that, it can't work from the hardware side because we have
a separation between engine and scheduler on the hardware side. So we
can't reliably get a signal inside the kernel that a batch has completed.

What we could do is to get this signal in userspace, e.g. userspace 
inserts the packets into the ring buffer and then the kernel can read 
the fence value and get the IV.

But this has the same problem as user fences because it requires the 
cooperation of userspace.

Just yesterday we had a meeting with the firmware developers to discuss
the possible options, and I now have even stronger doubts that this is
doable.

We either have user queues, where userspace writes the necessary commands
directly to the ring buffer, or we have kernel queues. A mixture of both
is supported by neither the hardware nor the firmware.

Regards,
Christian.

>
> - for the actual end-of-batchbuffer dma_fence it's almost all faked,
> but with some checks in the kernel to keep up the guarantees. cs flow
> is roughly
>
> 1. userspace directly writes into the userspace ringbuffer. It needs
> to follow the kernel's rule for this if it wants things to work
> correctly, but we assume evil userspace is allowed to write whatever
> it wants to the ring, and change that whenever it wants. Userspace
> does not update ring head/tail pointers.
>
> 2. cs ioctl just contains: a) head (the thing userspace advances, tail
> is where the gpu consumes) pointer value to write to kick off this new
> batch b) in-fences c) out-fence.
>
> 3. kernel drm/scheduler handles this like any other request and first
> waits for the in-fences to all signal, then it executes the CS. For
> execution it simply writes the provided head value into the ring's
> metadata, and rings the doorbells. No checks. We assume userspace can
> update the tail whenever it feels like, so checking the head value is
> pointless anyway.
>
> 4. the entire correctness is only depending upon the dma_fences
> working as they should. For that we need some very strict rules on
> when the end-of-batchbuffer dma_fence signals:
> - the drm/scheduler must have marked the request as runnable already,
> i.e. all dependencies are fulfilled. This is to prevent the fences
> from signalling in the wrong order.
> - the fence from the previous batch must have signalled already, again
> to guarantee in-order signalling (even if userspace does something
> stupid and reorders how things complete)
> - the fence must never jump back to unsignalled, so the lockless
> fastpath that just checks the seqno is a no-go
>
> 5. if drm/scheduler tdr decides it's taking too long we throw the
> entire context away, forbid further command submission on it (through
> the ioctl, userspace can keep writing to the ring whatever it wants)
> and fail all in-flight buffers with an error. Non-evil userspace can
> then recover by re-creating a new ringbuffer with everything.
>
> I've pondered this now for a bit and I really can't spot the holes.
> And I think it should all work, both for hw and kernel/legacy
> dma_fence use-case.
> -Daniel
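
For illustration only, the signalling rules from steps 3 to 5 of the quoted
proposal could be modelled roughly as below. This is a plain userspace C
sketch, not kernel code: struct eob_fence, struct hw_ctx and every function
name here are hypothetical and do not correspond to struct dma_fence or to
any drm/scheduler API. It assumes fences are polled and failed in submission
order and ignores seqno wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical end-of-batch fence; NOT the kernel's struct dma_fence. */
struct eob_fence {
	uint64_t seqno;          /* ring seqno that marks this batch as done */
	bool runnable;           /* scheduler has resolved all in-fences */
	bool signalled;          /* sticky: never cleared once set */
	int error;               /* non-zero if failed by timeout handling */
	struct eob_fence *prev;  /* previous batch on this context, or NULL */
};

/* Hypothetical context state; ring_seqno is untrusted (userspace/GPU). */
struct hw_ctx {
	uint64_t ring_seqno;
	uint64_t ring_head;      /* written by the kernel on behalf of the cs ioctl */
	bool banned;
};

static void eob_fence_init(struct eob_fence *f, uint64_t seqno,
			   struct eob_fence *prev)
{
	assert(f != NULL);
	f->seqno = seqno;
	f->runnable = false;
	f->signalled = false;
	f->error = 0;
	f->prev = prev;
}

/* Step 3: called once all in-fences have signalled. No checks on head. */
static bool cs_run(struct hw_ctx *ctx, struct eob_fence *f, uint64_t head)
{
	if (ctx->banned)
		return false;        /* step 5: no submission after a timeout */
	ctx->ring_head = head;       /* the doorbell would be rung here */
	f->runnable = true;
	return true;
}

/* Step 4: decide whether the fence may signal. */
static bool eob_fence_poll(const struct hw_ctx *ctx, struct eob_fence *f)
{
	if (f->signalled)
		return true;         /* never jump back to unsignalled */
	if (!f->runnable)
		return false;        /* dependencies not yet fulfilled */
	if (f->prev && !f->prev->signalled)
		return false;        /* keep in-order signalling */
	if (ctx->ring_seqno < f->seqno)
		return false;        /* batch not reported complete yet */
	f->signalled = true;
	return true;
}

/* Step 5: timeout; caller fails in-flight fences oldest first. */
static void ctx_timeout(struct hw_ctx *ctx, struct eob_fence *f)
{
	ctx->banned = true;
	if (!f->signalled) {
		f->error = -1;
		f->signalled = true;
	}
}
```

The point of the sketch is that the seqno in the ring is only one of several
conditions: a value written early or rolled back by userspace can neither
signal a fence ahead of its predecessor nor un-signal one. It does not address
the hardware and firmware objections raised in the reply above.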