[Mesa-dev] Linux Graphics Next: Userspace submission update

Fri Jun 4 11:27:15 UTC 2021

On 04.06.21 at 10:57, Daniel Vetter wrote:
> On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:
>> On 02.06.21 at 21:19, Daniel Vetter wrote:
>>> On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
>>>> On 02.06.21 at 20:48, Daniel Vetter wrote:
>>>>> On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
>>>>>> On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo at gmail.com> wrote:
>>>>>>
>>>>>>> Yes, we can't break anything because we don't want to complicate things
>>>>>>> for us. It's pretty much all NAK'd already. We are trying to gather more
>>>>>>> knowledge and then make better decisions.
>>>>>>>
>>>>>>> The idea we are considering is that we'll expose memory-based sync objects
>>>>>>> to userspace for read only, and the kernel or hw will strictly control the
>>>>>>> memory writes to those sync objects. The hole in that idea is that
>>>>>>> userspace can decide not to signal a job, so even if userspace can't
>>>>>>> overwrite memory-based sync object states arbitrarily, it can still decide
>>>>>>> not to signal them, and then a future fence is born.
>>>>>>>
>>>>>> This would actually be treated as a GPU hang caused by that context, so it
>>>>>> should be fine.
>>>>> This is practically what I proposed already, except you're not doing it with
>>>>> dma_fence. And on the memory fence side this also doesn't actually give
>>>>> what you want for that compute model.
>>>>>
>>>>> This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
>>>>> in the kernel to hide these not-dma_fence-but-almost, and still pain to
>>>>> actually drive the hardware like it should be for compute or direct
>>>>> display.
>>>>>
>>>>> Also maybe I've missed it, but I didn't see any replies to my suggestion
>>>>> how to fake the entire dma_fence stuff on top of new hw. Would be
>>>>> interesting to know what doesn't work there instead of amd folks going off
>>>>> into internal again and then coming back with another rfc from out of
>>>>> nowhere :-)
>>>> Well to be honest I would just push back on our hardware/firmware guys that
>>>> we need to keep kernel queues forever before going down that route.
>>> I looked again, and you said the model won't work because preemption is way
>>> too slow, even when the context is idle.
>>>
>>> I guess at that point I got maybe too fed up and just figured "not my
>>> problem", but if preempt is too slow as the unload fence, you can do it
>>> with pte removal and tlb shootdown too (that is hopefully not too slow,
>>> otherwise your hw is just garbage and won't even be fast for direct submit
>>> compute workloads).
>> Have you seen that one here:
>> https://www.spinics.net/lists/amd-gfx/msg63101.html :)
>>
>> I've rejected it because I think polling for 6 seconds on a TLB flush which
>> can block interrupts as well is just madness.
> Hm but I thought you had like 2 tlb flush modes, the shitty one (with
> retrying page faults) and the not so shitty one?

Yeah, we call this the lightweight and the heavyweight tlb flush.

The lightweight one can be used when you are sure that you don't have any of 
the PTEs currently in flight in the 3D/DMA engine and you just need to 
invalidate the TLB.

The heavyweight must be used when you need to invalidate the TLB *AND* 
make sure that no concurrent operation moves new stuff into the TLB.

The problem is for this use case we have to use the heavyweight one.
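
As a minimal sketch of the two rules above (a toy in Python, not driver code; 
`TlbFlush`, `pick_tlb_flush` and both parameter names are made up for 
illustration and do not exist in amdgpu):

```python
from enum import Enum


class TlbFlush(Enum):
    LIGHTWEIGHT = "lightweight"
    HEAVYWEIGHT = "heavyweight"


def pick_tlb_flush(ptes_in_flight: bool, must_block_refill: bool) -> TlbFlush:
    """Hypothetical helper: pick the flush mode from the two rules above."""
    # Lightweight is only safe when none of the PTEs are in flight in the
    # 3D/DMA engine and all we need is the invalidation itself.
    if ptes_in_flight or must_block_refill:
        # Heavyweight also guarantees that no concurrent operation moves
        # new entries into the TLB while we invalidate.
        return TlbFlush.HEAVYWEIGHT
    return TlbFlush.LIGHTWEIGHT
```

Using pte clearing + tlb shootdown as the unload fence means the engine may 
still be running, so both conditions are true and we end up on the 
heavyweight path.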

> But yeah at that point I think you just have to bite one of the bullets.

Yeah, completely agree. We can choose which way we want to die, but it's 
certainly not going to be nice whatever we do.

>
> The thing is with hmm/userspace memory fence model this will be even
> worse, because you will _have_ to do this tlb flush deep down in core mm
> functions, so this is going to be userptr, but worse.
>
> With the dma_resv/dma_fence bo memory management model you can at least
> wrap that tlb flush into a dma_fence and push the waiting/pinging onto a
> separate thread or something like that. If the hw really is that slow.
>
> Somewhat aside: Personally I think that sriov needs to move over to the
> compute model, i.e. indefinite timeouts, no tdr, because everything takes
> too long. At least looking around sriov timeouts tend to be 10x bare
> metal, across the board.
>
> But for stuff like cloud gaming that's serious amounts of heavy lifting
> since it brings us right back to "the entire linux/android 3d stack is built
> on top of dma_fence right now".
>
>>> The only thing that you need to do when you use pte clearing + tlb
>>> shootdown instead of preemption as the unload fence for buffers that get
>>> moved is that if you get any gpu page fault, you don't serve that, but
>>> instead treat it as a tdr and shoot the context permanently.
>>>
>>> So summarizing the model I proposed:
>>>
>>> - you allow userspace to directly write into the ringbuffer, and also
>>>     write the fences directly
>>>
>>> - actual submit is done by the kernel, using drm/scheduler. The kernel
>>>     blindly trusts userspace to set up everything else, and even just wraps
>>>     dma_fences around the userspace memory fences.
>>>
>>> - the only check is tdr. If a fence doesn't complete and tdr fires, a) the
>>>     kernel shoots the entire context and b) userspace recovers by setting up a
>>>     new ringbuffer
>>>
>>> - memory management is done using ttm only, you still need to supply the
>>>     buffer list (ofc that list includes the always present ones, so CS will
>>>     only get the list of special buffers like today). If your hw can't trun
>>>     gpu page faults and you ever get one we pull up the same old solution:
>>>     Kernel shoots the entire context.
>>>
>>>     The important thing is that from the gpu pov memory management works
>>>     exactly like compute workload with direct submit, except that you just
>>>     terminate the context on _any_ page fault, instead of only those that go
>>>     somewhere where there's really no mapping and repair the others.
>>>
>>>     Also I guess from reading the old thread this means you'd disable page
>>>     fault retry because that is apparently also way too slow for anything.
>>>
>>> - memory management uses an unload fence. That unload fence waits for all
>>>     userspace memory fences (represented as dma_fence) to complete, with
>>>     maybe some fudge to busy-spin until we've reached the actual end of the
>>>     ringbuffer (maybe you have an IB tail there after the memory fence write,
>>>     we have that on intel hw), and it waits for the memory to get
>>>     "unloaded". This is either preemption, or pte clearing + tlb shootdown,
>>>     or whatever else your hw provides which is a) used for dynamic memory
>>>     management b) fast enough for actual memory management.
>>>
>>> - any time a context dies we force-complete all its pending fences,
>>>     in-order ofc
>>>
>>> So from hw pov this looks 99% like direct userspace submit, with the exact
>>> same mappings, command sequences and everything else. The only difference
>>> is that the ringbuffer head/tail updates happen from drm/scheduler, instead
>>> of directly from userspace.
>>>
>>> None of this stuff needs funny tricks where the kernel controls the
>>> writes to memory fences, or where you need kernel ringbuffers, or anything
>>> like this. Userspace is allowed to do anything stupid, the rules are
>>> guaranteed with:
>>>
>>> - we rely on the hw isolation features to work, but _exactly_ like compute
>>>     direct submit would too
>>>
>>> - dying on any page fault captures memory management issues
>>>
>>> - dying (without kernel recovery, this is up to userspace if it cares) on
>>>     any tdr makes sure fences complete still
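
The fence rules of the model summarized above can be sketched as a toy 
simulation. This is Python rather than kernel code, and `Fence`, `Context` 
and all method names are hypothetical; nothing here is the dma_fence or 
drm/scheduler API. It only shows the ordering guarantees: fences complete 
in order, and a tdr or any gpu page fault kills the context and 
force-completes everything still pending.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class Fence:
    seqno: int
    signaled: bool = False
    error: bool = False  # True when the fence was force-completed


@dataclass
class Context:
    pending: Deque[Fence] = field(default_factory=deque)
    completed: List[int] = field(default_factory=list)
    next_seqno: int = 1
    dead: bool = False

    def submit(self) -> Fence:
        # Stand-in for the kernel wrapping a fence around a userspace
        # memory fence at submit time.
        if self.dead:
            raise RuntimeError("context is dead, set up a new ringbuffer")
        fence = Fence(self.next_seqno)
        self.next_seqno += 1
        self.pending.append(fence)
        return fence

    def memory_fence_written(self, seqno: int) -> None:
        # The memory fence reached seqno: complete everything up to it,
        # oldest first.
        while self.pending and self.pending[0].seqno <= seqno:
            self._complete(self.pending.popleft(), error=False)

    def on_tdr(self) -> None:
        # A fence did not complete in time: shoot the whole context.
        self._kill()

    def on_gpu_page_fault(self) -> None:
        # Any page fault is treated exactly like a tdr in this model.
        self._kill()

    def _kill(self) -> None:
        self.dead = True
        # Force-complete all pending fences, in order.
        while self.pending:
            self._complete(self.pending.popleft(), error=True)

    def _complete(self, fence: Fence, error: bool) -> None:
        fence.signaled = True
        fence.error = error
        self.completed.append(fence.seqno)
```

After a kill no waiter is left hanging: every fence is signaled (with an 
error flag), and further submits fail until userspace creates a fresh 
context.
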
>>>
>>>> That syncfile and all that Android stuff isn't working out of the box with
>>>> the new shiny user queue submission model (which in turn is mostly because
>>>> of Windows) already raised some eyebrows here.
>>> I think if you really want to make sure the current linux stack doesn't
>>> break the _only_ option you have is provide a ctx mode that allows
>>> dma_fence and drm/scheduler to be used like today.
>> Yeah, but I still can just tell our hw/fw guys that we really really need to
>> keep kernel queues or the whole Linux/Android infrastructure needs to get a
>> compatibility layer like you describe above.
>>
>>> For everything else it sounds like you're a few years too late, because even
>>> just huge kernel changes won't happen in time. Much less rewriting
>>> userspace protocols.
>> Seconded, question is rather if we are going to start migrating at some
>> point or if we should keep pushing on our hw/fw guys.
> So from what I'm hearing other hw might gain the sw compat layer too. Plus
> I'm hoping that with the sw compat layer it'd be easier to smooth over
> userspace to the new model (because there will be a long time where we
> have to support both, maybe even with a runtime switch from userspace
> memory fences to dma_fence kernel stuff).
>
> But in the end it's up to you what makes more sense between sw work and
> hw/fw work involved.

I'm currently toying a bit with the idea of implementing 
the HW scheduling on the CPU.

The only obstacle that I can really see is that we might need to unmap a 
page in the CPU page table when a queue becomes idle.

But apart from that it would give us the required functionality, just 
without the hardware scheduler.

Christian.

> -Daniel