[Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Christian König ckoenig.leichtzumerken at gmail.com
Wed Apr 28 13:37:49 UTC 2021


On 28.04.21 at 15:34, Daniel Vetter wrote:
> On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
>> On 28.04.21 at 14:26, Daniel Vetter wrote:
>>> On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
>>>> On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
>>>>> On 28.04.21 at 12:05, Daniel Vetter wrote:
>>>>>> On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
>>>>>>> On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact at emersion.fr> wrote:
>>>>>>>> On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach at pengutronix.de> wrote:
>>>>>>>>
>>>>>>>>>> Ok. So that would only make the following use cases broken for now:
>>>>>>>>>>
>>>>>>>>>> - amd render -> external gpu
>>>>>>>>>> - amd video encode -> network device
>>>>>>>>> FWIW, "only" breaking amd render -> external gpu will make us pretty
>>>>>>>>> unhappy
>>>>>>>> I concur. I have quite a few users with a multi-GPU setup involving
>>>>>>>> AMD hardware.
>>>>>>>>
>>>>>>>> Note, if this brokenness can't be avoided, I'd prefer to get a clear
>>>>>>>> error, and not bad results on screen because nothing is synchronized
>>>>>>>> anymore.
>>>>>>> It's an upcoming requirement for windows[1], so you are likely to
>>>>>>> start seeing this across all GPU vendors that support windows.  I
>>>>>>> think the timing depends on how long legacy hardware support
>>>>>>> sticks around for each vendor.
>>>>>> Yeah, but hw scheduling doesn't mean the hw has to be constructed so that
>>>>>> it can't isolate the ringbuffer at all.
>>>>>>
>>>>>> E.g. even if the hw loses the bit to put the ringbuffer outside of the
>>>>>> userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
>>>>>> pte flags. Otherwise the entire "share address space with cpu side,
>>>>>> seamlessly" thing is out of the window.
>>>>>>
>>>>>> And with that r/o bit on the ringbuffer you can once more force submit
>>>>>> through kernel space, and all the legacy dma_fence based stuff keeps
>>>>>> working. And we don't have to invent some horrendous userspace fence based
>>>>>> implicit sync mechanism in the kernel, but can instead do this transition
>>>>>> properly with drm_syncobj timeline explicit sync and protocol revving.
>>>>>>
>>>>>> At least I think you'd have to work extra hard to create a gpu which
>>>>>> cannot possibly be intercepted by the kernel, even when it's designed to
>>>>>> support userspace direct submit only.
>>>>>>
>>>>>> Or are your hw engineers more creative here and we're screwed?
>>>>> The upcoming hardware generation will have this hardware scheduler as a
>>>>> must-have, but there are certain ways we can still stick to the old
>>>>> approach:
>>>>>
>>>>> 1. The new hardware scheduler currently still supports kernel queues, which
>>>>> are essentially the same as the old hardware ring buffer.
>>>>>
>>>>> 2. Mapping the top level ring buffer into the VM at least partially solves
>>>>> the problem. This way you can't manipulate the ring buffer content, but the
>>>>> location for the fence must still be writeable.
>>>> Yeah, allowing userspace to lie about completion fences in this model is
>>>> ok. I haven't thought through the full consequences of that, but I
>>>> think it's not any worse than userspace lying about which buffers/addresses
>>>> it uses in the current model - we rely on hw vm ptes to catch that stuff.
>>>>
>>>> Also it might be good to switch to a non-recoverable ctx model for these.
>>>> That's already what we do in i915 (opt-in, but all current umds use that
>>>> mode). So any hang/watchdog just kills the entire ctx and you don't have
>>>> to worry about userspace doing something funny with its ringbuffer.
>>>> Simplifies everything.
>>>>
>>>> Also ofc userspace fencing is still disallowed, but since userspace would
>>>> queue up all writes to its ringbuffer through the drm/scheduler, we'd
>>>> still handle dependencies through that. Not great, but workable.
>>>>
>>>> Thinking about this, not even mapping the ringbuffer r/o is required; it's
>>>> just that we must queue things through the kernel to resolve dependencies
>>>> and everything without breaking dma_fence. If userspace lies, tdr will
>>>> shoot it and the kernel stops running that context entirely.
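
(For reference, the non-recoverable ctx model mentioned here boils down to a
sticky per-context flag that TDR sets and the submit path checks, roughly like
the sketch below. The names are invented for illustration; this is not the
i915 or amdgpu implementation.)

    #include <errno.h>
    #include <stdbool.h>

    /* Non-recoverable context model, reduced to its core: a sticky
     * "banned" flag that the timeout handler sets once and the submit
     * path checks before accepting any new work. */
    struct sketch_gpu_ctx {
        bool banned;   /* set by TDR on a hang, never cleared again */
    };

    /* TDR / hang-check path: kill the context for good.  A real driver
     * would also complete all of the context's pending fences with an
     * error here so nothing waits forever. */
    static void sketch_ctx_ban(struct sketch_gpu_ctx *ctx)
    {
        ctx->banned = true;
    }

    /* Submit ioctl path: a banned context can never submit again,
     * userspace has to create a fresh context and rebuild its state. */
    static int sketch_ctx_begin_submit(const struct sketch_gpu_ctx *ctx)
    {
        return ctx->banned ? -EIO : 0;
    }
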
>> Thinking more about that approach, I don't think it will work correctly.
>>
>> See, we not only need to write the fence as a signal that an IB is submitted,
>> but also adjust a bunch of privileged hardware registers.
>>
>> If userspace could do that from its IBs as well, then there would be nothing
>> blocking it from reprogramming the page table base address, for example.
>>
>> We could do those writes with the CPU as well, but that would be a huge
>> performance drop because of the additional latency.
> That's not what I'm suggesting. I'm suggesting you have the queue and
> everything in userspace, like on windows. Fences are handled exactly like
> on windows too. The difference is:
>
> - All new additions to the ringbuffer are done through a kernel ioctl
>    call, using the drm/scheduler to resolve dependencies.
>
> - Memory management is also done like today in that ioctl.
>
> - TDR makes sure that if userspace abuses the contract (which it can, but
>    it can do that already today because there's also no command parser to
>    e.g. stop gpu semaphores) the entire context is shot and terminally
>    killed. Userspace has to then set up a new one. This isn't how amdgpu
>    recovery works right now, but i915 supports it and I think it's also the
>    better model for userspace error recovery anyway.
>
> So from hw pov this will look _exactly_ like windows, except we never page
> fault.
>
> From sw pov this will look _exactly_ like the current kernel ringbuf model,
> with exactly the same dma_fence semantics. If userspace lies, does something
> stupid or otherwise breaks the uapi contract, vm ptes stop invalid access
> and tdr kills it if it takes too long.
>
> Where do you need privileged IB writes or anything like that?

For writing the fence value and setting up the priority and VM registers.
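
To make that a bit more concrete: in the current kernel ring model those writes
are emitted by the kernel around the userspace IB, roughly like the sketch
below. The packet and struct names are invented for illustration, this is not
actual amdgpu code; the point is just that the VM base, priority and fence
write come from the kernel and only the IB itself comes from userspace.

    #include <stdint.h>

    /* Invented packet opcodes, only so the sketch is self-contained. */
    enum sketch_pkt {
        SKETCH_PKT_SET_VM_BASE,
        SKETCH_PKT_SET_PRIORITY,
        SKETCH_PKT_INDIRECT_BUFFER,
        SKETCH_PKT_FENCE_WRITE,
    };

    struct sketch_ring {
        uint32_t *buf;
        uint32_t  wptr;
    };

    static void sketch_ring_emit(struct sketch_ring *ring, uint32_t dw)
    {
        ring->buf[ring->wptr++] = dw;
    }

    struct sketch_ctx {
        uint64_t vm_pt_base;  /* page table base, kernel-owned */
        uint32_t priority;    /* queue priority, kernel-owned */
        uint64_t next_seqno;  /* fence seqnos handed out by the kernel */
    };

    /* Called from the submit ioctl after dependencies are resolved
     * (e.g. by drm/scheduler): the kernel emits the privileged state
     * and the fence write, userspace only contributes the address and
     * size of the IB it built in its own VM. */
    static uint64_t sketch_submit(struct sketch_ctx *ctx,
                                  struct sketch_ring *kring,
                                  uint64_t user_ib_addr,
                                  uint32_t user_ib_size_dw)
    {
        uint64_t seqno = ++ctx->next_seqno;

        /* privileged setup, never reachable from a user-mapped ring */
        sketch_ring_emit(kring, SKETCH_PKT_SET_VM_BASE);
        sketch_ring_emit(kring, (uint32_t)(ctx->vm_pt_base >> 12));
        sketch_ring_emit(kring, SKETCH_PKT_SET_PRIORITY);
        sketch_ring_emit(kring, ctx->priority);

        /* chain into the userspace-built IB */
        sketch_ring_emit(kring, SKETCH_PKT_INDIRECT_BUFFER);
        sketch_ring_emit(kring, (uint32_t)(user_ib_addr >> 8));
        sketch_ring_emit(kring, user_ib_size_dw);

        /* fence write that lets dma_fence signal completion later */
        sketch_ring_emit(kring, SKETCH_PKT_FENCE_WRITE);
        sketch_ring_emit(kring, (uint32_t)seqno);

        return seqno;
    }

With a user-mapped top level ring those writes either have to stay on a
kernel-owned queue or be done by the CPU, which is where the latency concern
above comes from.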

Christian.

>
> Ofc the kernel needs to have some safety checks, since the dma_fence timeline
> relies on the userspace ringbuffer never going backwards or unsignaling, but
> that's kinda just more compat cruft to make the kernel/dma_fence path
> work.
>
> Cheers, Daniel
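
The safety check mentioned there would boil down to something like the sketch
below (invented names, not actual kernel code): the kernel keeps a monotonic
shadow copy of the seqno it reads from the user-writable fence location, so the
dma_fence timeline can never go backwards or unsignal even if userspace
scribbles over it.

    #include <stdbool.h>
    #include <stdint.h>

    /* The dma_fence timeline must never go backwards even though the
     * fence location lives in memory userspace can write to, so the
     * kernel keeps its own monotonic shadow copy of the seqno. */
    struct sketch_timeline {
        const volatile uint64_t *user_fence; /* fence location in the user-mapped ring */
        uint64_t shadow_seqno;               /* kernel-owned, only ever advances */
    };

    /* Called from the kernel (irq handler / fence polling) to decide
     * how far the timeline has really advanced. */
    static uint64_t sketch_timeline_update(struct sketch_timeline *tl)
    {
        uint64_t seen = *tl->user_fence;

        /* If userspace wrote a smaller value, ignore it: signaled
         * fences must stay signaled. */
        if (seen > tl->shadow_seqno)
            tl->shadow_seqno = seen;

        return tl->shadow_seqno;
    }

    static bool sketch_fence_signaled(struct sketch_timeline *tl, uint64_t seqno)
    {
        return sketch_timeline_update(tl) >= seqno;
    }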


