<div dir="auto"><div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu., Jun. 3, 2021, 06:03 Daniel Vetter, <<a href="mailto:daniel@ffwll.ch">daniel@ffwll.ch</a>> wrote: </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote: > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter <<a href="mailto:daniel@ffwll.ch" target="_blank" rel="noreferrer">daniel@ffwll.ch</a>> wrote: > > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote: > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <<a href="mailto:daniel@ffwll.ch" target="_blank" rel="noreferrer">daniel@ffwll.ch</a>> wrote: > > > > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote: > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <<a href="mailto:maraeo@gmail.com" target="_blank" rel="noreferrer">maraeo@gmail.com</a>> wrote: > > > > > > > > > > > Yes, we can't break anything because we don't want to complicate > > things > > > > > > for us. It's pretty much all NAK'd already. We are trying to gather > > > > more > > > > > > knowledge and then make better decisions. > > > > > > > > > > > > The idea we are considering is that we'll expose memory-based sync > > > > objects > > > > > > to userspace for read only, and the kernel or hw will strictly > > control > > > > the > > > > > > memory writes to those sync objects. The hole in that idea is that > > > > > > userspace can decide not to signal a job, so even if userspace > > can't > > > > > > overwrite memory-based sync object states arbitrarily, it can still > > > > decide > > > > > > not to signal them, and then a future fence is born. > > > > > > > > > > > > > > > > This would actually be treated as a GPU hang caused by that context, > > so > > > > it > > > > > should be fine. > > > > > > > > This is practically what I proposed already, except your not doing it > > with > > > > dma_fence. And on the memory fence side this also doesn't actually give > > > > what you want for that compute model. > > > > > > > > This seems like a bit a worst of both worlds approach to me? Tons of > > work > > > > in the kernel to hide these not-dma_fence-but-almost, and still pain to > > > > actually drive the hardware like it should be for compute or direct > > > > display. > > > > > > > > Also maybe I've missed it, but I didn't see any replies to my > > suggestion > > > > how to fake the entire dma_fence stuff on top of new hw. Would be > > > > interesting to know what doesn't work there instead of amd folks going > > of > > > > into internal again and then coming back with another rfc from out of > > > > nowhere :-) > > > > > > > > > > Going internal again is probably a good idea to spare you the long > > > discussions and not waste your time, but we haven't talked about the > > > dma_fence stuff internally other than acknowledging that it can be > > solved. > > > > > > The compute use case already uses the hw as-is with no inter-process > > > sharing, which mostly keeps the kernel out of the picture. It uses > > glFinish > > > to sync with GL. > > > > > > The gfx use case needs new hardware logic to support implicit and > > explicit > > > sync. When we propose a solution, it's usually torn apart the next day by > > > ourselves. > > > > > > Since we are talking about next hw or next next hw, preemption should be > > > better. > > > > > > user queue = user-mapped ring buffer > > > > > > For implicit sync, we will only let userspace lock access to a buffer > > via a > > > user queue, which waits for the per-buffer sequence counter in memory to > > be > > > >= the number assigned by the kernel, and later unlock the access with > > > another command, which increments the per-buffer sequence counter in > > memory > > > with atomic_inc regardless of the number assigned by the kernel. The > > kernel > > > counter and the counter in memory can be out-of-sync, and I'll explain > > why > > > it's OK. If a process increments the kernel counter but not the memory > > > counter, that's its problem and it's the same as a GPU hang caused by > > that > > > process. If a process increments the memory counter but not the kernel > > > counter, the ">=" condition alongside atomic_inc guarantee that > > signaling n > > > will signal n+1, so it will never deadlock but also it will effectively > > > disable synchronization. This method of disabling synchronization is > > > similar to a process corrupting the buffer, which should be fine. Can you > > > find any flaw in it? I can't find any. > > > > Hm maybe I misunderstood what exactly you wanted to do earlier. That kind > > of "we let userspace free-wheel whatever it wants, kernel ensures > > correctness of the resulting chain of dma_fence with reset the entire > > context" is what I proposed too. > > > > Like you say, userspace is allowed to render garbage already. > > > > > The explicit submit can be done by userspace (if there is no > > > synchronization), but we plan to use the kernel to do it for implicit > > sync. > > > Essentially, the kernel will receive a buffer list and addresses of wait > > > commands in the user queue. It will assign new sequence numbers to all > > > buffers and write those numbers into the wait commands, and ring the hw > > > doorbell to start execution of that queue. > > > > Yeah for implicit sync I think kernel and using drm/scheduler to sort out > > the dma_fence dependencies is probably best. Since you can filter out > > which dma_fence you hand to the scheduler for dependency tracking you can > > filter out your own ones and let the hw handle those directly (depending > > how much your hw can do an all that). On i915 we might do that to be able > > to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler. > > > > For buffer tracking with implicit sync I think cleanest is probably to > > still keep them wrapped as dma_fence and stuffed into dma_resv, but > > conceptually it's the same. If we let every driver reinvent their own > > buffer tracking just because the hw works a bit different it'll be a mess. > > > > Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets > > to keep the pieces if it gets it wrong. You do still need to handle > > external dma_fence though, hence drm/scheduler frontend to sort these out. > > > > The reason is to disallow lower-privileged process to deadlock/hang a > higher-privileged process where the kernel can't tell who did it. If the > implicit-sync sequence counter is read only to userspace and only > incrementable by the unlock-signal command after the lock-wait command > appeared in the same queue (both together forming a critical section), > userspace can't manipulate it arbitrarily and we get almost the exact same > behavior as implicit sync has today. That means any implicitly-sync'd > buffer from any process can be fully trusted by a compositor to signal in a > finite time, and possibly even trusted by the kernel. The only thing that's > different is that a malicious process can disable implicit sync for a > buffer in all processes/kernel, but it can't hang other processes/kernel > (it can only hang itself and the kernel will be notified). So I'm a happy > panda now. :) Yeah I think that's not going to work too well, and is too many piled up hacks. Within a drm_file fd you can do whatever you feel like, since it's just one client. But once implicit sync kicks in I think you need to go with dma_fence and drm/scheduler to handle the dependencies, and tdr kicking it. With the dma_fence you do know who's the offender - you might not know why, but that doesn't matter, you just shred the entire context and let that userspace figure out the details. I think trying to make memory fences work as implicit sync directly, without wrapping them in a dma_fence and assorted guarantees, will just not work. And once you do wrap them in dma_fence, then all the other problems go away: cross-driver sync, syncfiles, ... So I really don't see the benefit of this half-way approach. Yes there's going to be a tad bit of overhead, but that's already there in the current model. And it can't hurt to have a bit of motivation for compositors to switch over to userspace memory fences properly. </blockquote></div></div><div dir="auto"> </div><div dir="auto">Well, Christian thinks that we need a high level synchronization primitive in hw. I don't know myself and you may be right. A software scheduler with user queues might be one option. My part is only to find out how much of the scheduler logic can be moved to the hardware.</div><div dir="auto"> </div><div dir="auto">We plan to have memory timeline semaphores, or simply monotonic counters, and a fence will be represented by the counter address and a constant sequence number for the <= comparison. One counter can represent up to 2^64 different fences. Giving any process write access to a fence is the same as giving it the power to manipulate the signalled state of a sequence of up to 2^64 fences. That could mess up a lot of things. However, if the hardware had a high level synchronization primitive with access rights and a limited set of clearly defined operations such that we can formally prove whether it's safe for everybody, we could have a solution where we don't have to involve the software scheduler and just let the hardware do everything.</div><div dir="auto"> </div><div dir="auto">Marek</div><div dir="auto"> </div><div dir="auto"> </div><div dir="auto"> </div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> -Daniel -- Daniel Vetter Software Engineer, Intel Corporation <a href="http://blog.ffwll.ch" rel="noreferrer noreferrer" target="_blank">http://blog.ffwll.ch</a> </blockquote></div></div></div>