[RFC] Add DMA_RESV_USAGE flags

Daniel Vetter daniel at ffwll.ch
Thu May 20 07:58:59 UTC 2021


On Wed, May 19, 2021 at 1:24 PM Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
>
> Am 18.05.21 um 23:17 schrieb Daniel Vetter:
> > [SNIP]
> >> The problem in this case is not starting a new CS, but synchronizing to
> >> the existing ones.
> >>
> >> See, a heavy TLB flush is done completely without synchronization.
> >> E.g. it doesn't want to wait for any previous operation.
> >>
> >> In other words, imagine the following example:
> >> 1. Both processes A and B have a BO mapped.
> >> 2. Process A is heavily using the BO and doing all kinds of rendering.
> >> 3. Process B is unmapping the BO.
> >>
> >> When process B unmaps the BO, that needs to trigger page table updates
> >> and a heavy TLB flush, but since this can take really long we want to
> >> do it asynchronously on the hardware.
> >>
> >> With the current approach you basically can't do that because you can't
> >> note that a fence should not participate in synchronization at all.
> >>
> >> E.g. we can't add a fence as shared which doesn't wait for the
> >> exclusive one.
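
To make that concrete, here is a rough sketch of the kind of usage flag
this RFC is after. All names below are made up for illustration and
don't match any existing API:

    /*
     * Hypothetical sketch: tag each fence with how it participates in
     * synchronization when it gets added to the dma_resv object.
     */
    enum dma_resv_usage {
            DMA_RESV_USAGE_WRITE,    /* implicit sync, writer */
            DMA_RESV_USAGE_READ,     /* implicit sync, reader */
            DMA_RESV_USAGE_BOOKKEEP, /* resource management only */
    };

    /* Process B queues the async TLB flush: memory management has to
     * wait for this fence before reusing the pages, but process A's
     * rendering (implicit sync) ignores it entirely. */
    dma_resv_add_fence(bo->base.resv, tlb_flush_fence,
                       DMA_RESV_USAGE_BOOKKEEP);
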
> > Ok I think that's a real problem, and I guess it's also related to
> > all the ttm privatization tricks and all that. So essentially we'd
> > need the opposite of ttm_bo->moving, as in you can't ignore it, but
> > otherwise it completely ignores all the userspace implicit fence
> > stuff.
>
> It goes into that direction, but doesn't sound like the full solution
> either.
>
> [SNIP]
> > Can we please stop with the "amdgpu is right, everyone else is wrong" approach?
>
> Well, the approach I'm taking here is not "amdgpu is right, everyone
> else is wrong". Rather, we had DRM uAPI for i915, nouveau and radeon
> and unfortunately leaked that into DMA-buf without much thinking about
> it.
>
> I'm also not saying that the amdgpu approach is right. It's just what
> amdgpu needs in its CS interface.
>
> What I'm saying is that DMA-buf is a device-driver-independent
> subsystem, and we shouldn't bake assumptions that come from just a
> handful of DRM drivers into its implicit sync implementation.
>
> > Like I'm pretty much going to type up the patch that does a full drm
> > subsystem audit of everything and whack amdgpu into compliance. Perf
> > hit be damned, you had a few years to fix this with better uapi. Or I
> > find out that there's a giant inconsistent mess, but at least we'd
> > gain some clarity about where exactly we are here and maybe what to do
> > next.
>
> OK, to let us move forward, please take a look at the first patches of
> the set. They clean up quite a bit of the mess we have in there before
> we even come to adding flags to the shared slots.
>
> I think you will agree that what we should do is clean up the use
> cases further and separate implicit sync from resource management.

Just replying to this because I'm a bit busy with reviewing everything
we have in upstream right now.

I agree there's some useful stuff in there, but we have a fundamental
disagreement on how this works. That needs to be resolved first, and
as part of that we need to come up with a plan for how to get everyone
on the same page.

The next thing is a plan for how to get the various issues you're
raising around the dma_resv rules sorted out.

Once we have that, and only then, does it imo make sense to
review/merge cleanup patches. As long as we have fundamental
disagreements along the lines we have here, there's no point.

I should have a patch set maybe tomorrow or early next week with the
results of my drm subsystem review of how exactly dma_resv is used
currently. Thus far it's a few pages of code analysis, but not yet
complete. I also found some smaller issues in a few places, so the
discussion is going to involve a few more people until we're settled
here :-/

Cheers, Daniel


> In other words, we forbid touching the exclusive and shared fences
> directly and have separate APIs for resource management and implicit
> sync.
>
> This makes sense anyway, no matter what implicit synchronization
> framework we end up installing underneath.
>
> Regards,
> Christian.
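
For illustration, that kind of split could look roughly like the
following. The function names here are purely hypothetical:

    /* Resource management: fences that may never be ignored, e.g. TTM
     * waits for all of them before moving or freeing memory. */
    void dma_resv_add_resource_fence(struct dma_resv *obj,
                                     struct dma_fence *fence);

    /* Implicit sync: the only thing CS ioctls and dma-buf importers
     * are allowed to look at. */
    void dma_resv_add_implicit_fence(struct dma_resv *obj,
                                     struct dma_fence *fence,
                                     bool write);
    int dma_resv_get_implicit_fences(struct dma_resv *obj,
                                     bool include_readers,
                                     unsigned int *num_fences,
                                     struct dma_fence ***fences);

With the exclusive/shared slots hidden behind such entry points,
drivers could no longer mix up the two use cases by poking at the
slots directly.
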
>
> > -Daniel
> >
> >> Regards,
> >> Christian.
> >>
> >>> After that I think we can look at what exact oversync issue remains
> >>> and why, and solve it. But until we have this, this just feels like
> >>> another rehash of "amdgpu insists its own dma_resv interpretation is
> >>> the right one and everyone else should move over".
> >>>
> >>> Or maybe I've just become real garbage at reading random driver code,
> >>> wouldn't be the first time :-)
> >>>
> >>> Cheers, Daniel
> >>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> Cheers, Daniel
> >>>>>
> >>>>>> --Jason
> >>>>>>
> >>>>>>
> >>>>>>>> That's also the reason the Valve guys came up with a solution where each
> >>>>>>>> BO gets a flag for explicit sync, but that only works for exports and
> >>>>>>>> not for imports.
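
For reference, that per-BO flag is AMDGPU_GEM_CREATE_EXPLICIT_SYNC in
amdgpu's GEM uapi. If I have the details right, opting an allocation
out of implicit sync looks roughly like this:

    union drm_amdgpu_gem_create args = {0};

    args.in.bo_size      = size;
    args.in.alignment    = 4096;
    args.in.domains      = AMDGPU_GEM_DOMAIN_VRAM;
    /* Opt this BO out of implicit sync on the amdgpu CS path. As
     * noted above, this only helps on the exporting side. */
    args.in.domain_flags = AMDGPU_GEM_CREATE_EXPLICIT_SYNC;

    ioctl(drm_fd, DRM_IOCTL_AMDGPU_GEM_CREATE, &args);
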
> >>>>>>>>
> >>>>>>>>> i915 and iirc msm have explicit flags for this, and panfrost was
> >>>>>>>>> designed to support this correctly from the start (also with flags I
> >>>>>>>>> think). That's at least what I remember from all the discussions at
> >>>>>>>>> XDC and #dri-devel, but I didn't check the code again to give you the
> >>>>>>>>> list of uapi flags you need for each driver.
> >>>>>>>>>
> >>>>>>>>> The other piece is making sure you're only picking up implicit fences when
> >>>>>>>>> you should, and not any later ones, for which Jason has a solution:
> >>>>>>>>>
> >>>>>>>>> https://lore.kernel.org/dri-devel/20210317221940.2146688-1-jason@jlekstrand.net/
> >>>>>>>> Yes, I helped with that as well. But I think this is just another
> >>>>>>>> workaround without really addressing the underlying problem.
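
For context, the uapi proposed in that series looks roughly like this
(names as proposed at the time, they may still change):

    struct dma_buf_export_sync_file arg = {
            .flags = DMA_BUF_SYNC_READ, /* we only want to read, so
                                         * wait for writers only */
            .fd    = -1,
    };

    /* Snapshot the buffer's current implicit fences into a sync_file,
     * at the exact moment userspace decides it needs them. */
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg);
    /* arg.fd now holds the sync_file fd to wait on. */
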
> >>>>>>>>
> >>>>>>>>> If amdgpu isn't using those, then you will suffer from
> >>>>>>>>> over-synchronization in vulkan and pay a price. The entire point of vulkan
> >>>>>>>>> is that you pick up sync points very explicitly, and we also need to have
> >>>>>>>>> very explicit uapi for userspace to pick up/set the implicit fences.
> >>>>>>>>>
> >>>>>>>>> Trying to paper over this with more implicit magic is imo just wrong, and
> >>>>>>>>> definitely not the long term explicit sync model we want.
> >>>>>>>> I completely disagree.
> >>>>>>>>
> >>>>>>>> In my opinion the implicit sync model we have for dma_resv currently is
> >>>>>>>> just not well designed at all, since it always requires cooperation from
> >>>>>>>> userspace.
> >>>>>>>>
> >>>>>>>> In other words you need to know when to enable implicit sync in
> >>>>>>>> userspace and that information is simply not present all of the time.
> >>>>>>>>
> >>>>>>>> What we have done here is just keep the old reader/writer flags that
> >>>>>>>> i915, radeon and nouveau once had and push that out to everybody else,
> >>>>>>>> assuming everybody would follow along without ever documenting the
> >>>>>>>> actual rules of engagement you need to follow here.
> >>>>>>>>
> >>>>>>>> That was a really big mistake and we should try to fix that sooner or
> >>>>>>>> later. The only other clean alternative I see is to use a flag on the
> >>>>>>>> exporter to tell the importer if it should sync to shared fences or not.
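
Such an exporter-side flag might look roughly like this. The field
below is invented purely for illustration and does not exist:

    struct dma_buf_export_info exp_info = {
            .exp_name = KBUILD_MODNAME,
            .ops      = &amdgpu_dmabuf_ops,
            .size     = bo->tbo.base.size,
            .resv     = bo->tbo.base.resv,
            /* hypothetical: importers must sync to shared fences */
            .sync_to_shared_fences = true,
    };
    struct dma_buf *dmabuf = dma_buf_export(&exp_info);
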
> >>>>>>>>
> >>>>>>>> In addition to that, I'm perfectly fine with implicit sync. Explicit
> >>>>>>>> sync certainly has some use cases as well, but I don't see it as an
> >>>>>>>> absolute advantage over the implicit model.
> >>>>>>> Ok this stops making sense. Somehow you claim userspace doesn't know
> >>>>>>> when to sync, but somehow the kernel does? By guessing, and getting it
> >>>>>>> wrong mostly, except for the one case that you benchmarked?
> >>>>>>>
> >>>>>>> Aside from silly userspace which exports a buffer to a dma-buf, but
> >>>>>>> then never imports it anywhere else, there isn't a case I know of
> >>>>>>> where the kernel actually knows more than userspace. But there's lots
> >>>>>>> of cases where the kernel definitely knows less, especially if
> >>>>>>> userspace doesn't tell it about what's going on with each rendering
> >>>>>>> and buffer.
> >>>>>>>
> >>>>>>> So here's the 2 things you need to make this work like every other driver:
> >>>>>>>
> >>>>>>> 1. A way to set the explicit fence on a buffer. The CS ioctl is
> >>>>>>> perfectly fine, but it can also be separate. Userspace uses this only
> >>>>>>> on a) shared
> >>>>>>> buffers b) when there's a flush/swap on that shared buffer. Not when
> >>>>>>> rendering any of the interim stuff, that only leads to oversync.
> >>>>>>> Anything non-shared is handled explicitly in userspace (at least for
> >>>>>>> modern-ish drivers). This is the only thing that ever sets an
> >>>>>>> exclusive fence (aside from ttm moving buffers around ofc).
> >>>>>>>
> >>>>>>> 2. A way to sync with the implicit fences, either all of them (for
> >>>>>>> upcoming write access) or just the write fence (for read access). At
> >>>>>>> first we thought it would be good enough to do this in the CS ioctl,
> >>>>>>> but that's a wee bit too late, hence the patches from Jason. My
> >>>>>>> understanding is that vulkan converts this into a vk syncobj/fence of
> >>>>>>> some sort, so you really can't make this more explicit and intentional
> >>>>>>> than that.
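
Concretely, point 2 is the export-side ioctl sketched further up, and
point 1 would be roughly the import half of the same series (again,
names as proposed there):

    /* After the flush/swap on a shared buffer, and only then: attach
     * the render-done fence as the buffer's implicit/exclusive fence. */
    struct dma_buf_import_sync_file arg = {
            .flags = DMA_BUF_SYNC_WRITE,  /* we wrote the buffer */
            .fd    = render_done_sync_file_fd,
    };

    ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
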
> >>>>>>>
> >>>>>>> None of this is something the kernel has the slightest idea about when
> >>>>>>> it happens, so you have to have explicit uapi for it. Trying to fake
> >>>>>>> it in the kernel just doesn't work.
> >>>>>>> -Daniel


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

