[RFC] Add DMA_RESV_USAGE flags

Wed May 19 11:43:56 UTC 2021

Am 19.05.21 um 00:06 schrieb Jason Ekstrand:
> [SNIP]
>>> E.g. we can't add a fence which doesn't wait for the exclusive one as
>>> shared.
>> Ok I think that's a real problem, and  guess it's also related to all
>> the ttm privatization tricks and all that. So essentially we'd need
>> the opposite of ttm_bo->moving, as in you can't ignore it, but
>> otherwise it completely ignores all the userspace implicit fence
>> stuff.
> Would you mind explaining it to the rest of the class?  I get the need
> to do a TLB flush after a BO is removed from the processes address
> space and I get that it may be super-heavy and that it has to be
> delayed.  I also get that the driver needs to hold a reference to the
> underlying pages until that TLB flush is done.  What I don't get is
> what this has to do with the exclusive fence.  Why can't the driver
> just gather up all the dma_resv fences on the current object (or,
> better yet, just the ones from the current amdgpu process) and wait on
> them all?  Why does it need to insert an exclusive fence that then
> clogs up the whole works?

Because we have mixed up resource management with implicit syncing.

When I sum up all fences in (for example) a dma_fence_array container 
and add that as explicit fence to the dma_resv object resource 
management will do what I want and wait for everything to finish before 
moving or freeing the buffer. But implicit sync will just horrible over 
sync and wait for stuff it shouldn't wait for in the first place.

When I add the fence as shared fence I can run into the problem the the 
TLB flush might finish before the exclusive fence. Which is not allowed 
according to the DMA-buf fencing rules.

We currently have some rather crude workarounds to make use cases like 
this work as expected. E.g. by using a 
dma_fence_chain()/dma_fence_array() and/or adding the explusive fence to 
the shared fences etc etc...

>>>>>>> Let's say that you have a buffer which is shared between two drivers A
>>>>>>> and B and let's say driver A has thrown a fence on it just to ensure
>>>>>>> that the BO doesn't get swapped out to disk until it's at a good
>>>>>>> stopping point.  Then driver B comes along and wants to throw a
>>>>>>> write-fence on it.  Suddenly, your memory fence from driver A causes
>>>>>>> driver B to have to stall waiting for a "good" time to throw in a
>>>>>>> fence.  It sounds like this is the sort of scenario that Christian is
>>>>>>> running into.  And, yes, with certain Vulkan drivers being a bit
>>>>>>> sloppy about exactly when they throw in write fences, I could see it
>>>>>>> being a real problem.
>>>>>> Yes this is a potential problem, and on the i915 side we need to do
>>>>>> some shuffling here most likely. Especially due to discrete, but the
>>>>>> problem is pre-existing. tbh I forgot about the implications here
>>>>>> until I pondered this again yesterday evening.
>>>>>>
>>>>>> But afaiui the amdgpu code and winsys in mesa, this isn't (yet) the
>>>>>> problem amd vk drivers have. The issue is that with amdgpu, all you
>>>>>> supply are the following bits at CS time:
>>>>>> - list of always mapped private buffers, which is implicit and O(1) in
>>>>>> the kernel fastpath
>>>>>> - additional list of shared buffers that are used by the current CS
>>>>>>
>>>>>> I didn't check how exactly that works wrt winsys buffer ownership, but
>>>>>> the thing is that on the kernel side _any_ buffer in there is treated
>>>>>> as a implicit sync'ed write. Which means if you render your winsys
>>>>>> with a bunch of command submission split over 3d and compute pipes,
>>>>>> you end up with horrendous amounts of oversync.
>>>>> What are you talking about? We have no sync at all for submissions from
>>>>> the same client.
>>>> Yes. Except when the buffer is shared with another driver, at which
>>>> point you sync a _lot_ and feel the pain.
>>> Yes, exactly that's the problem.
>>>
>>> We basically don't know during CS if a BO is shared or not.
>>>
>>> We do know that during importing or exporting the BO thought.
>> No you don't. Or at least that's massively awkward, see Jason's reply.
> Please.  In Vulkan, we know explicitly whether or not any BO will ever
> be shared and, if a BO is ever flagged as shared even though it's not,
> that's the app being stupid and they can eat the perf hit.

Yeah, that's not a problem at all. We already have the per BO flag in 
amdgpu for this as well.

> In GL, things are more wishy-washy but GL has so many stupid cases where we
> have to throw a buffer away and re-allocate that one more isn't going
> to be all that bad.  Even there, you could do something where you add
> an in-fence to the BO export operation so that the driver knows when
> to switch from the shared internal dma_resv to the external one
> without having to create a new BO and copy.

Hui what? What do you mean with in-fence here?

> [SNIP]
>> Yeah but why does your userspace not know when a bo is used?
> We always know when a BO is exported because we're the ones doing the
> export call.  Always.  Of course, we don't know if that BO is shared
> with another driver or re-imported back into the same one but is that
> really the case we're optimizing for?

Yes, unfortunately. Exactly that's one of the reasons we couldn't go 
with the per CS per BO flag if it should be shared or exclusive.

>> Or very bluntly, why cant radv do what anv does (or amdvlk if you care
>> more about that, it's the same)? What's missing with lots of blantant
>> lying?
> I'm also not buying this.  You keep claiming that userspace doesn't
> know but GL definitely does know and Vulkan knows well enough.  You
> say that it's motivated by Vulkan and use RADV as an example but the
> only reason why the RADV guys haven't followed the ANV design is to
> work around limitations in amdgpu.  We shouldn't then use RADV to
> justify why this is the right uAPI and why i915 is wrong.

Well, I never said that this is because of RADV. The main motivation we 
had is because of MM engines, e.g. VA-API, VDPAU and OpenMax.

And when we expose a BO with the DMA-buf functions we simply doesn't 
know in userspace if that is then re-imported into VA-API or send to a 
different process.

>>> [SNIP]
>>> Yeah, and that is exactly the reason why I will NAK this uAPI change.
>>>
>>> This doesn't works for amdgpu at all for the reasons outlined above.
>> Uh that's really not how uapi works. "my driver is right, everyone
>> else is wrong" is not how cross driver contracts are defined. If that
>> means a perf impact until you've fixed your rules, that's on you.
>>
>> Also you're a few years too late with nacking this, it's already uapi
>> in the form of the dma-buf poll() support.
> ^^  My fancy new ioctl doesn't expose anything that isn't already
> there.  It just lets you take a snap-shot of a wait instead of doing
> an active wait which might end up with more fences added depending on
> interrupts and retries.  The dma-buf poll waits on all fences for
> POLLOUT and only the exclusive fence for POLLIN.  It's already uAPI.

Well that's not the stuff I'm concerned about. But rather that you want 
to add that as exclusive fence from the shared ones once more.

This prevents the TLB flush case I've outlined from working correctly.

>>>> So the way I see things right now:
>>>> - exclusive fence slot is for implicit sync. kmd should only set it
>>>> when userspace indicates, otherwise you will suffer. Explicit syncing
>>>> userspace needs to tell the kernel with a flag in the CS ioctl when it
>>>> should sync against this exclusive fence and when it should ignore it,
>>>> otherwise you'll suffer badly once more.
>>> That is not sufficient. The explicit sync slot is for kernel internal
>>> memory management.
>> Then we need to split it. But what I discussed with Thomas Hellstrom
>> is that at least for anything except p2p dma-buf ttm_bo->moving should
>> be enough.
> This is starting to sound like maybe roughly the right direction to me
> but I'm still unclear on exactly what problem we're trying to solve
> for TLB invalidates.  I'd like to understand that better before giving
> strong opinions.  I'm also not super-familiar with ttm_bo->moving but
> it sounds like we need some third category of fence somewhere.

Well I would rather say that we should separate the use cases.

E.g. clear APIs for resource management vs. implicit sync.

Christian.

>
> --Jason
>