[Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

Tue May 25 15:05:18 UTC 2021

Hi Daniel,

Am 25.05.21 um 15:05 schrieb Daniel Vetter:
> Hi Christian,
>
> On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
>> Am 21.05.21 um 20:31 schrieb Daniel Vetter:
>>> [SNIP]
>>>> We could provide an IOCTL for the BO to change the flag.
>>> That's not the semantics we need.
>>>
>>>> But could we first figure out the semantics we want to use here?
>>>>
>>>> Cause I'm pretty sure we don't actually need those changes at all and as
>>>> said before I'm certainly NAKing things which break existing use cases.
>>> Please read how other drivers do this and at least _try_ to understand
>>> it. I'm really loosing my patience here with you NAKing patches you're
>>> not even understanding (or did you actually read and fully understand
>>> the entire story I typed up here, and your NAK is on the entire
>>> thing?). There's not much useful conversation to be had with that
>>> approach. And with drivers I mean kernel + userspace here.
>> Well to be honest I did fully read that, but I was just to emotionally
>> attached to answer more appropriately in that moment.
>>
>> And I'm sorry that I react emotional on that, but it is really frustrating
>> that I'm not able to convince you that we have a major problem which affects
>> all drivers and not just amdgpu.
>>
>> Regarding the reason why I'm NAKing this particular patch, you are breaking
>> existing uAPI for RADV with that. And as a maintainer of the driver I have
>> simply no other choice than saying halt, stop we can't do it like this.
>>
>> I'm perfectly aware that I've some holes in the understanding of how ANV or
>> other Vulkan/OpenGL stacks work. But you should probably also admit that you
>> have some holes how amdgpu works or otherwise I can't imagine why you
>> suggest a patch which simply breaks RADV.
>>
>> I mean we are working together for years now and I think you know me pretty
>> well, do you really think I scream bloody hell we can't do this without a
>> good reason?
>>
>> So let's stop throwing halve backed solutions at each other and discuss what
>> we can do to solve the different problems we are both seeing here.
> Well this was meant to be a goal post/semantics discussion starter. Yes
> the patch breaks performance (but not correctness) for amdgpu, but it also
> contains my suggestion for how to fix that issue (in text form at least).

Well as far as I can see this really breaks uAPI, we have unit tests 
exercising this.

But see more on this below.

> Plus a plan how to roll it out so that anyone who cares doesn't hit the
> perf issues this patch can cause.
>
> Also the overall series is really meant as a subsystem wide assessment of
> the status quo. Aside from a correctness issue Lucas spotted in my panfrost
> patches no substantial input from others on this yet unfortunately. I need
> to poke more people I think.
>
> Anyway since the plan as a text didn't stick I'm typing up now something
> more detailed in form of amdgpu patches. Maybe Bas can do the radv
> conversion too.
>
> It won't be complete by far either (I'm not working for amd after all
> ...), I'll leave out the opengl/media side entirely. But if this works for
> radv is should be a useful blueprint for gl/media too (with some changes
> in the interfaces, but not really the exposed semantics).

Yeah, but to make my point clear once more: I can't allow any patch in 
which would change amdgpus existing uAPI behavior.

What we can talk about is changing the behavior by adding flags to the 
fpriv to change the behavior and/or stuff the CS fence by default into 
the exclusive slot.

For the later I think we could do something like using a dma_fence_chain 
for the exclusive fence in amdgpu. This way we would have the same 
semantics in the CS and still support the epoll and Jasons new import IOCTL.

But the semantics of the amdgpu CS interface to not serialize from the 
same process and always serialize if you see some different process must 
stay the same or otherwise I have quite a bunch of angry end users.

>>> That's the other frustration part: You're trying to fix this purely in
>>> the kernel. This is exactly one of these issues why we require open
>>> source userspace, so that we can fix the issues correctly across the
>>> entire stack. And meanwhile you're steadfastily refusing to even look
>>> at that the userspace side of the picture.
>> Well I do fully understand the userspace side of the picture for the AMD
>> stack. I just don't think we should give userspace that much control over
>> the fences in the dma_resv object without untangling them from resource
>> management.
>>
>> And RADV is exercising exclusive sync for amdgpu already. You can do
>> submission to both the GFX, Compute and SDMA queues in Vulkan and those
>> currently won't over-synchronize.
>>
>> When you then send a texture generated by multiple engines to the Compositor
>> the kernel will correctly inserts waits for all submissions of the other
>> process.
>>
>> So this already works for RADV and completely without the IOCTL Jason
>> proposed. IIRC we also have unit tests which exercised that feature for the
>> video decoding use case long before RADV even existed.
> Yeah multiple engines on the producer side works fine with the current
> scheme you have (if we ignore for now that the way amdgpu uses dma_resv is
> different from other drivers by not setting the exclusive slots for
> producers).
>
> Where it breaks down is you have overlapping reads once a frame is
> generated, on either side. E.g.
>
> - compositors read the buffer as consumer
> - but also producer reads the buffer again (maybe reference frame for
>    media, or maybe for some post-processing like motion blurr or whatever).
>
> Then your current scheme really badly oversyncs, while other drivers
> don't have this issue.

No, that is correct behavior. This was added intentionally and we have 
an use cases which rely on this.

The concept is that when two processes access the same buffer object 
they should serialize, no matter what.

You can opt-out of that by setting the EXPLICIT_SYNC flag on the BO.

> Fixing this for amdgpu will have quite a big impact
> on how dma_resv will be used by amdgpu, and that's why I think before
> we've looked at this it doesn't make sense to look at changing dma_resv in
> big ways.
>
> I do think the AMDGPU_SYNC_NE_OWNER trick is pretty neat, and should help
> with some very non-explicit userspace (gl, maybe also media but at least
> ours is internally moving to a more explicit model I think) when you split
> the workload over multiple engines. For vulkan I think the right model is
> AMDGPU_SYNC_EXPLICIT, plus something like what Jason has.
>
> Except there's going to be quite some twist, but I think that's best
> explained in patches since text clearly doesn't work well.

Yeah, I'm also wondering how we can ever merge the two approaches back 
together. I'm not sure that this is possible.

>> And yes I have to admit that I haven't thought about interaction with other
>> drivers when I came up with this because the rules of that interaction
>> wasn't clear to me at that time.
> Ok I think we're at least distilling the core of our disagreement here,
> that's some progress:
>
> - my take: to make implicit sync work well and avoid oversync issues,
>    userspace, and currently amgpu lacks these uapis so needs to get those
>    added.
>
> - your take: please lets not give so much control to userspace
>
> Tbh I'm not yet fully understanding your concern here, but I do agree that
> there's a lot of things that need to be carefully audited here, at least
> in amdgpu. From my driver wide audit I do think in general playing clever
> tricks is ok, since drivers treat the exclusive fence as just a special
> type of shared fence and don't e.g. ignore the shard fences if an
> exclusive one is present. For the generic helper version of this see
> drm_gem_fence_array_add_implicit(), but afaiui amdgpu works the same (or
> at least similar enough), as do other drivers playing funny games.

Well do you mean the other way around? E.g. that you can *not* ignore 
the exclusive fence when shared once are present?

As far as I know nouveau is the only driver left using that (maybe i915 
somewhere, but not sure) and it would really help if we could remove that.

>>> Also I thought through your tlb issue, why are you even putting these
>>> tlb flush fences into the shard dma_resv slots? If you store them
>>> somewhere else in the amdgpu private part, the oversync issues goes
>>> away
>>> - in your ttm bo move callback, you can just make your bo copy job
>>> depend on them too (you have to anyway)
>>> - even for p2p there's not an issue here, because you have the
>>> ->move_notify callback, and can then lift the tlb flush fences from
>>> your private place to the shared slots so the exporter can see them.
>> Because adding a shared fence requires that this shared fence signals after
>> the exclusive fence. And this is a perfect example to explain why this is so
>> problematic and also why why we currently stumble over that only in amdgpu.
> So I also have vague collections that we agreed this is required, but I
> don't think it's the case. I'm at least pretty sure we already have
> multiple drivers which violate this (non of them using TTM yet).

Yeah, I know. Basically everybody which adds more than a CS fence to the 
shared slots will ignore this.

As said above if we drop this (which would be great) at least nouveau 
would need to be fixed.

>> In TTM we have a feature which allows evictions to be pipelined and don't
>> wait for the evicting DMA operation. Without that driver will stall waiting
>> for their allocations to finish when we need to allocate memory.
>>
>> For certain use cases this gives you a ~20% fps increase under memory
>> pressure, so it is a really important feature.
> Yeah that's something I'm banging my head against right now a bit for my
> amdgpu demo patch series.
>
>> This works by adding the fence of the last eviction DMA operation to BOs
>> when their backing store is newly allocated. That's what the
>> ttm_bo_add_move_fence() function you stumbled over is good for: https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692
>>
>> Now the problem is it is possible that the application is terminated before
>> it can complete it's command submission. But since resource management only
>> waits for the shared fences when there are some there is a chance that we
>> free up memory while it is still in use.
> Hm where is this code? Would help in my audit that I wanted to do this
> week? If you look at most other places like
> drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't
> treat the shared fences special and always also include the exclusive one.

See amdgpu_gem_object_close():

...
         fence = dma_resv_get_excl(bo->tbo.base.resv);
         if (fence) {
                 amdgpu_bo_fence(bo, fence, true);
                 fence = NULL;
         }
...

We explicitly added that because resource management of some other 
driver was going totally bananas without that.

But I'm not sure which one that was. Maybe dig a bit in the git and 
mailing history of that.

>> Because of this we have some rather crude workarounds in amdgpu. For example
>> IIRC we manual wait for any potential exclusive fence before freeing memory.
>>
>> We could enable this feature for radeon and nouveau as well with an one line
>> change. But that would mean we need to maintain the workarounds for
>> shortcomings of the dma_resv object design in those drivers as well.
>>
>> To summarize I think that adding an unbound fence to protect an object is a
>> perfectly valid operation for resource management, but this is restricted by
>> the needs of implicit sync at the moment.
> Hm how are unbound fences (what do you mean with this exactly) come into
> play here now? I think you lost me here on the last 1-2 paragraphs, before
> that I think I followed.

Unbound like in not waiting for the exclusive fence.

>>> The kernel move fences otoh are a bit more nasty to wring through the
>>> p2p dma-buf interface. That one probably needs something new.
>> Well the p2p interface are my least concern.
>>
>> Adding the move fence means that you need to touch every place we do CS or
>> page flip since you now have something which is parallel to the explicit
>> sync fence.
>>
>> Otherwise having the move fence separately wouldn't make much sense in the
>> first place if we always set it together with the exclusive fence.
> Yeah it's a bunch of work. But for i915 this is the model we have, so we
> have to do it anyway, so I have really good excuse here to do that ttm
> audit.
>
>> Best regards and sorry for getting on your nerves so much,
> tbh I've also been rather thinly nerved on this. One side was that I spent
> the last 1+ years having an eerily similar with i915-gem team about how a
> single driver can't just have totally different rules for core stuff like
> dma_resv/fence/locks, and unfortunately that entire story went really,
> horribly wrong :-/ So I'm very much "oh noes pls not again".

Dito. I mean I didn't came up with the approach for amdgpu out of 
nowhere, but rather because we had some specific use cases for that.

Back in the R6xx days some guys where putting a semaphore at the end of 
each BO and serialized engine access to the BO using that. E.g. see how 
radeon works internally with the semaphore block.

This is basically the same concept implemented with the dma_resv object 
instead.

> But also the long w/e here helped, yay :-)

Yeah, indeed.

Cheers,
Christian.

>
> I think there's a few questions here that we can ping/pong a bit more, but
> I think for the next meaningful round I need to get this draft set of
> patches a bit into shape here, and audit more code. I think hopefully
> early next week I'll have something which isn't too much full of holes
> which should help in moving forward on this discussion.
>
> Cheers, Daniel