[Intel-gfx] RFC: Add write flag to reservation object fences

Fri Aug 10 09:21:45 UTC 2018

On Fri, Aug 10, 2018 at 11:14 AM, Christian König
<christian.koenig at amd.com> wrote:
> Am 10.08.2018 um 10:29 schrieb Daniel Vetter:
>>
>> [SNIP]
>> I'm only interested in the case of shared buffers. And for those you
>> _do_ pessimistically assume that all access must be implicitly synced.
>> Iirc amdgpu doesn't support EGL_ANDROID_native_fence_sync, so this
>> makes sense that you don't bother with it.
>
>
> See flag AMDGPU_GEM_CREATE_EXPLICIT_SYNC.
>
>
>>
>>>> - as a consequence, amdgpu needs to pessimistically assume that all
>>>> writes to shared buffer need to obey implicit fencing rules.
>>>> - for shared buffers (across process or drivers) implicit fencing does
>>>> _not_ allow concurrent writers. That limitation is why people want to
>>>> do explicit fencing, and it's the reason why there's only 1 slot for
>>>> an exclusive. Note I really mean concurrent here, a queue of in-flight
>>>> writes by different batches is perfectly fine. But it's a fully
>>>> ordered queue of writes.
>>>> - but as a consequence of amdgpu's lack of implicit fencing and hence
>>>> need to pessimistically assume there's multiple write fences amdgpu
>>>> needs to put multiple fences behind the single exclusive slot. This is
>>>> a limitation imposed by by the amdgpu stack, not something inherit to
>>>> how implicit fencing works.
>>>> - Chris Wilson's patch implements all this (and afaics with a bit more
>>>> coffee, correctly).
>>>>
>>>> If you want to be less pessimistic in amdgpu for shared buffers, you
>>>> need to start tracking which shared buffer access need implicit and
>>>> which explicit sync. What you can't do is suddenly create more than 1
>>>> exclusive fence, that's not how implicit fencing works. Another thing
>>>> you cannot do is force everyone else (in non-amdgpu or core code) to
>>>> sync against _all_ writes, because that forces implicit syncing. Which
>>>> people very much don't want.
>>>
>>>
>>> I also do see the problem that most other hardware doesn't need that
>>> functionality, because it is driven by a single engine. That's why I
>>> tried
>>> to keep the overhead as low as possible.
>>>
>>> But at least for amdgpu (and I strongly suspect for nouveau as well) it
>>> is
>>> absolutely vital in a number of cases to allow concurrent accesses from
>>> the
>>> same client even when the BO is then later used with implicit
>>> synchronization.
>>>
>>> This is also the reason why the current workaround is so problematic for
>>> us.
>>> Cause as soon as the BO is shared with another (non-amdgpu) device all
>>> command submissions to it will be serialized even when they come from the
>>> same client.
>>>
>>> Would it be an option extend the concept of the "owner" of the BO amdpgu
>>> uses to other drivers as well?
>>>
>>> When you already have explicit synchronization insider your client, but
>>> not
>>> between clients (e.g. still uses DRI2 or DRI3), this could also be rather
>>> beneficial for others as well.
>>
>> Again: How you synchronize your driver internal rendering is totally
>> up to you. If you see an exclusive fence by amdgpu, and submit new
>> rendering by amdgpu, you can totally ignore the exclusive fence. The
>> only api contracts for implicit fencing are between drivers for shared
>> buffers. If you submit rendering to a shared buffer in parallel, all
>> without attaching an exclusive fence that's perfectly fine, but
>> somewhen later on, depending upon protocol (glFlush or glxSwapBuffers
>> or whatever) you have to collect all those concurrent write hazards
>> and bake them into 1 single exclusive fence for implicit fencing.
>>
>> Atm (and Chris seems to concur) the amdgpu uapi doesn't allow you to
>> do that, so for anything shared you have to be super pessimistic.
>> Adding a HAND_OFF_FOR_IMPLICIT_FENCING flag/ioctl would probably fix
>> that. Only when that flag is set would you take all shared write
>> hazards and bake them into one exclusive fence for hand-off to the
>> next driver. You'd also need the same when receiving an implicitly
>> fenced buffer, to make sure that your concurrent writes do synchronize
>> with reading (aka shared fences) done by other drivers. With a bunch
>> of trickery and hacks it might be possible to infer this from current
>> ioctls even, but you need to be really careful.
>
>
> A new uapi is out of question because we need to be backward compatible.

Since when is new uapi out of the question for a performance improvement?

>> And you're right that amdgpu seems to be the only (or one of the only)
>> drivers which do truly concurrent rendering to the same buffer (not
>> just concurrent rendering to multiple buffers all suballocated from
>> the same bo). But we can't fix this in the kernel with the tricks you
>> propose, because without such an uapi extension (or inference) we
>> can't tell the implicit fencing from the explicit fencing case.
>
>
> Sure we can. As I said for amdgpu that is absolutely no problem at all.
>
> In your terminology all rendering from the same client to a BO is explicitly
> fenced, while all rendering from different clients are implicit fenced.
>
>> And for shared buffers with explicit fencing we _must_ _not_ sync against
>> all writes. owner won't help here, because it's still not tracking
>> whether something is explicit or implicit synced.
>
>
> Implicit syncing can be disable by giving the
> AMDGPU_GEM_CREATE_EXPLICIT_SYNC flag while creating the BO.
>
>> We've cheated a bit with most other drivers in this area, also because
>> we don't have to deal with truly concurrent rendering.
>
>
> Yeah, absolutely nailed it. And this cheating is now completely breaking my
> neck because it doesn't work well at all with the requirements I have at
> hand here.
>
>> So it's not
>> obvious that we're not tracking writes/reads, but implicit/explicit
>> fencing. But semantically we track the later for shared buffers.
>>
>> Cheers, Daniel
>>
>> PS: One idea I have for inference: Every time you see a shared buffer
>> in an amdgpu CS:
>> 1. Grab reservation lock
>> 2. Check all the fences' creators. If any of them are foreign (not by
>> amdgpu), then run the current pessimistic code.
>
>
> That is exactly what we already do.
>
>> 3. If all fences are by amdgpu
>> - Look at the exclusive fence. If it's a ttm bo move keep it, if it's
>> marked as a special implicit syncing fence, ignore it.
>> - Run all CS concurrently by storing all their write fences in the shared
>> slots.
>> - Create a fake exclusive fence which ties all the write hazards into
>> one fence. Mark them as special implicit syncing fences in your
>> amdgpu_fence struct. This will make sure other drivers sync properly,
>> but since you ignore these special it won't reduce amdgpu-internal
>> concurrency.
>
>
> That won't work, adding the exclusive fence removes all shared fences and I
> still need to have those for the sync check above.
>
> I need something like a callback from other drivers that the reservation
> object is now used by them and no longer by amdgpu.

Then don't track _any_ of the amdgpu internal fences in the reservation object:
- 1 reservation object that you hand to ttm, for use internally within amdgpu
- 1 reservation object that you attach to the dma-buf (or get from the
imported dma-buf), where you play all the tricks to fake fences.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch