[PATCH 1/6] dma-buf: add dynamic DMA-buf handling v13
ckoenig.leichtzumerken at gmail.com
Wed Jul 31 11:19:37 UTC 2019
Am 31.07.19 um 12:39 schrieb Daniel Vetter:
> On Wed, Jul 31, 2019 at 11:44 AM Christian König
> <ckoenig.leichtzumerken at gmail.com> wrote:
>> Am 31.07.19 um 11:12 schrieb Daniel Vetter:
>>> I think I brought this up before, but new top-post for a clean start.
>>> Use-case I have in mind is something like amdkfd's model, where you have a
>>> list of buffers (per context or whatever) that you always need to have
>>> present. Idea is to also use this for traditional CS for vk/gl, to cut
>>> down on the buffer management overhead, but we'd still allow additional
>>> buffers to be listed per-CS on top of that default working set.
>>> This of course means no implicit sync anymore on these default buffers
>>> (the point is to avoid touching every buffer on every CS, updating fences
>>> would defeat that). That's why the CS can still list additional buffers,
>>> the only reason for that is to add implicit sync fences. Those buffers
>>> would be most likely in the default working set already.
>>> Consequence is that I want the amdkfd model of "evict when needed, but
>>> keep resident by default", but also working implicit fences. And it must
>>> be doable without touching every bo on every CS. Listing possible
>>> implementation options:
>>> - the amdkfd trick doesn't work because it would break implicit fencing -
>>> any implicit sync would always result in the context getting
>>> preempted/evicted, which isn't great.
>> I'm actually working on re-working implicit fencing towards better
>> supporting this.
>> Basic idea is that you split up the fences in a reservation object into
>> three categories:
>> 1. Implicit sync on write.
>> 2. Implicit sync on read.
>> 3. No implicit sync at all.
> Not really sure what you want to do here ... implicit sync is opt-in
> (or opt-out flag if you need to keep CS backwards compat) per BO/CS.
> At least when we discussed this forever at some XDCs consensus was
> that storing the implicit sync mode on the BO is not going to work.
Well that's exactly what we are doing for amdgpu and this is working
perfectly. See flag AMDGPU_GEM_CREATE_EXPLICIT_SYNC.
>> This should not only help the KFD, but also with amdgpu command
>> submission and things like page tables updates.
>> E.g. we need to fences for page tables updates around in reservation
>> objects as well, but you really really really don't want any implicit
>> synchronization with them :)
> Why do you even try to do implicit sync with your pagetables? How can
> your pagetables even get anywhere near where implicit sync matters?
When you unmap a BO from a page table the BO needs to stay in the same
place until the unmap operation is completed.
Otherwise you open up a short window where a process could access memory
which doesn't belongs to the process.
This unmap operation is usually an SDMA operation and nobody except for
the memory management needs to take this into account.
>>> - we share the resv_obj between all the buffers in the default working set
>>> of a context, with unsharing/resharing the resv_obj if they enter/leave
>>> the default working set. That way there's only one resv_obj to update on
>>> each CS, and we can attach a new shared fence for every CS. Trouble is
>>> that this means a given buffer can only be part of one default working
>>> set, so all shared buffers would need to be listed again separately. Not
>>> so great if userspace has to deal with that fairly arbitrary limitation.
>> Yeah, that is exactly what we do with the per VM BOs in amdgpu.
>> The limitation that you have only one working set actually turned out to
>> be not a limitation at all, but rather seen as something welcomed by our
>> Vulkan guys.
> We have per-ctx vms in i915, but I guess even for those sharing will be limited.
In amdgpu we had this funky stuff with bo lists which should represent
the resources used for a command submission.
But after actually talking to the Vulkan and other userspace guys we
completely deprecated that.
We settled on having per process resources which are always valid and a
dynamic list of resources you send to the kernel with each command
>> I also don't really see a way to have an implementation with good
>> performance where BOs can be in multiple working sets at the same time.
>>> - we allow the ->move_notify callback to add new fences, which the
>>> exporter needs to wait on before it schedules the pipelined move. This
>>> also avoids the per-BO update on every CS, and it would allow buffers to
>>> be shared and to be in multiple default working sets. The downside is
>>> that ->move_notify needs to be able to cope with added fences, which
>>> means we might need to grow the shared fences array, which might fail
>>> with ENOMEM. Not great. We could fix this with some kind of permanent
>>> shared fence slot reservations (i.e. a reserved slot which outlives
>>> holding the reservation lock), but that might waste quite a bit of
>>> memory. Probably not real problem in the grand scheme of things. I think
>>> the fence itself can be preallocated per context, so that isn't the
>> Well the ENOMEM problem is the least of the problems with this approach.
>> You can still block for the fence which you wanted to add.
>> The real problem is that you can't tell if a BO is busy or not by just
>> looking at its current fences.
>> In other words when you stop adding fences you also want to stop moving
>> them individually on the LRU.
> Well yeah, otherwise you're back a per BO overhead on CS. That's kinda
> obvious. But also not sure why this matters, since atm we don't have
> any proposed means to pass LRU updates through dma-buf. So exporter
> (even with dynamic dma-buf) has to pessimistically assume already that
> the exported BO is more busy than what the LRU position suggests, and
> evict them later on. If we also need to funnel LRU updates over
> dma-buf, then that's another beast entirely (and probably not what we
> want to do at all).
Yeah, that's a really good point for DMA-buf.
But I'm thinking a bit wider here. Essentially we need some ideas which
not only work for DMA-buf, but also inside the same driver.
> Also if you don't want to stall, then just reject after ->move_notify
> wherethere the bo is still idle, and give up if so. Or we can add a
> counter to indicate a bo needs to be considered busy.
>> When the memory management evicts one BO you essentially kick out a
>> whole process/working set.
>> So when you want to kick out the next BO you actually want to do this
>> for BOs which now became available anyway.
>> That approach won't work with the move_notify callback.
> So essentially we don't just want move_notify, but also full LRU information?
At least for DMA-buf I was trying to avoid that. But in the long term it
might be necessary, yes.
>>> - same as above, but the new fence doesn't get added, but returned to the
>>> caller, and the exporter deals with the ENOMEM mess. Might not work
>>> since an importer could have a lot of contexts using a given object, and
>>> so would have a lot of fences to add.
>> I don't think that this will work.
>> See you not only need to be able to add the fence to the BO currently
>> evicted, but also to all other BO in your process/working set.
>> Additional to that moving the ENOMEM handling from the importer to the
>> exporter sounds as helpful as adding another layer of abstraction :)
> You can go and pick a different victim to evict, or just give up
> (there's not much you can do with ENOMEM). Instead of data corruption
> because you're not waiting for a fence you should be waiting on.
Ok, but you can as well return -ENOMEM from the move_notify callback.
What I wanted to say is that by returning the error code you are still
more flexible than returning the fence. On other words returning the
fence doesn't seem to help...
>>> - something entirely different?
>>> Cheers, Daniel
More information about the amd-gfx