[RFC PATCH 00/10] drm/panthor: Add user submission

Wed Sep 4 11:34:18 UTC 2024

Hi Boris,

On 04.09.24 at 13:23, Boris Brezillon wrote:
>>>>> Please read up here on why that stuff isn't allowed:
>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences   
>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
>>>> memory management easy mode.
>>> Ok, that at least makes things work for the moment.
>> Ah, perhaps this should have been spelt out more clearly ;)
>>
>> The VM_BIND mechanism that's already in place jumps through some hoops
>> to ensure that memory is preallocated when the memory operations are
>> enqueued. So any memory required should have been allocated before any
>> sync object is returned. We're aware of the issue with memory
>> allocations on the signalling path and trying to ensure that we don't
>> have that.
>>
>> I'm hoping that we don't need a shrinker which deals with (active) GPU
>> memory with our design.
> That's actually what we were planning to do: the panthor shrinker was
> about to rely on fences attached to GEM objects to know if it can
> reclaim the memory. This design relies on each job attaching its fence
> to the GEM mapped to the VM at the time the job is submitted, such that
> memory that's in-use or about-to-be-used doesn't vanish before the GPU
> is done.

Yeah, and that is exactly what stops working once you use user
queues, because the kernel has no opportunity to attach a fence for each
submission.

>> Memory which user space thinks the GPU might
>> need should be pinned before the GPU work is submitted. APIs which
>> require any form of 'paging in' of data would need to be implemented by
>> the GPU work completing and being resubmitted by user space after the
>> memory changes (i.e. there could be a DMA fence pending on the GPU work).
> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
> that means we can't really transparently swap out GPU memory, or we
> have to constantly pin/unpin around each job, which means even more
> ioctl()s than we have now. Another option would be to add the XGS fence
> to the BOs attached to the VM, assuming it's created before the job
> submission itself, but you're no longer reducing the number of user <->
> kernel round trips if you do that, because you now have to create an
> XGS job for each submission, so you basically get back to one ioctl()
> per submission.

For AMDGPU we are currently working on the following solution with 
memory management and user queues:

1. User queues are created through a kernel IOCTL; submissions work by
writing into a ring buffer and ringing a doorbell.

2. Each queue can ask the kernel to create fences for the work currently
pushed to that queue; these can then be attached to BOs, syncobjs,
syncfiles, etc.

3. In addition to that, we have an eviction/preemption fence attached to
all BOs, page tables, and whatever other resources we need.

4. When these eviction fences are requested to signal, they first wait for
all submission fences, then suspend the user queues and block the
creation of new submission fences until the queues are restarted.

This way you can still do your memory management inside the kernel (e.g. 
move BOs from local to system memory) or even completely suspend and 
resume applications without their interaction, but as Sima said it is 
just horribly complicated to get right.
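
As a rough illustration of steps 1-4, here is a toy user-space model in
Python. It is not amdgpu code: the class and method names (UserQueue,
EvictionFence, request_signal, restart, drain) are made up for this
sketch, waiting on the GPU is replaced by a drain() call, and a blocked
fence request is modelled as an exception so that everything stays
single-threaded.

```python
class Fence:
    """Minimal fence: starts unsignaled, signals exactly once."""

    def __init__(self, name):
        self.name = name
        self.signaled = False

    def signal(self):
        self.signaled = True


class UserQueue:
    """Hypothetical user queue; creating it stands in for the IOCTL of step 1."""

    def __init__(self):
        self.ring = []                # ring buffer written by user space
        self.submission_fences = []   # fences the kernel created on request
        self.suspended = False

    def push(self, work):
        # Step 1: submission is a ring write plus doorbell, no kernel call.
        self.ring.append(work)

    def create_submission_fence(self):
        # Step 2: on request, the kernel fences the work pushed so far.
        if self.suspended:
            # Step 4: a real kernel would presumably block the caller here;
            # the model raises so the demo stays single-threaded.
            raise RuntimeError("queue suspended, no new submission fences")
        fence = Fence(f"submission-{len(self.submission_fences)}")
        self.submission_fences.append(fence)
        return fence

    def drain(self):
        # Stand-in for waiting until the GPU has finished all pushed work.
        self.ring.clear()
        for fence in self.submission_fences:
            fence.signal()


class EvictionFence(Fence):
    """Step 3: one fence attached to every resource the queues may touch."""

    def __init__(self, queues):
        super().__init__("eviction")
        self.queues = queues

    def request_signal(self):
        # Step 4: wait for all submission fences, then suspend the queues.
        for queue in self.queues:
            queue.drain()
            queue.suspended = True
        self.signal()


def restart(queues):
    """Resume the queues and hand out a fresh eviction fence."""
    for queue in queues:
        queue.suspended = False
    return EvictionFence(queues)


queue = UserQueue()
queue.push("draw")
submission = queue.create_submission_fence()
eviction = EvictionFence([queue])

eviction.request_signal()      # the kernel now wants to move a BO
# ... memory can be moved here, since nothing is running ...
new_eviction = restart([queue])
```

The point of the model is only the ordering: once the eviction fence has
signaled, every submission fence has signaled too and no new ones can
appear until restart() is called.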

We have been working on this for about two years now, and we could still
have missed something, since it has not been through production testing yet.

Regards,
Christian.