[PATCH] RFC: dma-fence: Document recoverable page fault implications

Wed Jan 27 12:16:59 UTC 2021

On 27.01.21 at 13:11, Maarten Lankhorst wrote:
> On 27-01-2021 at 01:22, Felix Kuehling wrote:
>> On 2021-01-21 at 2:40 p.m., Daniel Vetter wrote:
>>> Recently there was a fairly long thread about recoverable hardware page
>>> faults, how they can deadlock, and what to do about that.
>>>
>>> While the discussion is still fresh I figured it's a good time to try and
>>> document the conclusions a bit.
>>>
>>> References: https://lore.kernel.org/dri-devel/20210107030127.20393-1-Felix.Kuehling@amd.com/
>>> Cc: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
>>> Cc: Thomas Hellström <thomas.hellstrom at intel.com>
>>> Cc: "Christian König" <christian.koenig at amd.com>
>>> Cc: Jerome Glisse <jglisse at redhat.com>
>>> Cc: Felix Kuehling <felix.kuehling at amd.com>
>>> Signed-off-by: Daniel Vetter <daniel.vetter at intel.com>
>>> Cc: Sumit Semwal <sumit.semwal at linaro.org>
>>> Cc: linux-media at vger.kernel.org
>>> Cc: linaro-mm-sig at lists.linaro.org
>>> --
>>> I'll be away next week, but figured I'll type this up quickly for some
>>> comments and to check whether I got this all roughly right.
>>>
>>> Critique very much wanted on this, so that we can make sure hw which
>>> can't preempt (with pagefaults pending) like gfx10 has a clear path to
>>> support page faults in upstream. So pointing out anything I missed or got
>>> wrong would be good.
>>> -Daniel
>>> ---
>>>   Documentation/driver-api/dma-buf.rst | 66 ++++++++++++++++++++++++++++
>>>   1 file changed, 66 insertions(+)
>>>
>>> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
>>> index a2133d69872c..e924c1e4f7a3 100644
>>> --- a/Documentation/driver-api/dma-buf.rst
>>> +++ b/Documentation/driver-api/dma-buf.rst
>>> @@ -257,3 +257,69 @@ fences in the kernel. This means:
>>>     userspace is allowed to use userspace fencing or long running compute
>>>     workloads. This also means no implicit fencing for shared buffers in these
>>>     cases.
>>> +
>>> +Recoverable Hardware Page Faults Implications
>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> +
>>> +Modern hardware supports recoverable page faults, which has a lot of
>>> +implications for DMA fences.
>>> +
>>> +First, a pending page fault obviously holds up the work that's running on the
>>> +accelerator and a memory allocation is usually required to resolve the fault.
>>> +But memory allocations are not allowed to gate completion of DMA fences, which
>>> +means any workload using recoverable page faults cannot use DMA fences for
>>> +synchronization. Synchronization fences controlled by userspace must be used
>>> +instead.
>>> +
>>> +On GPUs this poses a problem, because current desktop compositor protocols on
>>> +Linux rely on DMA fences, which means without an entirely new userspace stack
>>> +built on top of userspace fences, they cannot benefit from recoverable page
>>> +faults. The exception is when page faults are only used as migration hints and
>>> +never to fill a memory request on demand. For now this means recoverable page
>>> +faults on GPUs are limited to pure compute workloads.
>>> +
>>> +Furthermore GPUs usually have shared resources between the 3D rendering and
>>> +compute side, like compute units or command submission engines. If both a 3D
>>> +job with a DMA fence and a compute workload using recoverable page faults are
>>> +pending they could deadlock:
>>> +
>>> +- The 3D workload might need to wait for the compute job to finish and release
>>> +  hardware resources first.
>>> +
>>> +- The compute workload might be stuck in a page fault, because the memory
>>> +  allocation is waiting for the DMA fence of the 3D workload to complete.
>>> +
>>> +There are a few ways to prevent this problem:
>>> +
>>> +- Compute workloads can always be preempted, even when a page fault is pending
>>> +  and not yet repaired. Not all hardware supports this.
>>> +
>>> +- DMA fence workloads and workloads which need page fault handling have
>>> +  independent hardware resources to guarantee forward progress. This could be
>>> +  achieved e.g. through dedicated engines and minimal compute unit
>>> +  reservations for DMA fence workloads.
>>> +
>>> +- The reservation approach could be further refined by only reserving the
>>> +  hardware resources for DMA fence workloads when they are in-flight. This must
>>> +  cover the time from when the DMA fence is visible to other threads up to
>>> +  the moment when the fence is completed through dma_fence_signal().
>>> +
>>> +- As a last resort, if the hardware provides no useful reservation mechanics,
>>> +  all workloads must be flushed from the GPU when switching between jobs
>>> +  requiring DMA fences or jobs requiring page fault handling: This means all DMA
>>> +  fences must complete before a compute job with page fault handling can be
>>> +  inserted into the scheduler queue. And vice versa, before a DMA fence can be
>>> +  made visible anywhere in the system, all compute workloads must be preempted
>>> +  to guarantee all pending GPU page faults are flushed.
>> I thought of another possible workaround:
>>
>>    * Partition the memory. Servicing of page faults will use a separate
>>      memory pool that can always be allocated from without waiting for
>>      fences. This includes memory for page tables and memory for
>>      migrating data to. You may steal memory from other processes that
>>      can page fault, so no fence waiting is necessary. Being able to
>>      steal memory at any time also means there are basically no
>>      out-of-memory situations you need to worry about. Even page tables
>>      (except the root page directory of each process) can be stolen in
>>      the worst case.
> I think 'overcommit' would be a nice way to describe this. But I'm not
> sure how easy this is to implement in practice. You would basically need
> to create your own memory manager for this.

Well, you would need a completely separate pool for device memory as well
as for system memory.

E.g. on boot we say we steal X GB system memory only for HMM.

> But from a design point of view, definitely a valid solution.

I think the restriction above makes it pretty much unusable.

> But this looks good, those solutions are definitely the valid options we
> can choose from.

It's certainly worth noting, yes. And just to make sure that nobody gets
the idea to reserve only device memory.

Christian.

>
> ~Maarten
>
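
For illustration only, the dedicated-pool idea discussed above (a pool set
aside at boot that fault servicing can always allocate from, stealing from
other faulting processes instead of waiting on a fence) could be sketched
roughly as below. This is plain userspace C, not kernel code, and every name
in it (fault_pool, fault_pool_init, fault_pool_get, POOL_PAGES, NO_OWNER) is
hypothetical and not taken from the patch or from any driver. The round-robin
victim choice is likewise just an assumption to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_PAGES 4     /* stand-in for the "X GB" set aside at boot */
#define NO_OWNER   (-1)  /* marks a page nobody is using */

/* One slot per reserved page; owner is the faulting process using it. */
struct fault_pool {
    int owner[POOL_PAGES];
    size_t next_victim;  /* round-robin cursor used when stealing */
};

/* "Boot time": carve out the pool once, with every page initially free. */
static void fault_pool_init(struct fault_pool *p)
{
    for (size_t i = 0; i < POOL_PAGES; i++)
        p->owner[i] = NO_OWNER;
    p->next_victim = 0;
}

/*
 * Get a page to service a fault of process `pid`. This never waits:
 * a free page is used if one exists, otherwise a page is taken away
 * from whichever process owns the round-robin victim slot (that
 * process can simply fault again later).
 * Returns the page index; *stolen_from is the previous owner, or
 * NO_OWNER if the page was free.
 */
static size_t fault_pool_get(struct fault_pool *p, int pid, int *stolen_from)
{
    assert(p != NULL && stolen_from != NULL);

    for (size_t i = 0; i < POOL_PAGES; i++) {
        if (p->owner[i] == NO_OWNER) {
            p->owner[i] = pid;
            *stolen_from = NO_OWNER;
            return i;
        }
    }

    /* Pool exhausted: steal instead of blocking on anything. */
    size_t victim = p->next_victim;
    p->next_victim = (victim + 1) % POOL_PAGES;
    *stolen_from = p->owner[victim];
    p->owner[victim] = pid;
    return victim;
}
```

The point of the sketch is only that fault_pool_get() has no path that
blocks, which is what keeps fault servicing independent of DMA fences. A
real implementation would need one such pool for system memory and one for
device memory, as noted above, plus the actual unmap and migration work
when a page is stolen.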