[PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

Wed Jan 5 16:16:26 UTC 2022

On 2022-01-05 at 3:08 a.m., Christian König wrote:
> On 04.01.22 at 19:08, Felix Kuehling wrote:
>> [+Adrian]
>>
>> On 2021-12-23 at 2:05 a.m., Christian König wrote:
>>
>>> On 22.12.21 at 21:53, Daniel Vetter wrote:
>>>> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
>>>>
>>>> [SNIP]
>>>> Still sounds funky. I think minimally we should have an ack from CRIU
>>>> developers that this is officially the right way to solve this problem. I
>>>> really don't want to have random one-off hacks that don't work across the
>>>> board, for a problem where we (drm subsystem) really shouldn't be the only
>>>> one with this problem. Where "this problem" means that the mmap space is
>>>> per file description, and not per underlying inode or real device or
>>>> whatever. That part sounds like a CRIU problem, and I expect CRIU folks
>>>> want a consistent solution across the board for this. Hence please grab an
>>>> ack from them.
>>> Unfortunately it's a KFD design problem. AMD used a single device
>>> node, then mmapped different objects from the same offset to different
>>> processes and expected it to work with the rest of the fs subsystem
>>> without churn.
>> This may be true for mmaps in the KFD device, but not for mmaps in the
>> DRM render nodes.
>
> Correct, yes.
>
>>> So yes, this is indeed because the mmap space is per file descriptor
>>> for the use case here.
>> No. This is a different problem.
>
> I was already wondering which mmaps through the KFD node we have left
> which cause problems here.

We still use the KFD FD for mapping doorbells and HDP flushing. These
are both SG BOs, so they cannot be CPU-mapped through render nodes. The
KFD FD is also used for mapping signal pages and CWSR trap handlers on
old APUs.

Those VMAs aren't causing the problem. They still map successfully on
restore.

>
>> The problem has to do with the way that DRM manages mmap permissions. In
>> order to be able to mmap an offset in the render node, there needs to be
>> a BO that was created in the same render node. If you fork a process, it
>> inherits the VMA.
>
> Yeah, so far it works as designed.
>
>> But KFD doesn't know anything about the inherited BOs
>> from the parent process.
>
> OK, why is that? When the KFD is reinitializing its context, why
> shouldn't it clean up those VMAs?

That cleanup has to be initiated by user mode: it closes the old KFD and
DRM file descriptors, cleans up all the user mode VM state, unmaps all
the VMAs, etc. Then it reopens KFD and the render nodes and starts from
scratch.

User mode will do this automatically when it tries to reinitialize ROCm.
However, in this case the child process doesn't do that (e.g. a Python
application using the multiprocessing package). The child process does
not use ROCm, but you're left with all the dangling VMAs in the child
process indefinitely.

Regards,
  Felix

>
>> Therefore those BOs don't get checkpointed and
>> restored in the child process. When the CRIU checkpoint is restored, our
>> CRIU plugin never creates a BO corresponding to the VMA in the child
>> process's render node FD. We've also lost the relationship between the
>> parent's and child process's render node FDs. After "fork" the render node
>> FD points to the same struct file in parent and child. After restoring
>> the CRIU checkpoint, they are separate struct files, created by separate
>> "open" system calls. Therefore the mmap call that restores the VMA fails
>> in the child process.
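
The difference between an FD inherited across "fork" and an FD from a separate "open" can be observed from user space. The following is only a minimal sketch using an ordinary temporary file rather than a render node; the function names and the /tmp path template are made up for illustration. It relies on the fact that a file offset lives in the open file description (the kernel's struct file), so it is shared after fork() but not between two independent open() calls.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a seek done by a forked child on an
 * inherited fd is visible in the parent, 0 if not, -1 on error. */
static int fork_shares_file_offset(void)
{
    char path[] = "/tmp/ofd-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file stays alive while fd is open */

    if (write(fd, "0123456789", 10) != 10 || lseek(fd, 0, SEEK_SET) != 0) {
        close(fd);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) /* child: move the offset of the inherited description */
        _exit(lseek(fd, 7, SEEK_SET) == 7 ? 0 : 1);

    int status;
    if (waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        close(fd);
        return -1;
    }

    off_t pos = lseek(fd, 0, SEEK_CUR); /* parent reads the current offset */
    close(fd);
    return pos == 7;
}

/* Hypothetical helper: returns 1 if two separate open() calls on the same
 * file share an offset, 0 if not, -1 on error. */
static int separate_opens_share_file_offset(void)
{
    char path[] = "/tmp/ofd-demo-XXXXXX";
    int fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;
    int fd2 = open(path, O_RDONLY); /* second, independent description */
    unlink(path);
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }
    assert(fd1 != fd2);

    int rc = -1;
    if (write(fd1, "0123456789", 10) == 10 &&
        lseek(fd1, 0, SEEK_SET) == 0 &&
        lseek(fd2, 7, SEEK_SET) == 7)
        rc = (lseek(fd1, 0, SEEK_CUR) == 7);

    close(fd1);
    close(fd2);
    return rc;
}
```

Calling fork_shares_file_offset() and then separate_opens_share_file_offset() from a small main() shows the two cases side by side. The second case is analogous to what the restored processes end up with: two unrelated descriptions of the same node.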
>>
>> At least for KFD, there is no point in a child process inheriting BOs,
>> because the GPU has no way of accessing the BOs in the child process.
>> The child process has no GPU address space, no user mode queues, no way
>> to do anything with the GPU before it completely reinitializes its KFD
>> context.
>>
>> We can work around this issue in user mode with madvise(...,
>> MADV_DONTFORK). In fact we've already done this for some BOs to avoid a
>> memory leak in the parent process while a child process exists. But it's
>> slightly racy because there is a short time window where the VMA exists
>> without the VM_DONTCOPY flag. A fork during that time window could still
>> create a child process with an inherited VMA.
>>
>> Therefore a safer solution is to set the vm_flags in the VMA in the
>> driver when the VMA is first created.
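
The user mode workaround quoted above can be sketched roughly as follows. This is an illustration only, not the ROCm runtime code: it uses an anonymous mapping in place of a BO mapping from a render node, and child_inherits_mapping() is a made-up helper name. The comment marks the window in which a concurrent fork() would still copy the VMA, which is the race that setting the flag in the driver at mmap time avoids.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a forked child still has the mapping,
 * 0 if the mapping is absent in the child, -1 on error. */
static int child_inherits_mapping(int use_dontfork)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;
    unsigned char vec;

    /* Stand-in for a BO mapping; a real one would mmap a render node fd. */
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        return -1;

    /* Race window: the VMA already exists here without VM_DONTCOPY, so a
     * fork() from another thread at this point would still inherit it. */
    if (use_dontfork && madvise(addr, len, MADV_DONTFORK) != 0) {
        munmap(addr, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(addr, len);
        return -1;
    }
    if (pid == 0) {
        /* mincore() fails with ENOMEM if the range is not mapped. */
        int rc = mincore(addr, len, &vec);
        _exit(rc == 0 ? 1 : (errno == ENOMEM ? 0 : 2));
    }

    int status;
    int ok = waitpid(pid, &status, 0) >= 0 && WIFEXITED(status);
    munmap(addr, len);
    if (!ok || WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status);
}
```

Calling child_inherits_mapping(0) and child_inherits_mapping(1) from a small main() compares the default behaviour with the MADV_DONTFORK case. In a single-threaded test the window is harmless; in a multi-threaded runtime it is not.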
>
> Thanks for the full explanation, it makes much more sense now.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> And thanks for pointing this out, this indeed makes the whole change
>>> extremely questionable.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Cheers, Daniel
>>>>
>