[PATCH 0/4] cover-letter: Allow MMIO regions to be exported through dmabuf

Christian König christian.koenig at amd.com
Thu Dec 19 10:04:54 UTC 2024


Am 19.12.24 um 08:02 schrieb Kasireddy, Vivek:
> Hi Christian,
>
>> Subject: Re: [PATCH 0/4] cover-letter: Allow MMIO regions to be exported
>> through dmabuf
>>
>>>>> 	I will resend the patch series. I was experiencing issues with my email
>>>>> client, which inadvertently split the series into two separate emails.
>>>> Alternatively I can also copy the code from the list archive and explain why
>>>> that doesn't work:
>>>>
>>>> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
>>>> +{
>>>> +    struct vfio_pci_dma_buf *priv;
>>>> +    struct vfio_pci_dma_buf *tmp;
>>>> +
>>>> +    lockdep_assert_held_write(&vdev->memory_lock);
>>>>
>>>> This makes sure that the caller is holding vdev->memory_lock.
>>>>
>>>> +
>>>> +    list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
>>>> +        if (!dma_buf_try_get(priv->dmabuf))
>>>>
>>>> This here only works because vfio_pci_dma_buf_release() also grabs
>>>> vdev->memory_lock and so we are protected against concurrent releases.
>>>> +            continue;
>>>> +        if (priv->revoked != revoked) {
>>>> +            dma_resv_lock(priv->dmabuf->resv, NULL);
>>>> +            priv->revoked = revoked;
>>>> +            dma_buf_move_notify(priv->dmabuf);
>>>> +            dma_resv_unlock(priv->dmabuf->resv);
>>>> +        }
>>>> +        dma_buf_put(priv->dmabuf);
>>>>
>>>> The problem is now that this here might drop the last reference, which in
>>>> turn calls vfio_pci_dma_buf_release(), which also tries to grab
>>>> vdev->memory_lock and so results in a deadlock.
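>>>>
>>>> A hypothetical sketch of what such a release handler has to look like
>>>> (field names assumed from the snippet above, not necessarily the exact
>>>> code from the posted series), which makes the lock cycle visible:
>>>>
>>>> /* Hypothetical sketch only, not the code from the posted series. */
>>>> static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
>>>> {
>>>>     struct vfio_pci_dma_buf *priv = dmabuf->priv;
>>>>
>>>>     /*
>>>>      * Unlinking from vdev->dmabufs needs vdev->memory_lock.  If this
>>>>      * ran synchronously from the dma_buf_put() above, which is reached
>>>>      * with memory_lock already held for write, it would deadlock.
>>>>      */
>>>>     down_write(&priv->vdev->memory_lock);
>>>>     list_del_init(&priv->dmabufs_elm);
>>>>     up_write(&priv->vdev->memory_lock);
>>>>
>>>>     kfree(priv);
>>>> }
>>>>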
>>> AFAICS, vfio_pci_dma_buf_release() would not be called synchronously
>>> after the last reference is dropped by dma_buf_put(). This is because
>>> fput(), which is called by dma_buf_put(), triggers f_op->release()
>>> asynchronously; therefore, a deadlock is unlikely to occur in this
>>> scenario, unless I am overlooking something.
>> My recollection is that the f_op->release handler is only called
>> asynchronously if fput() was issued in interrupt context.
> Here is the code of fput() from the current master:
> void fput(struct file *file)
> {
>          if (file_ref_put(&file->f_ref)) {
>                  struct task_struct *task = current;
>
>                  if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
>                          file_free(file);
>                          return;
>                  }
>                  if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
>                          init_task_work(&file->f_task_work, ____fput);
>                          if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
>                                  return;
>                          /*
>                           * After this task has run exit_task_work(),
>                           * task_work_add() will fail.  Fall through to delayed
>                           * fput to avoid leaking *file.
>                           */
>                  }
>
>                  if (llist_add(&file->f_llist, &delayed_fput_list))
>                          schedule_delayed_work(&delayed_fput_work, 1);
>          }
> }
>
> IIUC, based on the above code, it looks like there are two ways in which the
> f_op->release() handler is triggered from fput():
> - via delayed_fput() for kernel threads and code in interrupt context
> - via task_work_run() just before the task/process returns to user mode
>
> The first scenario above is definitely asynchronous, as the release() handler is
> called from a worker thread. But I think the second case (which is the most
> common and the relevant one for our use-case) can also be considered
> asynchronous, because by the time task_work_run() executes, the system call
> or ioctl in whose context dma_buf_put() was called has already completed,
> so any locks it took (such as vdev->memory_lock) have been dropped.
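>
> For reference, dma_buf_put() itself is just a thin wrapper around fput()
> (roughly, from drivers/dma-buf/dma-buf.c), so the deferral behaviour above
> applies to it directly:
>
> void dma_buf_put(struct dma_buf *dmabuf)
> {
>         if (WARN_ON(!dmabuf || !dmabuf->file))
>                 return;
>
>         fput(dmabuf->file);
> }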
>
> Here is a trace from my light testing with the udmabuf driver, where you can
> see the release() handler being called by syscall_exit_to_user_mode():
> [  158.464203] Call Trace:
> [  158.466681]  <TASK>
> [  158.468815]  dump_stack_lvl+0x60/0x80
> [  158.472507]  dump_stack+0x14/0x16
> [  158.475853]  release_udmabuf+0x2f/0x9f
> [  158.479631]  dma_buf_release+0x3c/0x90
> [  158.483408]  __dentry_kill+0x8f/0x180
> [  158.487098]  dput+0xe7/0x1a0
> [  158.490013]  __fput+0x131/0x2b0
> [  158.493178]  ____fput+0x19/0x20
> [  158.496352]  task_work_run+0x61/0x90
> [  158.499959]  syscall_exit_to_user_mode+0x1a4/0x1b0
> [  158.504769]  do_syscall_64+0x5b/0x110
> [  158.508458]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> And, here is the relevant syscall code (from arch/x86/entry/common.c):
> __visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
> {
>          add_random_kstack_offset();
>          nr = syscall_enter_from_user_mode(regs, nr);
>
>          instrumentation_begin();
>
>          if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
>                  /* Invalid system call, but still a system call. */
>                  regs->ax = __x64_sys_ni_syscall(regs);
>          }
>
>          instrumentation_end();
>          syscall_exit_to_user_mode(regs);
>
> I also confirmed that the release() handler is indeed called after dma_buf_put()
> (and not by dma_buf_put()) by adding debug prints before and after
> dma_buf_put() and one in the release() handler. Furthermore, I found
> that calling close() on the dmabuf fd in user-space is one scenario in
> which fput() is called synchronously. Here is the relevant trace:
> [  302.770910] Call Trace:
> [  302.773389]  <TASK>
> [  302.775516]  dump_stack_lvl+0x60/0x80
> [  302.779209]  dump_stack+0x14/0x16
> [  302.782549]  release_udmabuf+0x2f/0x9f
> [  302.786329]  dma_buf_release+0x3c/0x90
> [  302.790105]  __dentry_kill+0x8f/0x180
> [  302.793789]  dput+0xe7/0x1a0
> [  302.796703]  __fput+0x131/0x2b0
> [  302.799866]  __fput_sync+0x53/0x70
> [  302.803299]  __x64_sys_close+0x58/0xc0
> [  302.807076]  x64_sys_call+0x126a/0x17d0
> [  302.810938]  do_syscall_64+0x4f/0x110
> [  302.814622]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> As you can see above, there is indeed a synchronous version of fput() defined
> just below fput():
> /*
>   * synchronous analog of fput(); for kernel threads that might be needed
>   * in some umount() (and thus can't use flush_delayed_fput() without
>   * risking deadlocks), need to wait for completion of __fput() and know
>   * for this specific struct file it won't involve anything that would
>   * need them.  Use only if you really need it - at the very least,
>   * don't blindly convert fput() by kernel thread to that.
>   */
> void __fput_sync(struct file *file)
> {
> 	if (file_ref_put(&file->f_ref))
> 		__fput(file);
> }
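>
> As a sanity check of the general pattern, independent of VFIO, here is a
> minimal hypothetical sketch (illustration only, not from any posted series)
> of dropping a file reference while holding a lock that the file's
> f_op->release() callback also takes. With plain fput() in process context
> the release is deferred via task_work and runs only after the lock below
> has been dropped; replacing it with __fput_sync() would deadlock:
>
> #include <linux/fs.h>
> #include <linux/file.h>
> #include <linux/mutex.h>
>
> static DEFINE_MUTEX(demo_lock);
>
> /* Plays the role of vfio_pci_dma_buf_release(): also needs demo_lock.
>  * Would be installed as the .release hook of some file_operations
>  * (wiring omitted in this sketch). */
> static int demo_release(struct inode *inode, struct file *file)
> {
>         mutex_lock(&demo_lock);
>         /* ... unlink bookkeeping ... */
>         mutex_unlock(&demo_lock);
>         return 0;
> }
>
> /* Plays the role of vfio_pci_dma_buf_move(): drops a reference under the lock. */
> static void demo_drop_ref(struct file *file)
> {
>         mutex_lock(&demo_lock);
>         /*
>          * If this is the last reference, fput() queues ____fput() via
>          * task_work_add(), so demo_release() only runs once the task is
>          * returning to user mode - well after the unlock below.
>          */
>         fput(file);
>         mutex_unlock(&demo_lock);
> }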
>
> Based on all the above, I think we can conclude that since dma_buf_put()
> does not directly (or synchronously) call the f_op->release() handler, a
> deadlock is unlikely to occur in the scenario you described.

Yeah, I agree.

Interesting to know; I wasn't aware that the task_work functionality
exists for that use case.

Thanks,
Christian.

>
> Thanks,
> Vivek
>
>> But it could be that this information is outdated.
>>
>> Regards,
>> Christian.
>>
>>> Thanks,
>>> Vivek
>>>
>>>> +    }
>>>> +}
>>>>
>>>> This pattern was suggested multiple times and I had to reject it every
>>>> time because the whole idea is just fundamentally broken.
>>>>
>>>> It's really astonishing how people always come up with the same broken
>>>> pattern.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>
>>>> 		Apart from that, I have to reject the addition of
>>>> dma_buf_try_get(); that is clearly not something we should do.
>>>>
>>>>
>>>> 	Understood. It appears that Vivek has confirmed that his v2 has
>>>> resolved the issue. I will follow up with him to determine if he plans to
>>>> resume his patch, and if so, I will apply my last patch on top of his
>>>> updated patch series.
>>>>
>>>> 	Thanks,
>>>> 	Wei Lin
>>>>
>>>>
>>>> 		Thanks,
>>>> 		Christian.
>>>>
>>>>
>>>>
>>>>


