[PATCH 2/2] dmabuf/heaps: implement DMA_BUF_IOCTL_RW_FILE for system_heap
wangtao
tao.wangtao at honor.com
Thu May 22 08:02:06 UTC 2025
> -----Original Message-----
> From: Christian König <christian.koenig at amd.com>
> Sent: Wednesday, May 21, 2025 7:57 PM
> To: wangtao <tao.wangtao at honor.com>; T.J. Mercier
> <tjmercier at google.com>
> Cc: sumit.semwal at linaro.org; benjamin.gaignard at collabora.com;
> Brian.Starkey at arm.com; jstultz at google.com; linux-media at vger.kernel.org;
> dri-devel at lists.freedesktop.org; linaro-mm-sig at lists.linaro.org; linux-
> kernel at vger.kernel.org; wangbintian(BintianWang)
> <bintian.wang at honor.com>; yipengxiang <yipengxiang at honor.com>; liulu
> 00013167 <liulu.liu at honor.com>; hanfeng 00012985 <feng.han at honor.com>;
> amir73il at gmail.com
> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
> DMA_BUF_IOCTL_RW_FILE for system_heap
>
> On 5/21/25 12:25, wangtao wrote:
> > [wangtao] I previously explained that
> > read/sendfile/splice/copy_file_range
> > syscalls can't achieve dmabuf direct IO zero-copy.
>
> And why can't you work on improving those syscalls instead of creating a new
> IOCTL?
>
[wangtao] As I mentioned in previous emails, these syscalls cannot
achieve dmabuf zero-copy because of technical constraints. Could you
point to the specific code paths, interfaces, or design principles you
think should be improved instead?
Let me explain again why each of these syscalls cannot work (see the
userspace sketch after this list):
1. read() syscall
 - dmabuf fops has no read callback. Even if one were implemented,
 there is no way to pass the source file_fd through read().
 - read(file_fd, dmabuf_ptr, len) on a remap_pfn_range-based dmabuf
 mmap cannot pin the dmabuf pages (get_user_pages() fails on such
 mappings), forcing buffer-mode reads.
2. sendfile() syscall
 - Requires a CPU copy from the page cache into the memory file
 (tmpfs/shmem):
 [DISK] --DMA--> [page cache] --CPU copy--> [MEMORY file]
 - CPU overhead (both buffered and direct modes involve copies):
55.08% do_sendfile
|- 55.08% do_splice_direct
|-|- 55.08% splice_direct_to_actor
|-|-|- 22.51% copy_splice_read
|-|-|-|- 16.57% f2fs_file_read_iter
|-|-|-|-|- 15.12% __iomap_dio_rw
|-|-|- 32.33% direct_splice_actor
|-|-|-|- 32.11% iter_file_splice_write
|-|-|-|-|- 28.42% vfs_iter_write
|-|-|-|-|-|- 28.42% do_iter_write
|-|-|-|-|-|-|- 28.39% shmem_file_write_iter
|-|-|-|-|-|-|-|- 24.62% generic_perform_write
|-|-|-|-|-|-|-|-|- 18.75% __pi_memmove
3. splice() requires one end to be a pipe, so it cannot connect a
regular file directly to a dmabuf.
4. copy_file_range()
 - Blocked by the cross-FS restriction (Amir's commit 868f9f2f8e00)
 - Even without that restriction, implementing the copy_file_range
 callback in dmabuf fops would only allow reading from a regular file
 into a dmabuf: copy_file_range() dispatches through
 file_out->f_op->copy_file_range, so writing from a dmabuf back to a
 regular file cannot be supported this way.
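To make the read()/copy_file_range() limitations above concrete, here
is a minimal userspace sketch (error handling trimmed). The source
path /data/src_file is a placeholder; the allocation is assumed to go
through the standard /dev/dma_heap/system + DMA_HEAP_IOCTL_ALLOC UAPI.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/dma-heap.h>

int main(void)
{
	size_t len = 1UL << 20;

	/* Allocate a dmabuf from the system heap. */
	int heap = open("/dev/dma_heap/system", O_RDONLY);
	struct dma_heap_allocation_data alloc = {
		.len = len,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	ioctl(heap, DMA_HEAP_IOCTL_ALLOC, &alloc);
	int dmabuf_fd = alloc.fd;

	/* 1. read() into the dmabuf mapping with O_DIRECT: the mapping
	 * is remap_pfn_range-based, so the direct-I/O path cannot pin
	 * its pages and the read degrades to a buffered copy (or
	 * fails); no zero-copy. */
	void *dmabuf_ptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_SHARED, dmabuf_fd, 0);
	int file_fd = open("/data/src_file", O_RDONLY | O_DIRECT);
	ssize_t r = read(file_fd, dmabuf_ptr, len);

	/* 2. copy_file_range() from the regular file into the dmabuf:
	 * rejected by the cross-FS restriction / missing f_op callback,
	 * and the reverse direction cannot work at all because dispatch
	 * goes through file_out->f_op->copy_file_range. */
	lseek(file_fd, 0, SEEK_SET);
	ssize_t c = copy_file_range(file_fd, NULL, dmabuf_fd, NULL, len, 0);

	printf("read: %zd, copy_file_range: %zd, errno: %s\n",
	       r, c, strerror(errno));
	return 0;
}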
Test results confirm these limitations:
T.J. Mercier's results (1G from ext4 on 6.12.20, caches dropped via
"3 > drop_caches" before each run):
Method                  | read/sendfile (ms)
------------------------|-------------------
udmabuf buffer read     | 1210
udmabuf direct read     |  671
udmabuf buffer sendfile | 1096
udmabuf direct sendfile | 2340
My 3GHz CPU tests (cache cleared):
Method                  | alloc (ms) | read (ms) | vs. buffer sendfile
------------------------|------------|-----------|--------------------
udmabuf buffer read     |        135 |       546 | 180%
udmabuf direct read     |        159 |       300 |  99%
udmabuf buffer sendfile |        134 |       303 | 100%
udmabuf direct sendfile |        141 |       912 | 301%
dmabuf buffer read      |         22 |       362 | 119%
my patch direct read    |         29 |       265 |  87%
My 1GHz CPU tests (cache cleared):
Method                  | alloc (ms) | read (ms) | vs. buffer sendfile
------------------------|------------|-----------|--------------------
udmabuf buffer read     |        552 |      2067 | 198%
udmabuf direct read     |        540 |       627 |  60%
udmabuf buffer sendfile |        497 |      1045 | 100%
udmabuf direct sendfile |        527 |      2330 | 223%
dmabuf buffer read      |         40 |      1111 | 106%
patch direct read       |         44 |       310 |  30%
Test observations align with expectations:
1. udmabuf buffer read requires slow CPU copies
2. udmabuf direct read achieves zero-copy, but pays extra latency to
 look up the pages behind the user vaddr
3. udmabuf buffer sendfile suffers CPU copy overhead
4. udmabuf direct sendfile combines CPU copies with frequent DMA
 operations due to small pipe buffers
5. dmabuf buffer read also requires CPU copies
6. My direct read patch enables zero-copy and performs better on
 low-power CPUs (a usage sketch follows this list)
7. udmabuf creation time remains problematic (as you’ve noted)
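For reference, this is roughly how the proposed interface is meant to
be driven from userspace. The struct name, field names, and request
number below are illustrative placeholders, not the actual UAPI
definition from the patch.

/* Illustrative only: dma_buf_rw_file_example and
 * DMA_BUF_IOCTL_RW_FILE_EXAMPLE are placeholders for the interface
 * proposed in this series, not the real UAPI from the patch. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/types.h>

struct dma_buf_rw_file_example {	/* hypothetical layout */
	__u32 fd;		/* backing file, opened with O_DIRECT */
	__u32 flags;		/* e.g. read vs. write direction */
	__u64 file_offset;
	__u64 buf_offset;
	__u64 len;
};

/* Hypothetical request number for this example. */
#define DMA_BUF_IOCTL_RW_FILE_EXAMPLE \
	_IOWR('b', 0x10, struct dma_buf_rw_file_example)

/* Read 'len' bytes from 'path' directly into 'dmabuf_fd' at offset 0. */
static int dmabuf_direct_read(int dmabuf_fd, const char *path, __u64 len)
{
	int file_fd = open(path, O_RDONLY | O_DIRECT);
	struct dma_buf_rw_file_example rw = {
		.fd = file_fd,
		.flags = 0,		/* 0 = read into the dmabuf (assumed) */
		.file_offset = 0,
		.buf_offset = 0,
		.len = len,
	};
	/* The heap exporter drives DMA straight between the storage
	 * device and the dmabuf pages: no CPU copy, no page cache. */
	int ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_RW_FILE_EXAMPLE, &rw);

	close(file_fd);
	return ret;
}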
> > My focus is enabling dmabuf direct I/O for [regular file] <--DMA-->
> > [dmabuf] zero-copy.
>
> Yeah and that focus is wrong. You need to work on a general solution to the
> issue and not specific to your problem.
>
> > Any API achieving this would work. Are there other uAPIs you think
> > could help? Could you recommend experts who might offer suggestions?
>
> Well once more: Either work on sendfile or copy_file_range or eventually
> splice to make it what you want to do.
>
> When that is done we can discuss with the VFS people if that approach is
> feasible.
>
> But just bypassing the VFS review by implementing a DMA-buf specific IOCTL
> is a NO-GO. That is clearly not something you can do in any way.
[wangtao] The issue is that dmabuf is the only memory type still
lacking direct-I/O zero-copy support; tmpfs/shmem already have it (see
the sketch below). As explained, the existing syscalls and generic
paths cannot enable dmabuf direct-I/O zero-copy, which is why I
propose adding an IOCTL command.
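For contrast, a minimal sketch of the path that already works today;
this is essentially what the "udmabuf direct read" numbers above
measure. The source path is a placeholder, and len is assumed to be a
multiple of the block size as O_DIRECT requires.

/* Direct I/O into shmem-backed memory already works, because the
 * direct-I/O path can pin shmem pages with get_user_pages().
 * A dmabuf mapping cannot take this path today. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static ssize_t shmem_direct_read(const char *path, size_t len)
{
	int memfd = memfd_create("staging", 0);
	ftruncate(memfd, len);

	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED, memfd, 0);

	int file_fd = open(path, O_RDONLY | O_DIRECT);
	/* DMA lands directly in the shmem pages: no page-cache bounce,
	 * no CPU copy. */
	ssize_t n = read(file_fd, buf, len);

	close(file_fd);
	munmap(buf, len);
	close(memfd);
	return n;
}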
I respect your perspective. Could you clarify specific technical aspects,
code requirements, or implementation principles for modifying sendfile()
or copy_file_range()? This would help advance our discussion.
Thank you for engaging in this dialogue.
>
> Regards,
> Christian.