[PATCH 2/2] dmabuf/heaps: implement DMA_BUF_IOCTL_RW_FILE for system_heap
wangtao
tao.wangtao at honor.com
Thu May 22 12:29:46 UTC 2025
Apologies for interrupting the filesystem/memory experts. Because of
dmabuf's attachment/map/fence model, its mmap callback uses
remap_pfn_range, so read(file_fd, dmabuf_ptr, len) supports only buffered
I/O, not Direct I/O zero-copy. Embedded/mobile devices urgently need
dmabuf Direct I/O for large-file operations, and prior patches have
attempted this (the failing path is sketched below).
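For concreteness, a minimal userspace sketch of that failing path (the
path, size, and dmabuf allocation are illustrative placeholders, not
taken from the patch). Direct I/O must pin the destination pages via
get_user_pages(), which rejects remap_pfn_range (VM_PFNMAP) mappings,
so the O_DIRECT read below typically fails with EFAULT:

#define _GNU_SOURCE		/* O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1 << 20;	/* 1 MiB, illustrative */
	int dmabuf_fd = -1;	/* assumed: allocated from a dmabuf heap */
	int file_fd = open("/data/big.bin", O_RDONLY | O_DIRECT);

	/* dmabuf mmap callbacks use remap_pfn_range(), so this VMA is
	 * VM_PFNMAP and its pages cannot be pinned for DMA */
	void *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 dmabuf_fd, 0);

	/* with O_DIRECT this fails (typically -EFAULT); dropping
	 * O_DIRECT works, but adds a CPU copy through the page cache */
	if (read(file_fd, ptr, len) < 0)
		perror("direct read into dmabuf mapping");
	return 0;
}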
While tmpfs/shmem support Direct I/O zero-copy, dmabuf does not. My patch
adds an ioctl command for dmabuf Direct I/O zero-copy, achieving >80%
bandwidth even on low-power CPUs.
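To make the proposal concrete, here is a hedged sketch of how such an
ioctl might be driven from userspace. The struct layout, field names,
and ioctl number are assumptions for illustration only, not the exact
uapi of this patch:

#include <linux/types.h>
#include <sys/ioctl.h>

/* hypothetical uapi -- names, layout, and ioctl nr are assumed */
struct dma_buf_rw_file {
	__u32 op;		/* assumed: 0 = read file into dmabuf */
	__u32 fd;		/* regular file fd, opened with O_DIRECT */
	__u64 file_off;		/* offset within the regular file */
	__u64 buf_off;		/* offset within the dmabuf */
	__u64 len;		/* bytes to transfer */
};
#define DMA_BUF_IOCTL_RW_FILE	_IOW('b', 5, struct dma_buf_rw_file)

/* [DISK] --DMA--> [dmabuf pages]: no page cache, no CPU copy */
static int dmabuf_read_file(int dmabuf_fd, int file_fd, __u64 len)
{
	struct dma_buf_rw_file args = {
		.op = 0, .fd = file_fd, .file_off = 0, .buf_off = 0,
		.len = len,
	};
	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_RW_FILE, &args);
}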
Christian argues that udmabuf + sendfile/splice/copy_file_range could
enable zero-copy, but analysis and testing (detailed in a prior email)
show these syscalls fail for high-performance dmabuf Direct I/O; each
dead end is sketched in code after this list:
1. sendfile(dst_memfile, src_disk): requires a page cache copy:
   [DISK] --DMA--> [page cache] --CPU copy--> [MEMORY file]
2. splice: requires a pipe endpoint (incompatible with file-to-file or
   dmabuf transfers)
3. copy_file_range: cross-FS copies are prohibited
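A compact sketch of those three dead ends from userspace (fds are
assumed already open; error behaviour reflects mainline as I understand
it, return values unchecked for brevity):

#define _GNU_SOURCE		/* splice, copy_file_range */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <unistd.h>

void demo(int disk_fd, int memfd, size_t len)
{
	/* 1. sendfile(): completes, but data still crosses the CPU:
	 *    [DISK] --DMA--> [page cache] --CPU copy--> [memfd] */
	sendfile(memfd, disk_fd, NULL, len);

	/* 2. splice(): one end must be a pipe, so file-to-file moves
	 *    need two splices through a small pipe buffer, and dmabuf
	 *    has no splice support at all */
	int pipefd[2];
	pipe(pipefd);
	splice(disk_fd, NULL, pipefd[1], NULL, len, 0);
	splice(pipefd[0], NULL, memfd, NULL, len, 0);

	/* 3. copy_file_range(): fails with -EXDEV across filesystems
	 *    since the cross-FS restriction was restored */
	copy_file_range(disk_fd, NULL, memfd, NULL, len, 0);
}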
Technical question: under fs/mm layer constraints, can we, and how should
we, modify sendfile/splice/copy_file_range (or other syscalls) to achieve
efficient dmabuf Direct I/O zero-copy? The destination-side dispatch
constraint on copy_file_range is sketched below. Your insights on the
required syscall modifications would be invaluable. Thank you for your
guidance.
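For reference while weighing those options, the shape of the in-kernel
dispatch (simplified from fs/read_write.c, not a literal copy): because
the driver callback is resolved on the destination file only, a dmabuf
->copy_file_range could cover file -> dmabuf but never dmabuf -> file:

#include <linux/fs.h>
#include <linux/splice.h>

ssize_t dispatch_copy_file_range(struct file *file_in, loff_t pos_in,
				 struct file *file_out, loff_t pos_out,
				 size_t len, unsigned int flags)
{
	/* the callback comes from file_out, so a dmabuf implementation
	 * is reachable only when the dmabuf is the destination */
	if (file_out->f_op->copy_file_range)
		return file_out->f_op->copy_file_range(file_in, pos_in,
						       file_out, pos_out,
						       len, flags);

	/* otherwise the VFS falls back to splicing through the page
	 * cache -- CPU copies again */
	return splice_copy_file_range(file_in, pos_in, file_out,
				      pos_out, len);
}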
> -----Original Message-----
> From: Christian König <christian.koenig at amd.com>
> Sent: Thursday, May 22, 2025 7:58 PM
> To: wangtao <tao.wangtao at honor.com>; T.J. Mercier
> <tjmercier at google.com>
> Cc: sumit.semwal at linaro.org; benjamin.gaignard at collabora.com;
> Brian.Starkey at arm.com; jstultz at google.com; linux-media at vger.kernel.org;
> dri-devel at lists.freedesktop.org; linaro-mm-sig at lists.linaro.org;
> linux-kernel at vger.kernel.org; wangbintian(BintianWang)
> <bintian.wang at honor.com>; yipengxiang <yipengxiang at honor.com>; liulu
> 00013167 <liulu.liu at honor.com>; hanfeng 00012985 <feng.han at honor.com>;
> amir73il at gmail.com
> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
> DMA_BUF_IOCTL_RW_FILE for system_heap
>
> On 5/22/25 10:02, wangtao wrote:
> >> -----Original Message-----
> >> From: Christian König <christian.koenig at amd.com>
> >> Sent: Wednesday, May 21, 2025 7:57 PM
> >> To: wangtao <tao.wangtao at honor.com>; T.J. Mercier
> >> <tjmercier at google.com>
> >> Cc: sumit.semwal at linaro.org; benjamin.gaignard at collabora.com;
> >> Brian.Starkey at arm.com; jstultz at google.com;
> >> linux-media at vger.kernel.org; dri-devel at lists.freedesktop.org;
> >> linaro-mm-sig at lists.linaro.org; linux-kernel at vger.kernel.org;
> >> wangbintian(BintianWang) <bintian.wang at honor.com>; yipengxiang
> >> <yipengxiang at honor.com>; liulu
> >> 00013167 <liulu.liu at honor.com>; hanfeng 00012985
> >> <feng.han at honor.com>; amir73il at gmail.com
> >> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
> >> DMA_BUF_IOCTL_RW_FILE for system_heap
> >>
> >> On 5/21/25 12:25, wangtao wrote:
> >>> [wangtao] I previously explained that
> >>> read/sendfile/splice/copy_file_range
> >>> syscalls can't achieve dmabuf direct IO zero-copy.
> >>
> >> And why can't you work on improving those syscalls instead of
> >> creating a new IOCTL?
> >>
> > [wangtao] As I mentioned in previous emails, these syscalls cannot
> > achieve dmabuf zero-copy due to technical constraints.
>
> Yeah, and why can't you work on removing those technical constraints?
>
> What is blocking you from improving the sendfile system call or proposing a
> patch to remove the copy_file_range restrictions?
>
> Regards,
> Christian.
>
> > Could you specify the technical points, code, or principles that need
> > optimization?
> >
> > Let me explain again why these syscalls can't work:
> > 1. read() syscall
> >    - dmabuf fops lacks a read callback; even if one were implemented,
> >      read() has no way to carry the source file_fd
> >    - read(file_fd, dmabuf_ptr, len) against a remap_pfn_range-based
> >      mmap cannot access the dmabuf pages, forcing buffered-mode reads
> >
> > 2. sendfile() syscall
> >    - Requires a CPU copy from the page cache to the memory file (tmpfs/shmem):
> > [DISK] --DMA--> [page cache] --CPU copy--> [MEMORY file]
> > - CPU overhead (both buffer/direct modes involve copies):
> > 55.08% do_sendfile
> > |- 55.08% do_splice_direct
> > |-|- 55.08% splice_direct_to_actor
> > |-|-|- 22.51% copy_splice_read
> > |-|-|-|- 16.57% f2fs_file_read_iter
> > |-|-|-|-|- 15.12% __iomap_dio_rw
> > |-|-|- 32.33% direct_splice_actor
> > |-|-|-|- 32.11% iter_file_splice_write
> > |-|-|-|-|- 28.42% vfs_iter_write
> > |-|-|-|-|-|- 28.42% do_iter_write
> > |-|-|-|-|-|-|- 28.39% shmem_file_write_iter
> > |-|-|-|-|-|-|-|- 24.62% generic_perform_write
> > |-|-|-|-|-|-|-|-|- 18.75% __pi_memmove
> >
> > 3. splice() requires one end to be a pipe, incompatible with regular
> >    files or dmabuf.
> >
> > 4. copy_file_range()
> >    - Blocked by cross-FS restrictions (Amir's commit 868f9f2f8e00)
> >    - Even without that restriction, implementing the copy_file_range
> >      callback in dmabuf fops would only allow dmabuf reads from
> >      regular files, because copy_file_range dispatches through
> >      file_out->f_op->copy_file_range and therefore cannot support
> >      dmabuf writes to regular files.
> >
> > Test results confirm these limitations. T.J. Mercier's numbers, 1G
> > from ext4 on 6.12.20, caches dropped via echo 3 > drop_caches:
> > Method                  | read/sendfile (ms)
> > ------------------------|-------------------
> > udmabuf buffer read     | 1210
> > udmabuf direct read     |  671
> > udmabuf buffer sendfile | 1096
> > udmabuf direct sendfile | 2340
> >
> > My 3GHz CPU tests (cache cleared); vs. (%) is read time relative to
> > udmabuf buffer sendfile = 100%:
> > Method                  | alloc (ms) | read (ms) | vs. (%)
> > ------------------------|------------|-----------|--------
> > udmabuf buffer read     |        135 |       546 |    180%
> > udmabuf direct read     |        159 |       300 |     99%
> > udmabuf buffer sendfile |        134 |       303 |    100%
> > udmabuf direct sendfile |        141 |       912 |    301%
> > dmabuf buffer read      |         22 |       362 |    119%
> > my patch direct read    |         29 |       265 |     87%
> >
> > My 1GHz CPU tests (cache cleared):
> > Method                  | alloc (ms) | read (ms) | vs. (%)
> > ------------------------|------------|-----------|--------
> > udmabuf buffer read     |        552 |      2067 |    198%
> > udmabuf direct read     |        540 |       627 |     60%
> > udmabuf buffer sendfile |        497 |      1045 |    100%
> > udmabuf direct sendfile |        527 |      2330 |    223%
> > dmabuf buffer read      |         40 |      1111 |    106%
> > patch direct read       |         44 |       310 |     30%
> >
> > Test observations align with expectations:
> > 1. dmabuf buffer read requires slow CPU copies.
> > 2. udmabuf direct read achieves zero-copy but pays page-retrieval
> >    latency from vaddr.
> > 3. udmabuf buffer sendfile suffers CPU copy overhead.
> > 4. udmabuf direct sendfile combines CPU copies with frequent DMA
> >    operations due to small pipe buffers.
> > 5. dmabuf buffer read also requires CPU copies.
> > 6. My direct read patch enables zero-copy with better performance on
> >    low-power CPUs.
> > 7. udmabuf creation time remains problematic (as you've noted).
> >
> >>> My focus is enabling dmabuf direct I/O for [regular file] <--DMA-->
> >>> [dmabuf] zero-copy.
> >>
> >> Yeah and that focus is wrong. You need to work on a general solution
> >> to the issue and not specific to your problem.
> >>
> >>> Any API achieving this would work. Are there other uAPIs you think
> >>> could help? Could you recommend experts who might offer suggestions?
> >>
> >> Well once more: Either work on sendfile or copy_file_range or
> >> eventually splice to make it do what you want.
> >>
> >> When that is done we can discuss with the VFS people if that approach
> >> is feasible.
> >>
> >> But just bypassing the VFS review by implementing a DMA-buf specific
> >> IOCTL is a NO-GO. That is clearly not something you can do in any way.
> > [wangtao] The issue is that only dmabuf lacks Direct I/O zero-copy
> > support. Tmpfs/shmem already work with Direct I/O zero-copy. As
> > explained, existing syscalls or generic methods can't enable dmabuf
> > direct I/O zero-copy, which is why I propose adding an IOCTL command.
> >
> > I respect your perspective. Could you clarify specific technical
> > aspects, code requirements, or implementation principles for modifying
> > sendfile() or copy_file_range()? This would help advance our discussion.
> >
> > Thank you for engaging in this dialogue.
> >
> >>
> >> Regards,
> >> Christian.