[PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range

Christian König christian.koenig at amd.com
Tue Jun 10 10:52:18 UTC 2025


On 6/9/25 11:32, wangtao wrote:
>> -----Original Message-----
>> From: Christoph Hellwig <hch at infradead.org>
>> Sent: Monday, June 9, 2025 12:35 PM
>> To: Christian König <christian.koenig at amd.com>
>> Cc: wangtao <tao.wangtao at honor.com>; Christoph Hellwig
>> <hch at infradead.org>; sumit.semwal at linaro.org; kraxel at redhat.com;
>> vivek.kasireddy at intel.com; viro at zeniv.linux.org.uk; brauner at kernel.org;
>> hughd at google.com; akpm at linux-foundation.org; amir73il at gmail.com;
>> benjamin.gaignard at collabora.com; Brian.Starkey at arm.com;
>> jstultz at google.com; tjmercier at google.com; jack at suse.cz;
>> baolin.wang at linux.alibaba.com; linux-media at vger.kernel.org; dri-
>> devel at lists.freedesktop.org; linaro-mm-sig at lists.linaro.org; linux-
>> kernel at vger.kernel.org; linux-fsdevel at vger.kernel.org; linux-
>> mm at kvack.org; wangbintian(BintianWang) <bintian.wang at honor.com>;
>> yipengxiang <yipengxiang at honor.com>; liulu 00013167
>> <liulu.liu at honor.com>; hanfeng 00012985 <feng.han at honor.com>
>> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via
>> copy_file_range
>>
>> On Fri, Jun 06, 2025 at 01:20:48PM +0200, Christian König wrote:
>>>> dmabuf acts as a driver and shouldn't be handled by VFS, so I made
>>>> dmabuf implement copy_file_range callbacks to support direct I/O
>>>> zero-copy. I'm open to both approaches. What's the preference of VFS
>>>> experts?
>>>
>>> That would probably be illegal. Using the sg_table in the DMA-buf
>>> implementation turned out to be a mistake.
>>
>> Two things here that should not be directly conflated.  Using the sg_table was
>> a huge mistake, and we should try to move dmabuf to switch that to a pure
> I'm a bit confused: don't dmabuf importers need to traverse sg_table to
> access folios or dma_addr/len? Do you mean restricting sg_table access
> (e.g., only via iov_iter) or proposing alternative approaches?

No, accessing the pages/folios inside the sg_table of a DMA-buf is strictly forbidden.

We have removed most of those use cases over the years and push back on new ones being added.
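
To illustrate, a minimal sketch (not from the original mail) of the access pattern importers are expected to stick to, touching only the DMA addresses and lengths of the mapped sg_table and never sg_page()/folios; the helpers used are from <linux/scatterlist.h>:

#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

/* Hedged sketch: consume only the DMA side of a mapped DMA-buf sg_table. */
static void importer_walk_dma(struct sg_table *sgt)
{
	struct scatterlist *sg;
	int i;

	for_each_sgtable_dma_sg(sgt, sg, i) {
		dma_addr_t addr = sg_dma_address(sg);
		unsigned int len = sg_dma_len(sg);

		/* program addr/len into the device, e.g. build a HW descriptor */
	}
}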

> 
>> dma_addr_t/len array now that the new DMA API supporting that has been
>> merged.  Is there any chance the dma-buf maintainers could start to kick this
>> off?  I'm of course happy to assist.

Work on that has already been underway for some time.

Most GPU drivers already do the sg_table -> DMA address array conversion; I need to push on the remaining ones to clean this up.

But there are also tons of other users of dma_buf_map_attachment() which need to be converted.
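
For reference, the call sites in question look roughly like this; a sketch of the existing dma_buf_attach()/dma_buf_map_attachment() importer pattern (function names as in <linux/dma-buf.h>), which still hands back an sg_table and would need moving to a dma_addr_t/len based interface:

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/err.h>

/* Hedged sketch of today's importer flow, not a definitive implementation. */
static int importer_map(struct device *dev, struct dma_buf *dmabuf)
{
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;

	attach = dma_buf_attach(dmabuf, dev);
	if (IS_ERR(attach))
		return PTR_ERR(attach);

	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt)) {
		dma_buf_detach(dmabuf, attach);
		return PTR_ERR(sgt);
	}

	/* ... use only the DMA addresses/lengths of sgt ... */

	dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
	dma_buf_detach(dmabuf, attach);
	return 0;
}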

>> But that notwithstanding, dma-buf is THE buffer sharing mechanism in the
>> kernel, and we should promote it instead of reinventing it badly.
>> And there is a use case for having a fully DMA mapped buffer in the block
>> layer and I/O path, especially on systems with an IOMMU.
>> So having an iov_iter backed by a dma-buf would be extremely helpful.
>> That's mostly lib/iov_iter.c code, not VFS, though.
> Are you suggesting adding an ITER_DMABUF type to iov_iter, or
> implementing dmabuf-to-iov_bvec conversion within iov_iter?

That would be rather nice to have, yeah.
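
As a purely hypothetical sketch (ITER_DMABUF and iov_iter_dmabuf() do not exist in the kernel today; the names are made up here for illustration), such an iterator would probably be set up along these lines in lib/iov_iter.c and then consumed by the block layer like any other iov_iter:

struct dma_buf;
struct iov_iter;

/*
 * Hypothetical only: a DMA-buf backed iov_iter initializer, covering
 * "count" bytes of the buffer starting at "offset".
 */
void iov_iter_dmabuf(struct iov_iter *i, unsigned int direction,
		     struct dma_buf *dmabuf, loff_t offset, size_t count);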

> 
>>
>>> The question Christoph raised was rather why is your CPU so slow that
>>> walking the page tables has a significant overhead compared to the
>>> actual I/O?
>>
>> Yes, that's really puzzling and should be addressed first.
> With a fast CPU (e.g., 3GHz), GUP (get_user_pages) overhead is relatively
> low, as the 3GHz results below show.

Even on a low-end CPU, walking the page tables and grabbing references shouldn't be that much of an overhead.

There must be some reason why you see so much CPU overhead, e.g. compound pages being broken up or similar, which should not happen in the first place.
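
For context, a rough sketch (not from the mail) of where that GUP cost sits in the direct I/O path: every 4K page of the user buffer has to be walked and pinned, so a 32MB read pins on the order of 8192 pages per buffer unless large folios let the walk batch up:

#include <linux/mm.h>

/*
 * Hedged sketch: pin the pages backing a user buffer, as the direct I/O
 * path does before issuing the bio. Returns the number of pages pinned
 * or a negative errno.
 */
static int pin_user_buffer(unsigned long uaddr, size_t len, struct page **pages)
{
	int nr_pages = DIV_ROUND_UP(len, PAGE_SIZE);

	/* FOLL_WRITE because a read(2) into the buffer writes to it */
	return pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
}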

Regards,
Christian.


> |    32x32MB Read 1024MB    |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
> |---------------------------|--------|--------|--------|--------|-----
> | 1)        memfd direct R/W|      1 |    118 |    312 |   3448 | 100%
> | 2)      u+memfd direct R/W|    196 |    123 |    295 |   3651 | 105%
> | 3) u+memfd direct sendfile|    175 |    102 |    976 |   1100 |  31%
> | 4)   u+memfd direct splice|    173 |    103 |    443 |   2428 |  70%
> | 5)      udmabuf buffer R/W|    183 |    100 |    453 |   2375 |  68%
> | 6)       dmabuf buffer R/W|     34 |      4 |    427 |   2519 |  73%
> | 7)    udmabuf direct c_f_r|    200 |    102 |    278 |   3874 | 112%
> | 8)     dmabuf direct c_f_r|     36 |      5 |    269 |   4002 | 116%
> 
> With a slower CPU (e.g., 1GHz), GUP overhead becomes more significant,
> as the 1GHz results below show.
> |    32x32MB Read 1024MB    |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
> |---------------------------|--------|--------|--------|--------|-----
> | 1)        memfd direct R/W|      2 |    393 |    969 |   1109 | 100%
> | 2)      u+memfd direct R/W|    592 |    424 |    570 |   1884 | 169%
> | 3) u+memfd direct sendfile|    587 |    356 |   2229 |    481 |  43%
> | 4)   u+memfd direct splice|    568 |    352 |    795 |   1350 | 121%
> | 5)      udmabuf buffer R/W|    597 |    343 |   1238 |    867 |  78%
> | 6)       dmabuf buffer R/W|     69 |     13 |   1128 |    952 |  85%
> | 7)    udmabuf direct c_f_r|    595 |    345 |    372 |   2889 | 260%
> | 8)     dmabuf direct c_f_r|     80 |     13 |    274 |   3929 | 354%
> 
> Regards,
> Wangtao.


