[PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag

Tue Jul 30 10:43:32 UTC 2024

Am 30.07.24 um 10:46 schrieb Huan Yang:
>
> 在 2024/7/30 16:37, Christian König 写道:
>> Am 30.07.24 um 10:14 schrieb Huan Yang:
>>> 在 2024/7/30 16:03, Christian König 写道:
>>>> Am 30.07.24 um 09:57 schrieb Huan Yang:
>>>>> Background
>>>>> ====
>>>>> Some user may need load file into dma-buf, current way is:
>>>>>    1. allocate a dma-buf, get dma-buf fd
>>>>>    2. mmap dma-buf fd into user vaddr
>>>>>    3. read(file_fd, vaddr, fsz)
>>>>> Due to dma-buf user map can't support direct I/O[1], the file read
>>>>> must be buffer I/O.
>>>>>
>>>>> This means that during the process of reading the file into dma-buf,
>>>>> page cache needs to be generated, and the corresponding content 
>>>>> needs to
>>>>> be first copied to the page cache before being copied to the dma-buf.
>>>>>
>>>>> This way worked well when reading relatively small files before, as
>>>>> the page cache can cache the file content, thus improving 
>>>>> performance.
>>>>>
>>>>> However, there are new challenges currently, especially as AI 
>>>>> models are
>>>>> becoming larger and need to be shared between DMA devices and the CPU
>>>>> via dma-buf.
>>>>>
>>>>> For example, our 7B model file size is around 3.4GB. Using the
>>>>> previous would mean generating a total of 3.4GB of page cache
>>>>> (even if it will be reclaimed), and also requiring the copying of 
>>>>> 3.4GB
>>>>> of content between page cache and dma-buf.
>>>>>
>>>>> Due to the limited resources of system memory, files in the 
>>>>> gigabyte range
>>>>> cannot persist in memory indefinitely, so this portion of page 
>>>>> cache may
>>>>> not provide much assistance for subsequent reads. Additionally, the
>>>>> existence of page cache will consume additional system resources 
>>>>> due to
>>>>> the extra copying required by the CPU.
>>>>>
>>>>> Therefore, I think it is necessary for dma-buf to support direct I/O.
>>>>>
>>>>> However, direct I/O file reads cannot be performed using the buffer
>>>>> mmaped by the user space for the dma-buf.[1]
>>>>>
>>>>> Here are some discussions on implementing direct I/O using dma-buf:
>>>>>
>>>>> mmap[1]
>>>>> ---
>>>>> dma-buf never support user map vaddr use of direct I/O.
>>>>>
>>>>> udmabuf[2]
>>>>> ---
>>>>> Currently, udmabuf can use the memfd method to read files into
>>>>> dma-buf in direct I/O mode.
>>>>>
>>>>> However, if the size is large, the current udmabuf needs to adjust 
>>>>> the
>>>>> corresponding size_limit(default 64MB).
>>>>> But using udmabuf for files at the 3GB level is not a very good 
>>>>> approach.
>>>>> It needs to make some adjustments internally to handle this.[3] Or 
>>>>> else,
>>>>> fail create.
>>>>>
>>>>> But, it is indeed a viable way to enable dma-buf to support direct 
>>>>> I/O.
>>>>> However, it is necessary to initiate the file read after the 
>>>>> memory allocation
>>>>> is completed, and handle race conditions carefully.
>>>>>
>>>>> sendfile/splice[4]
>>>>> ---
>>>>> Another way to enable dma-buf to support direct I/O is by 
>>>>> implementing
>>>>> splice_write/write_iter in the dma-buf file operations (fops) to 
>>>>> adapt
>>>>> to the sendfile method.
>>>>> However, the current sendfile/splice calls are based on pipe. When 
>>>>> using
>>>>> direct I/O to read a file, the content needs to be copied to the 
>>>>> buffer
>>>>> allocated by the pipe (default 64KB), and then the dma-buf fops'
>>>>> splice_write needs to be called to write the content into the 
>>>>> dma-buf.
>>>>> This approach requires serially reading the content of file pipe size
>>>>> into the pipe buffer and then waiting for the dma-buf to be written
>>>>> before reading the next one.(The I/O performance is relatively weak
>>>>> under direct I/O.)
>>>>> Moreover, due to the existence of the pipe buffer, even when using
>>>>> direct I/O and not needing to generate additional page cache,
>>>>> there still needs to be a CPU copy.
>>>>>
>>>>> copy_file_range[5]
>>>>> ---
>>>>> Consider of copy_file_range, It only supports copying files within 
>>>>> the
>>>>> same file system. Similarly, it is not very practical.
>>>>>
>>>>>
>>>>> So, currently, there is no particularly suitable solution on VFS to
>>>>> allow dma-buf to support direct I/O for large file reads.
>>>>>
>>>>> This patchset provides an idea to complete file reads when 
>>>>> requesting a
>>>>> dma-buf.
>>>>>
>>>>> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>>> ===
>>>>> This patch provides a method to immediately read the file content 
>>>>> after
>>>>> the dma-buf is allocated, and only returns the dma-buf file 
>>>>> descriptor
>>>>> after the file is fully read.
>>>>>
>>>>> Since the dma-buf file descriptor is not returned, no other thread 
>>>>> can
>>>>> access it except for the current thread, so we don't need to worry 
>>>>> about
>>>>> race conditions.
>>>>
>>>> That is a completely false assumption.
>>> Can you provide a detailed explanation as to why this assumption is 
>>> incorrect? thanks.
>>
>> File descriptors can be guessed and is available to userspace as soon 
>> as dma_buf_fd() is called.
>>
>> What could potentially work is to call system_heap_allocate() without 
>> calling dma_buf_fd(), but I'm not sure if you can then make I/O to 
>> the underlying pages.
>
> Actually, the dma-buf file descriptor is obtained only after a 
> successful file read in the code, so there is no issue.
>
> If you are interested, you can take a look at the 
> dma_heap_buffer_alloc_and_read function in patch1.
>
>>
>>>>
>>>>>
>>>>> Map the dma-buf to the vmalloc area and initiate file reads in kernel
>>>>> space, supporting both buffer I/O and direct I/O.
>>>>>
>>>>> This patch adds the DMA_HEAP_ALLOC_AND_READ heap_flag for user.
>>>>> When a user needs to allocate a dma-buf and read a file, they should
>>>>> pass this heap flag. As the size of the file being read is fixed, 
>>>>> there is no
>>>>> need to pass the 'len' parameter. Instead, The file_fd needs to be 
>>>>> passed to
>>>>> indicate to the kernel the file that needs to be read.
>>>>>
>>>>> The file open flag determines the mode of file reading.
>>>>> But, please note that if direct I/O(O_DIRECT) is needed to read 
>>>>> the file,
>>>>> the file size must be page aligned. (with patch 2-5, no need)
>>>>>
>>>>> Therefore, for the user, len and file_fd are mutually exclusive,
>>>>> and they are combined using a union.
>>>>>
>>>>> Once the user obtains the dma-buf fd, the dma-buf directly 
>>>>> contains the
>>>>> file content.
>>>>
>>>> And I'm repeating myself, but this is a complete NAK from my side 
>>>> to this approach.
>>>>
>>>> We pointed out multiple ways of how to implement this cleanly and 
>>>> not by hacking functionality into the kernel which absolutely 
>>>> doesn't belong there.
>>> In this patchset, I have provided performance comparisons of each of 
>>> these methods.  Can you please provide more opinions?
>>
>> Either drop the whole approach or change udmabuf to do what you want 
>> to do.
> OK, if so, do I need to send a patch to make dma-buf support sendfile?

Well the udmabuf approach doesn't need to use sendfile, so no.

Regards,
Christian.

>
>>
>> Apart from that I don't see a doable way which can be accepted into 
>> the kernel.
> Thanks for your suggestion.
>>
>> Regards,
>> Christian.
>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Patch 1 implement it.
>>>>>
>>>>> Patch 2-5 provides an approach for performance improvement.
>>>>>
>>>>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>>>>> synchronously read files using direct I/O.
>>>>>
>>>>> This approach helps to save CPU copying and avoid a certain degree of
>>>>> memory thrashing (page cache generation and reclamation)
>>>>>
>>>>> When dealing with large file sizes, the benefits of this approach 
>>>>> become
>>>>> particularly significant.
>>>>>
>>>>> However, there are currently some methods that can improve 
>>>>> performance,
>>>>> not just save system resources:
>>>>>
>>>>> Due to the large file size, for example, a AI 7B model of around 
>>>>> 3.4GB, the
>>>>> time taken to allocate DMA-BUF memory will be relatively long. 
>>>>> Waiting
>>>>> for the allocation to complete before reading the file will add to 
>>>>> the
>>>>> overall time consumption. Therefore, the total time for DMA-BUF
>>>>> allocation and file read can be calculated using the formula
>>>>>     T(total) = T(alloc) + T(I/O)
>>>>>
>>>>> However, if we change our approach, we don't necessarily need to wait
>>>>> for the DMA-BUF allocation to complete before initiating I/O. In 
>>>>> fact,
>>>>> during the allocation process, we already hold a portion of the page,
>>>>> which means that waiting for subsequent page allocations to complete
>>>>> before carrying out file reads is actually unfair to the pages 
>>>>> that have
>>>>> already been allocated.
>>>>>
>>>>> The allocation of pages is sequential, and the reading of the file is
>>>>> also sequential, with the content and size corresponding to the file.
>>>>> This means that the memory location for each page, which holds the
>>>>> content of a specific position in the file, can be determined at the
>>>>> time of allocation.
>>>>>
>>>>> However, to fully leverage I/O performance, it is best to wait and
>>>>> gather a certain number of pages before initiating batch processing.
>>>>>
>>>>> The default gather size is 128MB. So, ever gathered can see as a 
>>>>> file read
>>>>> work, it maps the gather page to the vmalloc area to obtain a 
>>>>> continuous
>>>>> virtual address, which is used as a buffer to store the contents 
>>>>> of the
>>>>> corresponding file. So, if using direct I/O to read a file, the file
>>>>> content will be written directly to the corresponding dma-buf 
>>>>> buffer memory
>>>>> without any additional copying.(compare to pipe buffer.)
>>>>>
>>>>> Consider other ways to read into dma-buf. If we assume reading 
>>>>> after mmap
>>>>> dma-buf, we need to map the pages of the dma-buf to the user virtual
>>>>> address space. Also, udmabuf memfd need do this operations too.
>>>>> Even if we support sendfile, the file copy also need buffer, you must
>>>>> setup it.
>>>>> So, mapping pages to the vmalloc area does not incur any additional
>>>>> performance overhead compared to other methods.[6]
>>>>>
>>>>> Certainly, the administrator can also modify the gather size 
>>>>> through patch5.
>>>>>
>>>>> The formula for the time taken for system_heap buffer allocation and
>>>>> file reading through async_read is as follows:
>>>>>
>>>>>    T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>>>>>
>>>>> Compared to the synchronous read:
>>>>>    T(total) = T(alloc) + T(I/O)
>>>>>
>>>>> If the allocation time or I/O time is long, the time difference 
>>>>> will be
>>>>> covered by the maximum value between the allocation and I/O. The 
>>>>> other
>>>>> party will be concealed.
>>>>>
>>>>> Therefore, the larger the size of the file that needs to be read, the
>>>>> greater the corresponding benefits will be.
>>>>>
>>>>> How to use
>>>>> ===
>>>>> Consider the current pathway for loading model files into DMA-BUF:
>>>>>    1. open dma-heap, get heap fd
>>>>>    2. open file, get file_fd(can't use O_DIRECT)
>>>>>    3. use file len to allocate dma-buf, get dma-buf fd
>>>>>    4. mmap dma-buf fd, get vaddr
>>>>>    5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>>    6. share, attach, whatever you want
>>>>>
>>>>> Use DMA_HEAP_ALLOC_AND_READ_FILE JUST a little change:
>>>>>    1. open dma-heap, get heap fd
>>>>>    2. open file, get file_fd(buffer/direct)
>>>>>    3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap 
>>>>> flag, set file_fd
>>>>>       instead of len. get dma-buf fd(contains file content)
>>>>>    4. share, attach, whatever you want
>>>>>
>>>>> So, test it is easy.
>>>>>
>>>>> How to test
>>>>> ===
>>>>> The performance comparison will be conducted for the following 
>>>>> scenarios:
>>>>>    1. normal
>>>>>    2. udmabuf with [3] patch
>>>>>    3. sendfile
>>>>>    4. only patch 1
>>>>>    5. patch1 - patch4.
>>>>>
>>>>> normal:
>>>>>    1. open dma-heap, get heap fd
>>>>>    2. open file, get file_fd(can't use O_DIRECT)
>>>>>    3. use file len to allocate dma-buf, get dma-buf fd
>>>>>    4. mmap dma-buf fd, get vaddr
>>>>>    5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>>    6. share, attach, whatever you want
>>>>>
>>>>> UDMA-BUF step:
>>>>>    1. memfd_create
>>>>>    2. open file(buffer/direct)
>>>>>    3. udmabuf create
>>>>>    4. mmap memfd
>>>>>    5. read file into memfd vaddr
>>>>>
>>>>> Sendfile step(need suit splice_write/write_iter, just use to 
>>>>> compare):
>>>>>    1. open dma-heap, get heap fd
>>>>>    2. open file, get file_fd(buffer/direct)
>>>>>    3. use file len to allocate dma-buf, get dma-buf fd
>>>>>    4. sendfile file_fd to dma-buf fd
>>>>>    6. share, attach, whatever you want
>>>>>
>>>>> patch1/patch1-4:
>>>>>    1. open dma-heap, get heap fd
>>>>>    2. open file, get file_fd(buffer/direct)
>>>>>    3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap 
>>>>> flag, set file_fd
>>>>>       instead of len. get dma-buf fd(contains file content)
>>>>>    4. share, attach, whatever you want
>>>>>
>>>>> You can create a file to test it. Compare the performance gap 
>>>>> between the two.
>>>>> It is best to compare the differences in file size from KB to MB 
>>>>> to GB.
>>>>>
>>>>> The following test data will compare the performance differences 
>>>>> between 512KB,
>>>>> 8MB, 1GB, and 3GB under various scenarios.
>>>>>
>>>>> Performance Test
>>>>> ===
>>>>>    12G RAM phone
>>>>>    UFS4.0(the maximum speed is 4GB/s. ),
>>>>>    f2fs
>>>>>    kernel 6.1 with patch[7] (or else, can't support kvec direct 
>>>>> I/O read.)
>>>>>    no memory pressure.
>>>>>    drop_cache is used for each test.
>>>>>
>>>>> The average of 5 test results:
>>>>> | scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns) | 
>>>>> 3GB(ns)       |
>>>>> | ------------------- | ---------- | ---------- | ------------- | 
>>>>> ------------- |
>>>>> | normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 
>>>>> 3,332,438,754 |
>>>>> | udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000 | 
>>>>> 2,108,419,923 |
>>>>> | sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 
>>>>> 3,062,052,984 |
>>>>> | patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800 | 
>>>>> 2,187,570,861 |
>>>>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 
>>>>> 9,777,661,077 |
>>>>> | patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 
>>>>> 5,648,897,554 |
>>>>> | udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784 | 
>>>>> 2,158,305,738 |
>>>>> | patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538 | 
>>>>> 1,400,006,107 |
>>>
>>> With this test, sendfile can't give a good help base on pipe buffer.
>>>
>>> udmabuf is good, but I think our oem driver can't suit it. (And, 
>>> AOSP do not open this feature)
>>>
>>>
>>> Anyway, I am sending this patchset in the hope of further discussion.
>>>
>>> Thanks.
>>>
>>>>>
>>>>> So, based on the test results:
>>>>>
>>>>> When the file is large, the patchset has the highest performance.
>>>>> Compared to normal, patchset is a 50% improvement;
>>>>> Compared to normal, patch1 only showed a degradation of 41%.
>>>>> patch1 typical performance breakdown is as follows:
>>>>>    1. alloc cost 188,802,693 ns
>>>>>    2. vmap cost 42,491,385 ns
>>>>>    3. file read cost 4,180,876,702 ns
>>>>> Therefore, directly performing a single direct I/O read on a large 
>>>>> file
>>>>> may not be the most optimal way for performance.
>>>>>
>>>>> The performance of direct I/O implemented by the sendfile method 
>>>>> is the worst.
>>>>>
>>>>> When file size is small, The difference in performance is not
>>>>> significant. This is consistent with expectations.
>>>>>
>>>>>
>>>>>
>>>>> Suggested use cases
>>>>> ===
>>>>>    1. When there is a need to read large files and system 
>>>>> resources are scarce,
>>>>>       especially when the size of memory is limited.(GB level) In 
>>>>> this
>>>>>       scenario, using direct I/O for file reading can even bring 
>>>>> performance
>>>>>       improvements.(may need patch2-3)
>>>>>    2. For embedded devices with limited RAM, using direct I/O can 
>>>>> save system
>>>>>       resources and avoid unnecessary data copying. Therefore, 
>>>>> even if the
>>>>>       performance is lower when read small file, it can still be used
>>>>>       effectively.
>>>>>    3. If there is sufficient memory, pinning the page cache of the 
>>>>> model files
>>>>>       in memory and placing file in the EROFS file system for 
>>>>> read-only access
>>>>>       maybe better.(EROFS do not support direct I/O)
>>>>>
>>>>>
>>>>> Changlog
>>>>> ===
>>>>>   v1 [8]
>>>>>   v1->v2:
>>>>>     Uses the heap flag method for alloc and read instead of adding 
>>>>> a new
>>>>>     DMA-buf ioctl command. [9]
>>>>>     Split the patchset to facilitate review and test.
>>>>>       patch 1 implement alloc and read, offer heap flag into it.
>>>>>       patch 2-4 offer async read
>>>>>       patch 5 can change gather limit.
>>>>>
>>>>> Reference
>>>>> ===
>>>>> [1] 
>>>>> https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
>>>>> [2] https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
>>>>> [3] 
>>>>> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>>>> [4] https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
>>>>> [5] https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
>>>>> [6] 
>>>>> https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
>>>>> [7] 
>>>>> https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
>>>>> [8] 
>>>>> https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
>>>>> [9] 
>>>>> https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>>>>>
>>>>> Huan Yang (5):
>>>>>    dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>>>    dma-buf: heaps: Introduce async alloc read ops
>>>>>    dma-buf: heaps: support alloc async read file
>>>>>    dma-buf: heaps: system_heap alloc support async read
>>>>>    dma-buf: heaps: configurable async read gather limit
>>>>>
>>>>>   drivers/dma-buf/dma-heap.c          | 552 
>>>>> +++++++++++++++++++++++++++-
>>>>>   drivers/dma-buf/heaps/system_heap.c |  70 +++-
>>>>>   include/linux/dma-heap.h            |  53 ++-
>>>>>   include/uapi/linux/dma-heap.h       |  11 +-
>>>>>   4 files changed, 673 insertions(+), 13 deletions(-)
>>>>>
>>>>>
>>>>> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
>>>>
>>