[PATCH 1/2] dma-buf: heaps: DMA_HEAP_IOCTL_ALLOC_READ_FILE framework

Huan Yang link at vivo.com
Wed Jul 17 07:33:20 UTC 2024


在 2024/7/16 20:07, Christian König 写道:
> Am 16.07.24 um 11:31 schrieb Daniel Vetter:
>> On Tue, Jul 16, 2024 at 10:48:40AM +0800, Huan Yang wrote:
>>> I just research the udmabuf, Please correct me if I'm wrong.
>>>
>>> 在 2024/7/15 20:32, Christian König 写道:
>>>> Am 15.07.24 um 11:11 schrieb Daniel Vetter:
>>>>> On Thu, Jul 11, 2024 at 11:00:02AM +0200, Christian König wrote:
>>>>>> Am 11.07.24 um 09:42 schrieb Huan Yang:
>>>>>>> Some user may need load file into dma-buf, current
>>>>>>> way is:
>>>>>>>      1. allocate a dma-buf, get dma-buf fd
>>>>>>>      2. mmap dma-buf fd into vaddr
>>>>>>>      3. read(file_fd, vaddr, fsz)
>>>>>>> This is too heavy if fsz reached to GB.
>>>>>> You need to describe a bit more why that is to heavy. I can only
>>>>>> assume you
>>>>>> need to save memory bandwidth and avoid the extra copy with the CPU.
>>>>>>
>>>>>>> This patch implement a feature called DMA_HEAP_IOCTL_ALLOC_READ_FILE.
>>>>>>> User need to offer a file_fd which you want to load into
>>>>>>> dma-buf, then,
>>>>>>> it promise if you got a dma-buf fd, it will contains the file content.
>>>>>> Interesting idea, that has at least more potential than trying
>>>>>> to enable
>>>>>> direct I/O on mmap()ed DMA-bufs.
>>>>>>
>>>>>> The approach with the new IOCTL might not work because it is a very
>>>>>> specialized use case.
>>>>>>
>>>>>> But IIRC there was a copy_file_range callback in the file_operations
>>>>>> structure you could use for that. I'm just not sure when and how
>>>>>> that's used
>>>>>> with the copy_file_range() system call.
>>>>> I'm not sure any of those help, because internally they're all still
>>>>> based
>>>>> on struct page (or maybe in the future on folios). And that's the thing
>>>>> dma-buf can't give you, at least without peaking behind the curtain.
>>>>>
>>>>> I think an entirely different option would be malloc+udmabuf. That
>>>>> essentially handles the impendence-mismatch between direct I/O and
>>>>> dma-buf
>>>>> on the dma-buf side. The downside is that it'll make the permanently
>>>>> pinned memory accounting and tracking issues even more apparent, but I
>>>>> guess eventually we do need to sort that one out.
>>>> Oh, very good idea!
>>>> Just one minor correction: it's not malloc+udmabuf, but rather
>>>> create_memfd()+udmabuf.
>> Hm right, it's create_memfd() + mmap(memfd) + udmabuf
>>
>>>> And you need to complete your direct I/O before creating the udmabuf
>>>> since that reference will prevent direct I/O from working.
>>> udmabuf will pin all pages, so, if returned fd, can't trigger direct I/O
>>> (same as dmabuf). So, must complete read before pin it.
>> Why does pinning prevent direct I/O? I haven't tested, but I'd expect the
>> rdma folks would be really annoyed if that's the case ...

I used to believe that a pinned page cannot be re-pinned, so performing 
direct I/O on it would fail.  But I misunderstood, and it doesn't have 
any impact.

dma-buf mmap vaddr can't to trigger direct I/O due to can't pin kernel 
page(PFN), So, not same.


>
> Pinning (or rather taking another page reference) prevents writes from 
> using direct I/O because writes try to find all references and make 
> them read only so that nobody modifies the content while the write is 
> done.
>
> As far as I know the same approach is used for NUMA migration and 
> replacing small pages with big ones in THP. But for the read case here 
> it should still work.

Hmm, with udmabuf direct I/O test, I find this will not effect it. Test 
code  I set in email tail. Maybe pin only let page can't be reclaimed, 
rather prevent the write?



With mine test, udmabuf direct I/O read 3GB file, average cost 2.2s.(I 
use ftrace to trace f2fs_direct_IO can make sure direct IO trigger 
success),  Same as mine normal cache file read cost

My patchset average is 1.2s,The difference between the two was obvious 
before.

>
>>> But current way is use `memfd_pin_folios` to boost alloc and pin, so maybe
>>> need suit it.
>>>
>>>
>>> I currently doubt that the udmabuf solution is suitable for our
>>> gigabyte-level read operations.
>>>
>>> 1. The current mmap operation uses faulting, so frequent page faults will be
>>> triggered during reads, resulting in a lot of context switching overhead.
>>>
>>> 2. current udmabuf size limit is 64MB, even can change, maybe not good to
>>> use in large size?
>> Yeah that's just a figleaf so we don't have to bother about the accounting
>> issue.
>>
>>> 3. The migration and adaptation of the driver is also a challenge, and
>>> currently, we are unable to control it.
>> Why does a udmabuf fd not work instead of any other dmabuf fd? That
>> shouldn't matter for the consuming driver ...
>>
>>> Perhaps implementing `copy_file_range` would be more suitable for us.
>> See my other mail, fundamentally these all rely on struct page being
>> present, and dma-buf doesn't give you that. Which means you need to go
>> below the dma-buf abstraction. And udmabuf is pretty much the thing for
>> that, because it wraps normal struct page memory into a dmabuf.
>>
>> And copy_file_range on the underlying memfd might already work, I haven't
>> checked though.
>
> Yeah completely agree.
>
> Regards,
> Christian.
>
>> Cheers, Sima
>
Test code, if test above 2GB, need this patch:

https://lore.kernel.org/all/20240717065444.369876-1-link@vivo.com/

```c

// SPDX-License-Identifier: GPL-2.0
#define _GNU_SOURCE
#define __EXPORTED_HEADERS__

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <dirent.h>
#include <malloc.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <time.h>
#include <linux/memfd.h>
#include <linux/udmabuf.h>

#define TEST_PREFIX    "drivers/dma-buf/udmabuf"

// static int memfd_create(const char *name, unsigned int flags)
// {
//     return syscall(__NR_memfd_create, name, flags);
// }

int main(void)
{
     struct udmabuf_create create;
     int devfd, memfd, buf, ret;
     unsigned long size;
         int filefd;
         struct timespec ts_start, ts_end;
     long long start, end;

         clock_gettime(CLOCK_MONOTONIC, &ts_start);

     devfd = open("/dev/udmabuf", O_RDWR);
     if (devfd < 0) {
         printf("%s: [skip,no-udmabuf: Unable to access DMA buffer 
device file]\n",
                TEST_PREFIX);
         exit(77);
     }

     memfd = memfd_create("udmabuf-test", MFD_ALLOW_SEALING);
     if (memfd < 0) {
         printf("%s: [skip,no-memfd]\n", TEST_PREFIX);
         exit(77);
     }

     ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);
     if (ret < 0) {
         printf("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX);
         exit(77);
     }

         filefd = open("/data/model.txt", O_RDONLY | O_DIRECT);
         if (filefd < 0) {
                 printf("%s: [failed to open model.txt]\n", TEST_PREFIX);
                 exit(77);
         }

         struct stat ftat;
         fstat(filefd, &ftat);
         size = (ftat.st_size + getpagesize()) & ~(getpagesize());

     ret = ftruncate(memfd, size);
     if (ret == -1) {
         printf("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX);
         exit(1);
     }

     memset(&create, 0, sizeof(create));

     /* should work */
     create.memfd  = memfd;
     create.offset = 0;
     create.size   = size;
     buf = ioctl(devfd, UDMABUF_CREATE, &create);
     if (buf < 0) {
         printf("%s: [FAIL,test-4]\n", TEST_PREFIX);
         exit(1);
     }

     // fprintf(stderr, "%s: ok\n", TEST_PREFIX);

         void *vaddr = mmap(NULL, size, PROT_WRITE | PROT_READ,
                          MAP_SHARED, memfd, 0);
         if (!vaddr) {
                 printf("%s: [FAIL, mmap]\n", TEST_PREFIX);
                 exit(77);
         }

         unsigned long rsize = size;
         unsigned long bytes = 0;
         while (bytes != size) {
                 ssize_t rb = read(filefd, vaddr, rsize);
                 if (rb < 0) {
                         printf("%s: [FAIL, read]\n", TEST_PREFIX);
                         exit(77);
                 }
                 rsize -= rb;
                 bytes += rb;

         }
         munmap(vaddr, size);
         clock_gettime(CLOCK_MONOTONIC, &ts_end);

#define NSEC_PER_SEC 1000000000LL
         start = ts_start.tv_sec * NSEC_PER_SEC + ts_start.tv_nsec;
         end = ts_end.tv_sec * NSEC_PER_SEC + ts_end.tv_nsec;

         printf("total cost %lld ns\n", end - start);

         printf("going to check content\n");
         void *fvaddr = mmap(NULL, size, PROT_READ, MAP_SHARED, filefd, 0);
         if (!fvaddr) {
                 printf("%s: [FAIL, mmap file]\n", TEST_PREFIX);
                 exit(77);
         }
         vaddr = mmap(NULL, size, PROT_READ, MAP_SHARED, buf, 0);
         if (!vaddr) {
                 printf("%s: [FAIL, mmap dmabuf]\n", TEST_PREFIX);
                 exit(77);
         }

         if (memcmp(fvaddr, vaddr, size) != 0) {
                 printf("%s: [FAIL, content is not same]\n", TEST_PREFIX);
                 exit(77);
         }

         printf("%s: [SUCCESS, content is same]\n", TEST_PREFIX);
         munmap(vaddr, size);
         munmap(fvaddr, size);
         close(filefd);
     close(buf);
     close(memfd);
     close(devfd);
     return 0;
}

```



More information about the dri-devel mailing list