[RFC PATCH v3 05/12] netdev: netdevice devmem allocator

Pavel Begunkov asml.silence at gmail.com
Fri Nov 10 14:26:46 UTC 2023


On 11/7/23 23:03, Mina Almasry wrote:
> On Tue, Nov 7, 2023 at 2:55 PM David Ahern <dsahern at kernel.org> wrote:
>>
>> On 11/7/23 3:10 PM, Mina Almasry wrote:
>>> On Mon, Nov 6, 2023 at 3:44 PM David Ahern <dsahern at kernel.org> wrote:
>>>>
>>>> On 11/5/23 7:44 PM, Mina Almasry wrote:
>>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>>> index eeeda849115c..1c351c138a5b 100644
>>>>> --- a/include/linux/netdevice.h
>>>>> +++ b/include/linux/netdevice.h
>>>>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>>>>>   };
>>>>>
>>>>>   #ifdef CONFIG_DMA_SHARED_BUFFER
>>>>> +struct page_pool_iov *
>>>>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
>>>>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
>>>>
>>>> netdev_{alloc,free}_dmabuf?
>>>>
>>>
>>> Can do.
>>>
>>>> I say that because a dmabuf can be host memory, at least I am not aware
>>>> of a restriction that a dmabuf is device memory.
>>>>
>>>
>>> In my limited experience dma-buf is generally device memory, and
>>> that's really its use case. CONFIG_UDMABUF is a driver that mocks
>>> dma-buf with a memfd which I think is used for testing. But I can do
>>> the rename, it's more clear anyway, I think.
>>
>> config UDMABUF
>>          bool "userspace dmabuf misc driver"
>>          default n
>>          depends on DMA_SHARED_BUFFER
>>          depends on MEMFD_CREATE || COMPILE_TEST
>>          help
>>            A driver to let userspace turn memfd regions into dma-bufs.
>>            Qemu can use this to create host dmabufs for guest framebuffers.
>>
>>
>> Qemu is just a userspace process; it is no way a special one.
>>
>> Treating host memory as a dmabuf should radically simplify the io_uring
>> extension of this set.
> 
> I agree actually, and I was about to make that comment to David Wei's
> series once I have the time.
> 
> David, your io_uring RX zerocopy proposal actually works with devmem
> TCP, if you're inclined to do that instead, what you'd do roughly is
> (I think):
That would be a Frankenstein's monster api with no good reason for it.
You bind memory via netlink because you don't have a proper context to
work with otherwise, io_uring serves as the context with a separate and
precise abstraction around queues. Same with dmabufs, it totally makes
sense for device memory, but wrapping host memory into a file just to
immediately unwrap it back with no particular benefits from doing so
doesn't seem like a good uapi. Currently, the difference will be
hidden by io_uring.

And we'd still need to have a hook in pp's get page to grab buffers from
the buffer ring instead of refilling via SO_DEVMEM_DONTNEED and a
callback for when skbs are dropped. It's just instead of a new pp ops
it'll be a branch in the devmem path. io_uring might want to use the
added iov format in the future for device memory or even before that,
io_uring doesn't really care whether it's pages or not.

It's also my big concern from how many optimisations it'll fence us off.
With the current io_uring RFC I can get rid of all buffer atomic
refcounting and replace it with a single percpu counting per skb.
Hopefully, that will be doable after we place it on top of pp providers.


> - Allocate a memfd,
> - Use CONFIG_UDMABUF to create a dma-buf out of that memfd.
> - Bind the dma-buf to the NIC using the netlink API in this RFC.
> - Your io_uring extensions and io_uring uapi should work as-is almost
> on top of this series, I think.
> 
> If you do this the incoming packets should land into your memfd, which
> may or may not work for you. In the future if you feel inclined to use
> device memory, this approach that I'm describing here would be more
> extensible to device memory, because you'd already be using dma-bufs
> for your user memory; you'd just replace one kind of dma-buf (UDMABUF)
> with another.
> 
>> That the io_uring set needs to dive into
>> page_pools is just wrong - complicating the design and code and pushing
>> io_uring into a realm it does not need to be involved in.

I disagree. How does it complicate it? io_uring will be just a yet another
provider implementing the callbacks of the API created for such use cases
and not changing common pp/net bits. The rest of code is in io_uring
implementing interaction with userspace and other usability features, but
there will be anyway some amount of code if we want to have a convenient
and performant api via io_uring.


>>
>> Most (all?) of this patch set can work with any memory; only device
>> memory is unreadable.

-- 
Pavel Begunkov


More information about the dri-devel mailing list