[PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

Petr Tesařík petr at tesarici.cz
Wed May 17 06:35:10 UTC 2023


Hi Catalin,

On Tue, 16 May 2023 18:59:30 +0100
Catalin Marinas <catalin.marinas at arm.com> wrote:

> On Tue, May 16, 2023 at 08:39:42AM +0200, Petr Tesařík wrote:
> > On Tue, 16 May 2023 08:13:09 +0200
> > Christoph Hellwig <hch at lst.de> wrote:  
> > > On Mon, May 15, 2023 at 07:43:52PM +0000, Michael Kelley (LINUX) wrote:  
> > > > FWIW, I don't think the approach you have implemented here will be
> > > > practical to use for CoCo VMs (SEV, TDX, whatever else).  The problem
> > > > is that dma_direct_alloc_pages() and dma_direct_free_pages() must
> > > > call dma_set_decrypted() and dma_set_encrypted(), respectively.  In CoCo
> > > > VMs, these calls are expensive because they require a hypercall to the host,
> > > > and the operation on the host isn't trivial either.  I haven't measured the
> > > > overhead, but doing a hypercall on every DMA map operation and on
> > > > every unmap operation has long been something we thought we must
> > > > avoid.  The fixed swiotlb bounce buffer space solves this problem by
> > > > doing set_decrypted() in batch at boot time, and never
> > > > doing set_encrypted().    
> > > 
> > > I also suspect it doesn't really scale too well due to the number of
> > > allocations.  I suspect a better way to implement things would be to
> > > add more large chunks that are used just like the main swiotlb buffers.
> > > 
> > > That is when we run out of space try to allocate another chunk of the
> > > same size in the background, similar to what we do with the pool in
> > > dma-pool.c.  This means we'll do a fairly large allocation, so we'll
> > > need compaction or even CMA to back it up, but the other big upside
> > > is that it also reduces the number of buffers that need to be checked
> > > in is_swiotlb_buffer or the free / sync side.  
> > 
> > I have considered this approach. The two main issues I ran into were:
> > 
> > 1. MAX_ORDER allocations were too small (at least with 4K pages), and
> >    even then they would often fail.
> > 
> > 2. Allocating from CMA did work but only from process context.
> >    I made a stab at modifying the CMA allocator to work from interrupt
> >    context, but there are non-trivial interactions with the buddy
> >    allocator. Making them safe from interrupt context looked like a
> >    major task.  
> 
> Can you kick off a worker thread when the swiotlb allocation gets past
> some reserve limit? It still has a risk of failing to bounce until the
> swiotlb buffer is extended.

Yes, this can be done, and some form of a worker thread is in fact on
my roadmap. Initially, I want to see the impact of a "dumb" algorithm
and get some feedback from people like you. I'm glad your ideas are moving
in a similar direction to mine. :-)
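
As a rough sketch of the idea (not code from this series; swiotlb_add_chunk(),
dyn_used and dyn_total are made-up names), the map path could merely bump a
counter and kick a worker once usage crosses a watermark, so the expensive
allocation always runs in process context:

#include <linux/workqueue.h>
#include <linux/atomic.h>
#include <linux/gfp.h>

static atomic_t dyn_used;       /* bounce slots currently in use */
static atomic_t dyn_total;      /* bounce slots currently available */

/* Hypothetical helper: allocate and register one more chunk. */
static int swiotlb_add_chunk(gfp_t gfp);

/* Runs in process context, so large allocations, CMA and batched
 * set_memory_decrypted() calls are all permissible here. */
static void swiotlb_grow_workfn(struct work_struct *work)
{
        swiotlb_add_chunk(GFP_KERNEL);
}
static DECLARE_WORK(swiotlb_grow_work, swiotlb_grow_workfn);

/* Called from the map path, possibly in atomic context. */
static void swiotlb_note_alloc(void)
{
        /* Kick the worker once more than ~75% of the slots are used. */
        if (atomic_inc_return(&dyn_used) * 4 > atomic_read(&dyn_total) * 3)
                schedule_work(&swiotlb_grow_work);
}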

> > I also had some fears about the length of the dynamic buffer list. I
> > observed maximum length for block devices, and then it roughly followed
> > the queue depth. Walking a few hundred buffers was still fast enough.
> > I admit the list length may become an issue with high-end NVMe and
> > I/O-intensive applications.  
> 
> You could replace the list with an rbtree, O(log n) look-up vs O(n),
> could be faster if you have many bounces active.

I could also do that for the individual allocations, and in fact I did.

First, the paddr lookup does not search for a specific key in the tree,
but rather for a match within a range. The maple tree was invented for
exactly this purpose, so that's what I tried. There are some caveats
when using maple trees from interrupt context, but I made the necessary
modifications in my local tree.
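
For reference, a minimal sketch of what that indexing looks like.
mtree_store_range(), mtree_load() and mtree_erase() are the stock maple
tree API; struct tlb_dyn_slot and the function names are made up for
illustration, and the interrupt-context caveats mentioned above are not
shown:

#include <linux/maple_tree.h>
#include <linux/types.h>

static DEFINE_MTREE(swiotlb_dyn_slots);         /* illustrative name */

/* Hypothetical per-buffer descriptor, keyed by its physical range. */
struct tlb_dyn_slot {
        phys_addr_t     paddr;
        size_t          size;
};

static int swiotlb_dyn_track(struct tlb_dyn_slot *slot, gfp_t gfp)
{
        return mtree_store_range(&swiotlb_dyn_slots, slot->paddr,
                                 slot->paddr + slot->size - 1, slot, gfp);
}

/* Lookup by any address inside the mapped range, not by an exact key. */
static struct tlb_dyn_slot *swiotlb_dyn_find(phys_addr_t paddr)
{
        return mtree_load(&swiotlb_dyn_slots, paddr);
}

static void swiotlb_dyn_untrack(struct tlb_dyn_slot *slot)
{
        mtree_erase(&swiotlb_dyn_slots, slot->paddr);
}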

I ran my tests against a SATA virtio disk in an x86-64 VM booted with
swiotlb=force. The results, compared to the plain linked list, were:

  small-rw       -6.6%  (4KiB, 1 thread)
  parallel-rw   -10.5%  (64KiB, 4 threads)
  big-rw         -8.5%  (1MiB, 1 thread)

Of course, these numbers say nothing about the performance of a maple
tree for tracking additional swiotlb chunks, because the mix of
additions, deletions and lookups will be different, but my point is
that a "better" data structure may not always be better.

My testing also suggests that the lookup path is extremely hot. It was
very sensitive even to small changes (like moving the flag setting
before deletion).

Anyway, my greatest objection to allocating additional swiotlb chunks is
that _all_ of them must be searched just to determine that a physical
address does _not_ belong to the swiotlb, incurring a performance penalty
for non-constrained (presumably fast) devices that do not need a swiotlb
at all. Sure, this is irrelevant for CoCo VMs, where all devices must use
the swiotlb, but we have seen other scenarios which benefit from a
dynamically sized swiotlb. It's a bit frustrating if a cheap wifi adapter
reduces your disk performance...
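
To illustrate why this lookup matters: the single-pool test is roughly
the two comparisons below, whereas a list of extra chunks turns it into
a walk that every unmap/sync of every device pays for. struct
io_tlb_chunk and the io_tlb_chunks list are invented here purely to show
the shape of that walk:

#include <linux/device.h>
#include <linux/list.h>
#include <linux/swiotlb.h>

/* Roughly what the current single-pool is_swiotlb_buffer() boils down
 * to: two compares against one per-device pointer. */
static inline bool single_pool_check(struct device *dev, phys_addr_t paddr)
{
        struct io_tlb_mem *mem = dev->dma_io_tlb_mem;

        return mem && paddr >= mem->start && paddr < mem->end;
}

/* Hypothetical multi-chunk variant: even devices that never bounce must
 * walk the whole list just to prove that the address is *not* in any
 * swiotlb chunk. */
struct io_tlb_chunk {                   /* made-up descriptor */
        struct list_head node;
        phys_addr_t start, end;
};
static LIST_HEAD(io_tlb_chunks);        /* made-up global list */

static bool multi_chunk_check(struct device *dev, phys_addr_t paddr)
{
        struct io_tlb_chunk *chunk;

        list_for_each_entry(chunk, &io_tlb_chunks, node)
                if (paddr >= chunk->start && paddr < chunk->end)
                        return true;
        return false;
}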

Besides, if the list or tree of swiotlb chunks is protected with a
lock, this lock becomes contended.

Last but not least, the allocation size is dynamic in theory, but in
practice it would most likely never shrink: a swiotlb chunk can be freed
only when it is completely unused, which may never happen after a usage
spike, because some mappings are rather long-lived (which is probably
wrong, but it's the current state of affairs).

Much of the above can be solved if additional swiotlb chunks are
allocated per device. OTOH I am a bit concerned about increasing memory
requirements. After all, one of the use cases is to reduce kernel
memory footprint on mobile devices.

To say something positive, I have also found one upside to additional
swiotlb chunks: they make it possible to meet all alignment and
boundary-crossing constraints (unlike plain CMA allocations).
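
For illustration, "meeting the constraints" boils down to checks like the
sketch below when picking a slot inside a chunk. dma_get_min_align_mask()
and dma_get_seg_boundary() are the real device queries; dyn_buffer_fits()
itself is a made-up helper:

#include <linux/dma-mapping.h>

/* Sketch only: does a candidate bounce buffer at @paddr of @size bytes
 * satisfy the device's DMA constraints for a mapping of @orig_paddr?
 * swiotlb can keep searching its slots until this holds; a plain CMA
 * allocation gives no such control. */
static bool dyn_buffer_fits(struct device *dev, phys_addr_t orig_paddr,
                            phys_addr_t paddr, size_t size)
{
        phys_addr_t boundary = dma_get_seg_boundary(dev);
        unsigned int align_mask = dma_get_min_align_mask(dev);

        /* Low bits must match the original buffer (min alignment mask). */
        if ((paddr & align_mask) != (orig_paddr & align_mask))
                return false;

        /* The buffer must not cross the device's segment boundary. */
        if ((paddr & ~boundary) != ((paddr + size - 1) & ~boundary))
                return false;

        return true;
}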

Petr T

