DMA-BUFs always uncached on arm64, causing poor camera performance on Librem 5

Thu Jul 10 21:52:42 UTC 2025

On Thu, Jul 10, 2025 at 10:49:19AM +0200, Pavel Machek wrote:
> Hi!
> 
> > > memcpy() from normal memory is about 2msec/1MB. Unfortunately, for
> > > DMA-BUFs it is 20msec/1MB, and that basically means I can't easily do
> > > 760p video recording. Plus, copying full-resolution photo buffer takes
> > > more than 200msec!
> > > 
> > > There's a possibility to do some processing on the GPU, and it's implemented here:
> > > 
> > > https://gitlab.com/tui/tui/-/tree/master/icam?ref_type=heads
> > > 
> > > but that hits the same problem in the end -- data is in DMA-BUF,
> > > uncached, and takes way too long to copy out.
> > > 
> > > And that's ... wrong. DMA ended seconds ago, complete cache flush
> > > would be way cheaper than copying single frame out, and I still have
> > > to deal with uncached frames.
> > > 
> > > So I have two questions:
> > > 
> > > 1) Is my analysis correct that, no matter how I get frame from v4l and
> > > process it on GPU, I'll have to copy it from uncached memory in the
> > > end?
> > 
> > If you need to touch the buffers using the CPU then you are either
> > stuck with uncached memory or you need to implement bracketed access to
> > do the necessary cache maintenance. Be aware that completely flushing
> > the cache is not really an option, as that would impact other
> > workloads, so you have to flush the cache by walking the virtual
> > address space of the buffer, which may take a significant amount of CPU
> > time.
> 
> What kind of "significant amount of CPU time" are we talking here?
> Millisecond?

It really depends on the platform, the type of cache, and the size of
the buffer. I remember that back in the N900 days a selective cache clean
of a large buffer for full-resolution images took several dozen
milliseconds, possibly close to 100ms. We had to clean the whole D-cache
to make it fast enough, but you can't always do that as Lucas mentioned.
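
For reference, the bracketed access Lucas describes is exposed to
userspace through DMA_BUF_IOCTL_SYNC from <linux/dma-buf.h>. Below is a
minimal sketch, assuming the dma-buf fd has already been mmap()ed by the
caller. The helper names dmabuf_sync() and copy_frame_bracketed() are
made up for illustration. Note that the ioctl only performs whatever
cache maintenance the exporter implements; whether the CPU mapping is
cached in the first place is also up to the exporter, so this alone may
not make the copy any faster.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Issue one DMA_BUF_IOCTL_SYNC call, retrying if it was interrupted. */
int dmabuf_sync(int fd, __u64 flags)
{
	struct dma_buf_sync sync = { .flags = flags };
	int ret;

	do {
		ret = ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);
	} while (ret == -1 && (errno == EINTR || errno == EAGAIN));

	return ret;
}

/*
 * Hypothetical helper: copy len bytes out of a dma-buf that the caller
 * has already mmap()ed at map. The CPU read is bracketed by SYNC_START
 * and SYNC_END so the exporter can do its cache maintenance.
 * Returns 0 on success, -1 (with errno set) if either sync call fails.
 */
int copy_frame_bracketed(int fd, const void *map, void *dst, size_t len)
{
	if (dmabuf_sync(fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ) < 0)
		return -1;

	memcpy(dst, map, len);

	if (dmabuf_sync(fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ) < 0)
		return -1;

	return 0;
}
```

Timing copy_frame_bracketed() against a plain memcpy() from the same
mapping would show how much the maintenance costs on a given platform.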

> Bracketed access is fine with me.
> 
> Flushing the cache should be an option. I'm root, there's no other
> significant workload, and copying out the buffer takes 200msec+. There
> are a lot of cache flushes that can be done in a quarter of a second!
> 
> > However, if you are only going to use the buffer with the GPU I see no
> > reason to touch it from the CPU side. Why would you even need to copy
> > the content? After all dma-bufs are meant to enable zero-copy between
> > DMA capable accelerators. You can simply import the V4L2 buffer into a
> > GL texture using EGL_EXT_image_dma_buf_import. Using this path you
> > don't need to bother with the cache at all, as the GPU will directly
> > read the video buffers from RAM.
> 
> Yes, so the GPU will read the video buffer from RAM, then debayer it,
> and then what? Then I need to store the data into a raw file, or use
> the CPU to turn it into a JPEG file, or maybe run a video encoder on
> it. Those are all tasks that are done on the CPU...

-- 
Regards,

Laurent Pinchart