Try to address the DMA-buf coherency problem

Wed Nov 2 11:39:20 UTC 2022

Hi Christian,

going to reply in more detail when I have some more time, so just some
quick thoughts for now.

Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:
> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
> > [SNIP]
> > > > But the client is just a video player. It doesn't understand how to
> > > > allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> > > > allocator (again ...), 'just allocate on the GPU' isn't a useful
> > > > response to the client.
> > > Well exactly that's the point I'm raising: The client *must* understand
> > > that!
> > > 
> > > See we need to be able to handle all restrictions here, coherency of the
> > > data is just one of them.
> > > 
> > > For example the much more important question is the location of the data
> > > and for this allocating from the V4L2 device is in most cases just not
> > > going to fly.
> > It feels like this is a generic statement and there is no reason it could not be
> > the other way around.
> 
> And exactly that's my point. You always need to look at both ways to 
> share the buffer and can't assume that one will always work.
> 
> As far as I can see it you guys just allocate a buffer from a V4L2 
> device, fill it with data and send it to Wayland for displaying.
> 
> To be honest I'm really surprised that the Wayland guys hasn't pushed 
> back on this practice already.
> 
> This only works because the Wayland as well as X display pipeline is 
> smart enough to insert an extra copy when it find that an imported 
> buffer can't be used as a framebuffer directly.
> 
With bracketed access you could even make this case work, as the dGPU
would be able to slurp a copy of the dma-buf into LMEM for scanout.

> >   I have colleague who integrated PCIe CODEC (Blaize Xplorer
> > X1600P PCIe Accelerator) hosting their own RAM. There was large amount of ways
> > to use it. Of course, in current state of DMABuf, you have to be an exporter to
> > do anything fancy, but it did not have to be like this, its a design choice. I'm
> > not sure in the end what was the final method used, the driver isn't yet
> > upstream, so maybe that is not even final. What I know is that there is various
> > condition you may use the CODEC for which the optimal location will vary. As an
> > example, using the post processor or not, see my next comment for more details.
> 
> Yeah, and stuff like this was already discussed multiple times. Local 
> memory of devices can only be made available by the exporter, not the 
> importer.
> 
> So in the case of separated camera and encoder you run into exactly the 
> same limitation that some device needs the allocation to happen on the 
> camera while others need it on the encoder.
> 
> > > The more common case is that you need to allocate from the GPU and then
> > > import that into the V4L2 device. The background is that all dGPUs I
> > > know of need the data inside local memory (VRAM) to be able to scan out
> > > from it.
> > The reality is that what is common to you, might not be to others. In my work,
> > most ARM SoC have display that just handle direct scannout from cameras and
> > codecs.
> 
> > The only case the commonly fails is whenever we try to display UVC
> > created dmabuf,
> 
> Well, exactly that's not correct! The whole x86 use cases of direct 
> display for dGPUs are broken because media players think they can do the 
> simple thing and offload all the problematic cases to the display server.
> 
> This is absolutely *not* the common use case you describe here, but 
> rather something completely special to ARM.

It the normal case for a lot of ARM SoCs. That world is certainly not
any less big than the x86 dGPU world. A huge number of devices are ARM
based set-top boxes and other video players. Just because it is a
special case for you doesn't mean it's a global special case.

> 
> >   which have dirty CPU write cache and this is the type of thing
> > we'd like to see solved. I think this series was addressing it in principle, but
> > failing the import and the raised point is that this wasn't the optimal way.
> > 
> > There is a community project called LibreELEC, if you aren't aware, they run
> > Khodi with direct scanout of video stream on a wide variety of SoC and they use
> > the CODEC as exporter all the time. They simply don't have cases were the
> > opposite is needed (or any kind of remote RAM to deal with). In fact, FFMPEG
> > does not really offer you any API to reverse the allocation.
> 
> Ok, let me try to explain it once more. It sounds like I wasn't able to 
> get my point through.
> 
> That we haven't heard anybody screaming that x86 doesn't work is just 
> because we handle the case that a buffer isn't directly displayable in 
> X/Wayland anyway, but this is absolutely not the optimal solution.
> 
> The argument that you want to keep the allocation on the codec side is 
> completely false as far as I can see.
> 
> We already had numerous projects where we reported this practice as bugs 
> to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.
> 
And on a lot of ARM SoCs it's exactly the right thing to do. Many
codecs need contiguous memory there, so importing a scatter-gather
buffer from the GPU via dma-buf will simply not work.

> This is just a software solution which works because of coincident and 
> not because of engineering.

By mandating a software fallback for the cases where you would need
bracketed access to the dma-buf, you simply shift the problem into
userspace. Userspace then creates the bracket by falling back to some
other import option that mostly do a copy and then the appropriate
cache maintenance.

While I understand your sentiment about the DMA-API design being
inconvenient when things are just coherent by system design, the DMA-
API design wasn't done this way due to bad engineering, but due to the
fact that performant DMA access on some systems just require this kind
of bracketing.

Regards,
Lucas