Try to address the DMA-buf coherency problem

Nicolas Dufresne nicolas at ndufresne.ca
Thu Nov 3 22:16:13 UTC 2022


On Wednesday, November 2, 2022 at 12:18 +0100, Christian König wrote:
> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
> > [SNIP]
> > > > But the client is just a video player. It doesn't understand how to
> > > > allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> > > > allocator (again ...), 'just allocate on the GPU' isn't a useful
> > > > response to the client.
> > > Well exactly that's the point I'm raising: The client *must* understand
> > > that!
> > > 
> > > See we need to be able to handle all restrictions here, coherency of the
> > > data is just one of them.
> > > 
> > > For example the much more important question is the location of the data
> > > and for this allocating from the V4L2 device is in most cases just not
> > > going to fly.
> > It feels like this is a generic statement and there is no reason it could not be
> > the other way around.
> 
> And exactly that's my point. You always need to look at both ways to 
> share the buffer and can't assume that one will always work.
> 
> As far as I can see it you guys just allocate a buffer from a V4L2 
> device, fill it with data and send it to Wayland for displaying.

That paragraph is a bit sloppy. What exactly do you mean by "you guys"? Normal
users let the V4L2 device allocate and write into its own memory (the device
fills it, not "you guys"). This is done simply because it is guaranteed to work
with the V4L2 device. Most V4L2 devices produce pixel formats and layouts that
are known to userspace, for which userspace knows it can implement a GPU shader
or a software fallback. I have yet to see one of these formats that cannot be
efficiently imported into a modern GPU and converted using shaders. I'm also not
entirely sure what distinguishes a dGPU from a GPU in this discussion, btw.
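To make it concrete, here is a minimal sketch of that "normal user" path,
assuming a capture device; it is illustrative, not a quote from any driver or
application, and error handling is trimmed. The V4L2 driver allocates its own
buffers and userspace only asks for a DMABuf fd to hand to whatever consumer
can import it:

/* Minimal sketch: let the V4L2 driver allocate its capture buffers
 * (V4L2_MEMORY_MMAP), then export one of them as a DMABuf fd that a
 * GPU, display or encoder may try to import. */
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

int export_v4l2_buffer(int video_fd, unsigned int index)
{
    struct v4l2_requestbuffers req;
    struct v4l2_exportbuffer exp;

    memset(&req, 0, sizeof(req));
    req.count = 4;                            /* driver-owned buffers */
    req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    req.memory = V4L2_MEMORY_MMAP;
    if (ioctl(video_fd, VIDIOC_REQBUFS, &req) < 0)
        return -1;

    memset(&exp, 0, sizeof(exp));
    exp.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    exp.index = index;
    exp.flags = O_CLOEXEC;
    if (ioctl(video_fd, VIDIOC_EXPBUF, &exp) < 0)
        return -1;

    return exp.fd;                            /* DMABuf fd for the importer */
}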

In many cases, camera-style V4L2 devices have one producer and many consumers.
Consider a photo application: the stream will likely be captured and displayed
while being encoded by one or more CODECs, and also streamed to a machine-learning
model for analysis. The software complexity of communicating the list of
receiving devices back to the producer, of implementing every non-standard way
they allocate memory, and of working through all the combinations by trial and
error is just ridiculously high. Remember that each GPU has its own allocation
methods and corner cases; this is simply not manageable by "you guys", which I
pretty much assume means everyone writing software for generic Linux these days
(non-Android/ChromeOS).

> 
> To be honest I'm really surprised that the Wayland guys hasn't pushed 
> back on this practice already.
> 
> This only works because the Wayland as well as X display pipeline is 
> smart enough to insert an extra copy when it find that an imported 
> buffer can't be used as a framebuffer directly.

This is a bit inaccurate. The compositors I've worked with (GNOME and Weston)
will only memcpy SHM buffers. For DMABuf, they will fail the import if the
buffer is not usable by either the display or the GPU. On the GPU side in
particular (which is the ultimate compositor fallback), efficient HW copy
mechanisms may exist and be used, and this is fine: unlike your scanout example,
the buffer won't be uploaded over and over; later re-display is done from a
remote copy (or a transformed copy). Or, if you prefer, it is cached at the cost
of higher memory usage.
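For reference, this is roughly what the import attempt looks like on the
compositor's GPU fallback path; a sketch assuming a single-plane format and
EGL 1.5 with EGL_EXT_image_dma_buf_import, not code from any particular
compositor:

/* Sketch: try to import a client DMABuf into EGL. If the GPU driver
 * cannot use the buffer, EGL_NO_IMAGE is returned and the compositor
 * fails the wl_buffer import instead of silently copying. */
#include <EGL/egl.h>
#include <EGL/eglext.h>

EGLImage try_import_dmabuf(EGLDisplay dpy, int dmabuf_fd,
                           EGLAttrib width, EGLAttrib height,
                           EGLAttrib drm_fourcc, EGLAttrib stride)
{
    const EGLAttrib attrs[] = {
        EGL_WIDTH, width,
        EGL_HEIGHT, height,
        EGL_LINUX_DRM_FOURCC_EXT, drm_fourcc,
        EGL_DMA_BUF_PLANE0_FD_EXT, dmabuf_fd,
        EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
        EGL_DMA_BUF_PLANE0_PITCH_EXT, stride,
        EGL_NONE
    };

    return eglCreateImage(dpy, EGL_NO_CONTEXT, EGL_LINUX_DMA_BUF_EXT,
                          NULL, attrs);
}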

I think it would be preferable to speak about device-to-device sharing, since
"V4L2 vs GPU" is not really representative of the problem. "V4L2 vs GPU" and
"you guys" simply contribute to the never-ending and needless friction around
the difficulty that exists with the current support for memory sharing in Linux.

> 
> >   I have colleague who integrated PCIe CODEC (Blaize Xplorer
> > X1600P PCIe Accelerator) hosting their own RAM. There was large amount of ways
> > to use it. Of course, in current state of DMABuf, you have to be an exporter to
> > do anything fancy, but it did not have to be like this, its a design choice. I'm
> > not sure in the end what was the final method used, the driver isn't yet
> > upstream, so maybe that is not even final. What I know is that there is various
> > condition you may use the CODEC for which the optimal location will vary. As an
> > example, using the post processor or not, see my next comment for more details.
> 
> Yeah, and stuff like this was already discussed multiple times. Local 
> memory of devices can only be made available by the exporter, not the 
> importer.
> 
> So in the case of separated camera and encoder you run into exactly the 
> same limitation that some device needs the allocation to happen on the 
> camera while others need it on the encoder.
> 
> > > The more common case is that you need to allocate from the GPU and then
> > > import that into the V4L2 device. The background is that all dGPUs I
> > > know of need the data inside local memory (VRAM) to be able to scan out
> > > from it.
> > The reality is that what is common to you, might not be to others. In my work,
> > most ARM SoC have display that just handle direct scannout from cameras and
> > codecs.
> 
> > The only case the commonly fails is whenever we try to display UVC
> > created dmabuf,
> 
> Well, exactly that's not correct! The whole x86 use cases of direct 
> display for dGPUs are broken because media players think they can do the 
> simple thing and offload all the problematic cases to the display server.
> 
> This is absolutely *not* the common use case you describe here, but 
> rather something completely special to ARM.

sigh .. The UVC failure was first discovered on my Intel PC and later reproduced
on ARM. Userspace expected the drivers (V4L2 exports, DRM imports) to reject the
DMABuf import (I kind of know, I wrote that part). From a userspace point of
view, unlike what you state here, there was no fault. You already said that the
importer / exporter roles are to be tried, and the order in which you try them
should not matter. So yes, today's userspace may lack the ability to flip the
roles, but at least it tries, and if the driver does not fail the import, you
can't blame userspace for trying to achieve decent performance.

I'd like to point out that these are clearly all kernel bugs, and we cannot
state that kernel drivers "are broken because of media players". The very fact
that this thread starts from a kernel change kind of proves it. It would also be
nice for you to understand that I'm not against the method used in this
patchset, but I'm not against a bracketing mechanism either, as I think the
latter can improve performance, where the former only gives more "correct"
results.

> 
> >   which have dirty CPU write cache and this is the type of thing
> > we'd like to see solved. I think this series was addressing it in principle, but
> > failing the import and the raised point is that this wasn't the optimal way.
> > 
> > There is a community project called LibreELEC, if you aren't aware, they run
> > Khodi with direct scanout of video stream on a wide variety of SoC and they use
> > the CODEC as exporter all the time. They simply don't have cases were the
> > opposite is needed (or any kind of remote RAM to deal with). In fact, FFMPEG
> > does not really offer you any API to reverse the allocation.
> 
> Ok, let me try to explain it once more. It sounds like I wasn't able to 
> get my point through.
> 
> That we haven't heard anybody screaming that x86 doesn't work is just 
> because we handle the case that a buffer isn't directly displayable in 
> X/Wayland anyway, but this is absolutely not the optimal solution.

Basically, you are complaining that compositors will use GPU shaders to adapt
the buffers for the display. Most displays have no or only limited YUV support,
and flipping the roles or bracketing won't help with that. Using a GPU shader to
adapt the buffer, as compositors and userspace do, seems all right. And yes,
sometimes the memory will be imported into the GPU very efficiently, sometimes
with mid-range efficiency, and other times some GPU stack (which is userspace)
will memcpy. But remember that the GPU stack is programmed to work with a
specific GPU, not by the higher-level userland.
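For illustration, this is the kind of shader-based adaptation I mean; a sketch
assuming NV12 input and full-range BT.601 coefficients, not taken from any
particular compositor:

/* Sketch: fragment shader a compositor may fall back to when the display
 * cannot scan out the YUV buffer directly. It samples the two NV12 planes
 * and converts to RGB (full-range BT.601 assumed for brevity). */
static const char *nv12_to_rgb_frag =
    "precision mediump float;\n"
    "varying vec2 v_texcoord;\n"
    "uniform sampler2D u_plane_y;\n"
    "uniform sampler2D u_plane_uv;\n"
    "void main() {\n"
    "    float y = texture2D(u_plane_y, v_texcoord).r;\n"
    "    vec2 uv = texture2D(u_plane_uv, v_texcoord).rg - vec2(0.5);\n"
    "    gl_FragColor = vec4(y + 1.402 * uv.y,\n"
    "                        y - 0.344 * uv.x - 0.714 * uv.y,\n"
    "                        y + 1.772 * uv.x, 1.0);\n"
    "}\n";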

> 
> The argument that you want to keep the allocation on the codec side is 
> completely false as far as I can see.

I haven't made this argument and don't intend to. There is nothing in this
thread that should be interpreted as something I want or don't want. I want the
same thing as everyone on this list: both performance and correct results.

> 
> We already had numerous projects where we reported this practice as bugs 
> to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.

Links? Remember that I read every single bug and email around the GStreamer
project. I maintain both the older and the newer V4L2 support in there. I also
contributed a lot to the mechanism GStreamer has in place to reverse the
allocation. In fact, it is implemented; the problem is that on generic Linux,
receiver elements like the GL elements and the display sinks don't have any API
they can rely on to allocate memory. Thus, they don't implement what we call the
allocation offer in GStreamer terms. Very often though, on other modern OSes, or
with APIs like VA, the memory offer is replaced by a context. So the allocation
is done from a "context" which is neither an importer nor an exporter. This is
mostly found on macOS and Windows.
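The receiver-side hook itself is simple to fill once there is an allocator to
put behind it. A sketch of a sink's propose_allocation using only the generic
system pool (the element and details are illustrative, not from any existing
plugin):

/* Sketch: how a GStreamer sink answers the ALLOCATION query, i.e. the
 * "allocation offer". Today, on generic Linux, there is often nothing
 * better than the generic pool to propose here, which is the gap
 * discussed above. */
#include <gst/gst.h>
#include <gst/base/gstbasesink.h>
#include <gst/video/video.h>

static gboolean
my_sink_propose_allocation (GstBaseSink * sink, GstQuery * query)
{
  GstCaps *caps;
  gboolean need_pool;
  GstVideoInfo info;

  gst_query_parse_allocation (query, &caps, &need_pool);
  if (!caps || !gst_video_info_from_caps (&info, caps))
    return FALSE;

  if (need_pool) {
    /* Ideally this would be a pool backed by the display or GPU device. */
    GstBufferPool *pool = gst_buffer_pool_new ();

    gst_query_add_allocation_pool (query, pool, info.size, 2, 0);
    gst_object_unref (pool);
  }

  gst_query_add_allocation_meta (query, GST_VIDEO_META_API_TYPE, NULL);
  return TRUE;
}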

Were APIs suggested to actually make it manageable for userland to allocate from
the GPU? Yes, this is what the Linux Device Allocator idea is for. Is that API
ready? No.

Can we at least implement some DRM memory allocation? Yes, but remember that,
until very recently, the DRM device used by the display path was not exposed
through Wayland. This has only been resolved recently, and it will take some
time before it propagates through compositor code. And you need the compositor
implementation before you can do the GL and multimedia stack implementations.
Before calling out bad practice by GStreamer and FFMPEG developers, please keep
in mind that getting all the bits and pieces in place requires back and forth,
and there have been huge gaps that these developers have not been able to
overcome yet. Also, remember that these stacks don't have any contract to
support Linux. They support it to the best of their knowledge and capabilities,
along with Windows, macOS, iOS, Android and more. And in my experience, memory
sharing has different challenges on each of these OSes.
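As an illustration of the DRM piece, here is roughly what a generic allocation
from the display device could look like today; a sketch using dumb buffers and
PRIME export, with the node path and pixel format assumed, and covering only the
scanout/CPU-friendly baseline:

/* Sketch: allocate a dumb buffer on the DRM device the compositor
 * advertises and export it as a DMABuf with PRIME. Dumb buffers only
 * promise CPU/scanout use, so this is a baseline, not a universal
 * allocator. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

int alloc_scanout_dmabuf(const char *node, uint32_t width, uint32_t height)
{
    struct drm_mode_create_dumb create;
    int prime_fd;
    int drm_fd = open(node, O_RDWR | O_CLOEXEC);  /* e.g. "/dev/dri/card0" */

    if (drm_fd < 0)
        return -1;

    memset(&create, 0, sizeof(create));
    create.width = width;
    create.height = height;
    create.bpp = 32;                              /* XRGB8888 assumed */
    if (drmIoctl(drm_fd, DRM_IOCTL_MODE_CREATE_DUMB, &create) < 0)
        return -1;

    if (drmPrimeHandleToFD(drm_fd, create.handle, DRM_CLOEXEC, &prime_fd) < 0)
        return -1;

    return prime_fd;                              /* DMABuf fd for the producer */
}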

> 
> This is just a software solution which works because of coincident and 
> not because of engineering.

Another argument I can't really agree with: there is a lot of effort put into
fallbacks (mostly GPU fallbacks) in various software stacks. These fallbacks are
engineered to guarantee that you can display your frames. The case I raised
should have ended well with a GPU/CPU fallback, but a kernel bug broke the
ability to fall back. If the kernel had rejected the import (your series?), or
offered a bracketing mechanism (for the UVC case, both methods would have
worked), the end result would have just worked.

I would not disagree if someone stated that DMABuf support in the UVC driver is
an abuse. The driver simply memcpys chunks of variable-size data streamed by the
USB camera into a normal memory buffer. So why is that exported as a dmabuf? I
don't have a strong opinion on that, but if you think this is wrong, then it
proves my point that this is a kernel bug. The challenge here is to come up with
how we will fix this, and sharing a good understanding of what today's userspace
does, and why it does so, is key to making proper designs. As I said at the
start, writing code for the DMABuf subsystem is out of reach for me; I can only
share what existing software does, and why it does it like this.

Nicolas

> 
> Regards,
> Christian.
> 
> 


