Try to address the DMA-buf coherency problem

Tue Nov 1 21:09:08 UTC 2022

On Tuesday, 1 November 2022 at 18:40 +0100, Christian König wrote:
> On 28.10.22 at 20:47, Daniel Stone wrote:
> > Hi Christian,
> > 
> > On Fri, 28 Oct 2022 at 18:50, Christian König
> > <ckoenig.leichtzumerken at gmail.com> wrote:
> > > On 28.10.22 at 17:46, Nicolas Dufresne wrote:
> > > > Though, its not generically possible to reverse these roles. If you want to do
> > > > so, you endup having to do like Android (gralloc) and ChromeOS (minigbm),
> > > > because you will have to allocate DRM buffers that knows about importer specific
> > > > requirements. See link [1] for what it looks like for RK3399, with Motion Vector
> > > > size calculation copied from the kernel driver into a userspace lib (arguably
> > > > that was available from V4L2 sizeimage, but this is technically difficult to
> > > > communicate within the software layers). If you could let the decoder export
> > > > (with proper cache management) the non-generic code would not be needed.
> > > Yeah, but I can also reverse the argument:
> > > 
> > > Getting the parameters for V4L right so that we can share the image is
> > > tricky, but getting the parameters so that the stuff is actually
> > > directly displayable by GPUs is even trickier.
> > > 
> > > Essentially you need to look at both sides and interference to get to a
> > > common ground, e.g. alignment, pitch, width/height, padding, etc.....
> > > 
> > > Deciding from which side to allocate from is just one step in this
> > > process. For example most dGPUs can't display directly from system
> > > memory altogether, but it is possible to allocate the DMA-buf through
> > > the GPU driver and then write into device memory with P2P PCI transfers.
> > > 
> > > So as far as I can see switching importer and exporter roles and even
> > > having performant extra fallbacks should be a standard feature of userspace.
> > > 
> > > > Another case where reversing the role is difficult is for case where you need to
> > > > multiplex the streams (let's use a camera to illustrate) and share that with
> > > > multiple processes. In these uses case, the DRM importers are volatile, which
> > > > one do you abuse to do allocation from ? In multimedia server like PipeWire, you
> > > > are not really aware if the camera will be used by DRM or not, and if something
> > > > "special" is needed in term of role inversion. It is relatively easy to deal
> > > > with matching modifiers, but using downstream (display/gpu) as an exporter is
> > > > always difficult (and require some level of abuse and guessing).
> > > Oh, very good point! Yeah we do have use cases for this where an input
> > > buffer is both displayed as well as encoded.
> > This is the main issue, yeah.
> > 
> > For a standard media player, they would try to allocate through V4L2
> > and decode through that into locally-allocated buffers. All they know
> > is that there's a Wayland server at the other end of a socket
> > somewhere which will want to import the FD. The server does give you
> > some hints along the way: it will tell you that importing into a
> > particular GPU target device is necessary as the ultimate fallback,
> > and importing into a particular KMS device is preferable as the
> > optimal path to hit an overlay.
> > 
> > So let's say that the V4L2 client does what you're proposing: it
> > allocates a buffer chain, schedules a decode into that buffer, and
> > passes it along to the server to import. The server fails to import
> > the buffer into the GPU, and tells the client this. The client then
> > ... well, it doesn't know that it needs to allocate within the GPU
> > instead, but it knows that doing so might be one thing which would
> > make the request succeed.
> > 
> > But the client is just a video player. It doesn't understand how to
> > allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> > allocator (again ...), 'just allocate on the GPU' isn't a useful
> > response to the client.
> 
> Well exactly that's the point I'm raising: The client *must* understand 
> that!
> 
> See we need to be able to handle all restrictions here, coherency of the 
> data is just one of them.
> 
> For example the much more important question is the location of the data 
> and for this allocating from the V4L2 device is in most cases just not 
> going to fly.

This feels like a generic statement, and there is no reason it could not be the
other way around. I have a colleague who integrated a PCIe CODEC (Blaize Xplorer
X1600P PCIe Accelerator) that hosts its own RAM. There were a large number of
ways to use it. Of course, in the current state of DMABuf, you have to be the
exporter to do anything fancy, but it did not have to be like this; it's a
design choice. I'm not sure which method was used in the end; the driver isn't
upstream yet, so maybe that is not even final. What I do know is that there are
various conditions under which you may use the CODEC, and the optimal location
will vary between them. One example is whether or not the post processor is
used; see my next comment for more details.

> 
> The more common case is that you need to allocate from the GPU and then 
> import that into the V4L2 device. The background is that all dGPUs I 
> know of need the data inside local memory (VRAM) to be able to scan out 
> from it.

The reality is that what is common to you might not be common to others. In my
work, most ARM SoCs have displays that simply handle direct scanout from cameras
and codecs. The only case that commonly fails is when we try to display a
UVC-created dmabuf, which has dirty CPU write caches, and this is the type of
thing we'd like to see solved. I think this series addressed it in principle,
but by failing the import, and the point raised is that this isn't the optimal
way.

There is a community project called LibreELEC, in case you aren't aware of it.
They run Kodi with direct scanout of video streams on a wide variety of SoCs,
and they use the CODEC as the exporter all the time. They simply don't have
cases where the opposite is needed (or any kind of remote RAM to deal with). In
fact, FFmpeg does not really offer any API to reverse the allocation.

> 
> > I fully understand your point about APIs like Vulkan not sensibly
> > allowing bracketing, and that's fine. On the other hand, a lot of
> > extant usecases (camera/codec -> GPU/display, GPU -> codec, etc) on
> > Arm just cannot fulfill complete coherency. On a lot of these
> > platforms, despite what you might think about the CPU/GPU
> > capabilities, the bottleneck is _always_ memory bandwidth, so
> > mandating extra copies is an absolute non-starter, and would instantly
> > cripple billions of devices. Lucas has been pretty gentle, but to be
> > more clear, this is not an option and won't be for at least the next
> > decade.
> 
> Well x86 pretty much has the same restrictions.
> 
> For example the scanout buffer is usually always in local memory because 
> you often scan out at up to 120Hz while your recording is only 30fps and 
> most of the time lower resolution.
> 
> Pumping all that data 120 times a second over the PCIe bus would just not 
> be doable in a lot of use cases.

This is a good point for this case. Though, the effect of using remote RAM in a
CODEC can be very dramatic. In some cases, the buffers you are going to display
are what we call reference frames. That means that while you are displaying
them, the CODEC needs to read from them in order to construct the following
frames. Most of the time, reads are massively slower with remote RAM, and
over-uploading, like you describe here, is going to be the optimal path.

Note that in some other cases, the buffers are what we call secondary buffers,
which are the output of a post processor embedded in the CODEC. In that case,
remote RAM may be fine; it really depends on the write speed (though writes are
usually really good compared to reads). So yes, in the case of a high refresh
rate, a CODEC with a post processor may do a better job (if you have a single
display path for that buffer).

p.s. Note that the reason we support reference frame display even when secondary
buffers are possible is often that we are limited on memory bandwidth. For,
let's say, 4K60, a secondary render will require an extra 18gb of writes (on the
best platforms; it might require equivalent reads on other platforms, like
Mediatek, Samsung Exynos and some others).
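
To illustrate how this kind of extra-write cost is usually estimated, here is a
rough sketch. The 4:2:0 subsampling, the sample depths and the assumption that
samples wider than 8 bits sit in 16-bit containers are mine, for illustration
only; the result depends entirely on the pixel format and is not meant to
reproduce the figure quoted above.

```python
def frame_bytes(width, height, bits_per_sample, chroma_samples_per_luma=0.5):
    """Bytes in one frame: full-resolution luma plus subsampled chroma.

    chroma_samples_per_luma=0.5 corresponds to 4:2:0 (an assumption).
    """
    samples = width * height * (1 + chroma_samples_per_luma)
    # Assumption: samples wider than 8 bits are stored in 16-bit containers.
    bytes_per_sample = 1 if bits_per_sample <= 8 else 2
    return int(samples * bytes_per_sample)


def write_bandwidth_gbit(width, height, fps, bits_per_sample):
    """Extra write traffic in Gbit/s for one additional copy of every frame."""
    return frame_bytes(width, height, bits_per_sample) * fps * 8 / 1e9


for depth in (8, 10):
    gbit = write_bandwidth_gbit(3840, 2160, 60, depth)
    print(f"4K60, {depth}-bit 4:2:0: {gbit:.2f} Gbit/s of extra writes")
```

Whatever the exact format, the cost scales linearly with resolution and frame
rate, and it doubles on platforms where the post processor also has to read the
reference frame back.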
> 
> > So we obviously need a third way at this point, because 'all devices
> > must always be coherent' vs. 'cache must be an unknown' can't work.
> > How about this as a suggestion: we have some unused flags in the PRIME
> > ioctls. Can we add a flag for 'import must be coherent'?
> 
> That's pretty much exactly what my patch set does. It just keeps 
> userspace out of the way and says that creating the initial connection 
> between the devices fails if they can't talk directly with each other.
> 
> Maybe we should move that into userspace so that the involved components 
> know beforehand that a certain approach won't work?

Anything that can be predicted without trial and error is a great idea for sure.
Though, we have to be realistic: there is no guaranteed way to be sure other
than trying. So I would not be too worried. I lacked the time to try it out, but
the current implementation should in theory make software like GStreamer fall
back to memcpy for simple cases like UVC cameras. A very simple GStreamer
pipeline that most folks can run (and that has usually produced artifacts for
years) is:

gst-launch-1.0 v4l2src ! queue ! kmssink

p.s. You need to turn off any DRM master, and it may not work on multi-display
setups. kmssink is very limited, but it does implement a memcpy fallback if the
dmabuf import fails (it's just that the import does not fail today).
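
For readers not familiar with that fallback, the logic amounts to "try the
zero-copy import, copy if it is rejected". Below is a minimal sketch of that
pattern. It is not kmssink's actual code: `DmabufImportError`, `present`,
`strict_import` and the `coherent` flag are hypothetical names I made up to
model an importer that refuses buffers it cannot access coherently.

```python
class DmabufImportError(Exception):
    """Raised by the (hypothetical) sink when it cannot import a buffer."""


def present(frame, import_dmabuf, copy_to_local):
    """Try the zero-copy import first; fall back to a copy if it is rejected.

    Returns the path that was taken so the caller can log or count it.
    """
    try:
        import_dmabuf(frame)
        return "zero-copy"
    except DmabufImportError:
        copy_to_local(frame)
        return "memcpy"


def strict_import(frame):
    """Hypothetical importer that fails instead of showing stale data."""
    if not frame["coherent"]:
        raise DmabufImportError("importer cannot access this buffer coherently")


copied = []
print(present({"coherent": True}, strict_import, copied.append))
print(present({"coherent": False}, strict_import, copied.append))
```

The point of the sketch is that the fallback only helps if the import really
fails; an importer that silently accepts a buffer with dirty CPU caches never
reaches the copy path.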

> 
> > That flag wouldn't be set for the existing ecosystem
> > Lucas/Nicolas/myself are talking about, where we have explicit
> > handover points and users are fully able to perform cache maintenance.
> > For newer APIs where it's not possible to properly express that
> > bracketing, they would always set that flag (unless we add an API
> > carve-out where the client promises to do whatever is required to
> > maintain that).
> > 
> > Would that be viable?
> 
> No, as I said. Explicit handover points are just an absolute no-go. We 
> just have way too many use cases which don't work with that idea.
> 
> As I said we made the same mistake with the DMA-Api and even more 20 
> years later are still running into problems because of that.
> 
> Just try to run any dGPU under a XEN hypervisor with memory 
> fragmentation for a very good example why this is such a bad idea.
> 
> Regards,
> Christian.
> 
> > 
> > Cheers,
> > Daniel
>