DMABuf/Wayland buffer sharing

Martin Stransky stransky at redhat.com
Fri Aug 2 10:04:07 UTC 2019


On 8/2/19 10:54 AM, Pekka Paalanen wrote:
> On Thu, 1 Aug 2019 19:02:57 +0200
> Martin Stransky <stransky at redhat.com> wrote:
> 
> Hi Martin,
> 
> I'd like to ask if we can discuss technical topics in public,
> e.g. this would be on topic for wayland-devel@ mailing list. The answers
> may benefit more people, and OTOH I don't know everything. :-)
> 
> If you are fine with that, please copy this whole email to
> wayland-devel@ when you reply.

Hi Pekka,
sure, cc'ing the list. I'll ask directly there next time.

> Ponderings below.
> 
>> I'm implementing DMABuf-backed HW buffers for Firefox and I wonder if
>> you can give me some advice regarding it, as I have some difficulties
>> with the $SUBJ.
>>
>> I implemented basic dmabuf rendering (allocate/create a dmabuf, draw
>> into it, bind it as a wl_buffer and send it to the compositor); that
>> code lives at
>>
>> https://searchfox.org/mozilla-central/source/widget/gtk/WaylandDMABufSurface.cpp
>>
>> and seems to be working somehow. Now comes the difficult part - I need
>> to map the dmabuf to CPU memory, draw into it with Skia, and then send
>> it to a different process and make an EGLImage/GL texture from it there.
>>
>> What's the best way to do that? I tried gbm_bo_import() with
>> GBM_BO_IMPORT_FD (I used the fd which was returned from
>> gbm_bo_get_fd()) but that fails with "invalid argument" although all
>> params seem to be sane.
>>
>> Do I need to configure/export the gbm object somehow, and is that
>> supposed to work? Or shall I use DRM PRIME for it?
> 
> Nowadays, all dmabufs *should* be mmappable for CPU access. You'd do it
> by just mmap() on the dmabuf fd. E.g. gbm_bo_get_fd() gives you a
> dmabuf fd. Of course, you'll have to ensure it's in linear format or
> you get to deal with tiling manually.

That was caused by a wrong fd which got malformed inside the Firefox 
machinery; gbm_bo_import() works now.
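
For the record, the import now looks roughly like this (a trimmed 
sketch; fd/width/height/stride stand for whatever the exporting side 
used, error handling omitted):

  #include <gbm.h>

  /* fd was obtained from gbm_bo_get_fd() on the exporting side and
   * passed over a unix socket (SCM_RIGHTS). */
  struct gbm_import_fd_data data = {
      .fd     = fd,
      .width  = width,
      .height = height,
      .stride = stride,
      .format = GBM_FORMAT_ARGB8888,
  };
  struct gbm_bo *bo = gbm_bo_import(gbm_device, GBM_BO_IMPORT_FD,
                                    &data, GBM_BO_USE_RENDERING);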

> Note that some pixel formats or modifiers may imply multiple dmabufs
> per image, so make sure you don't hit those cases - if you limit to the
> usual RGBA formats and the linear modifier, you're fine. YUV less so.

Yes, that's my case.

> An important detail is that you *must* use DMA_BUF_IOCTL_SYNC to
> bracket your actual CPU read/write sequences to the dmabuf. That ioctl
> will ensure the appropriate caches are flushed correctly (you might not
> notice anything wrong on x86, but on other hardware forgetting to do
> that can randomly result in bad data), and I think it also waits for
> implicit fences (e.g. if you had the GPU write to the dmabuf earlier, to
> ensure the operation finished).
> 
> I'm not completely sure if all DRM drivers everywhere already support
> mmapping the dmabufs they created.
> 
> The other catch is that "casual" CPU access could be extremely slow.
> Think about uncached memory through PCI bus or something. That's why it
> is usually avoided as much as possible. Definitely a bad idea to do
> read-modify-write cycles to it (that is, blending) in general.

The main goal is to use it for video frame export & rendering in 
Firefox, so there should not be any reads.
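
Good point about the sync ioctl. The CPU write path would then be 
bracketed roughly like this (a sketch, assuming a single linear plane; 
dmabuf_fd and size are placeholders):

  #include <sys/mman.h>
  #include <sys/ioctl.h>
  #include <linux/dma-buf.h>

  /* size == stride * height for a single linear plane */
  void *data = mmap(NULL, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, dmabuf_fd, 0);

  struct dma_buf_sync sync = {
      .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE
  };
  ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

  /* ... Skia writes the video frame into 'data' ... */

  sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
  ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
  munmap(data, size);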

> It all depends on where you allocated the buffer, how the respective
> driver works, and how the hardware works. If it comes from a discrete
> GPU driver, the above caveats are likely. If it's an IGP, you might have an
> easier time. If it's the VGEM driver, I guess that could be nice. But
> one cannot really know until you benchmark the exact system you're
> running on.
> 
> If you end up having to use a shadow buffer for the CPU rendering
> anyway, it might be best to let the OpenGL driver worry about getting
> the data to the GPU (glTexImage2D, or some extension that does not imply
> an extra CPU copy like glTexImage2D does).
> 
>> Another option may be to create EGLImage on top of the buffer and send
>> the EGLImage to the render process.
> 
> Let's take a step back to see if I understand your use case correctly.
> 
> You want to do software rendering into a buffer, pass that buffer to
> another process, and then have a GPU directly texture from the buffer.
> Is that correct?

Yes, that's correct. The software rendering into the buffer comes from 
the video decoder in the content process; the buffer is then passed to 
the render process, bound as a texture, and drawn/composited by EGL to 
the screen. Yes, zero-copy is the ultimate goal here.
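
On the render process side the plan is the usual 
EGL_EXT_image_dma_buf_import path, roughly like this (a sketch for a 
single linear ARGB plane; dmabuf_fd, stride and egl_display are 
placeholders, and eglCreateImageKHR()/glEGLImageTargetTexture2DOES() 
are resolved via eglGetProcAddress() in real code):

  #include <EGL/egl.h>
  #include <EGL/eglext.h>
  #include <GLES2/gl2.h>
  #include <GLES2/gl2ext.h>
  #include <drm_fourcc.h>

  EGLint attribs[] = {
      EGL_WIDTH,                     width,
      EGL_HEIGHT,                    height,
      EGL_LINUX_DRM_FOURCC_EXT,      DRM_FORMAT_ARGB8888,
      EGL_DMA_BUF_PLANE0_FD_EXT,     dmabuf_fd,
      EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
      EGL_DMA_BUF_PLANE0_PITCH_EXT,  stride,
      EGL_NONE
  };
  EGLImageKHR image = eglCreateImageKHR(egl_display, EGL_NO_CONTEXT,
                                        EGL_LINUX_DMA_BUF_EXT,
                                        NULL, attribs);

  /* Bind the image as a GL texture in the render process. */
  GLuint tex;
  glGenTextures(1, &tex);
  glBindTexture(GL_TEXTURE_2D, tex);
  glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, (GLeglImageOES)image);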

> If so, my knowledge of that is hazy. I'm not sure there even exists one
> zero-copy solution that is supposed to both work everywhere and be
> fairly performant. GPUs can be very picky on what they can texture
> from. You need to allocate the buffer to suit the GPU, but sometimes
> that is in direct conflict with wanting to have efficient CPU access to
> it.
> 
> I'm not sure what to recommend.

Mozilla aims to support recent Intel drivers first, so it's fine if it 
works on that subset of HW. The other ones can use the existing SW 
rendering path which is used now.

> FWIW, I've been working on improving the performance of DisplayLink
> devices on Mutter. There is a stack of fallbacks, and I haven't
> finished implementing the zero-copy case itself yet. In all
> cases, the GPU is rendering the image, and the DisplayLink device (DL)
> is display-only and needs CPU access to the buffer as it is a virtual
> DRM driver.
> 
> - zero copy by either allocating on the GPU and importing to DL, or
>    allocating on DL and importing to GPU;
> 
> - GPU copy from a temporary GPU buffer into a DL buffer imported to GPU;
> 
> - CPU copy (glReadPixels) from a temporary GPU buffer into a mmapped DL
>    buffer.
> 
> The trouble is that each "import" may fail for whatever reason, and I
> must have a fallback.
> 
> My case is the opposite of your case: I have GPU writing and CPU
> reading, you have CPU writing and GPU reading.
> 
> 
>> Also I wonder if it's feasible to use any modifiers, as I need a
>> plain/linear buffer for Skia to draw into. I suspect that when I create
>> the buffer with modifiers and then map it to CPU memory for SW drawing,
>> an intermediate buffer is created and the pixels are then de-composed
>> back to GPU memory.
> 
> No, I don't believe there is any kind of intermediate buffer behind the
> scenes in GBM or dmabuf ioctls/mmap. OpenGL drivers may use
> intermediate buffers. Some EGLImage operations are allowed to create
> more buffers behind the scenes but I think implementations try to avoid
> it. Copies are bad for performance, and implicit copies are unexpected
> performance bottle-necks. So yes, I believe you very much need to
> ensure the buffer gets allocated as linear from the start.
> 
> Some hardware may have hardware tiling units that may be able to
> represent a linear CPU view into a tiled buffer, but I know very little
> of that. I think they might need driver-specific ioctls to use or
> something, and are a scarce resource.
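
Right, we already ask for a linear layout at allocation time, 
something like (a sketch; gbm_device and the sizes are placeholders):

  #include <gbm.h>

  /* Request a CPU-mappable linear layout up front so no detiling
   * step is needed later. */
  struct gbm_bo *bo =
      gbm_bo_create(gbm_device, width, height, GBM_FORMAT_ARGB8888,
                    GBM_BO_USE_LINEAR | GBM_BO_USE_RENDERING);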

Thanks,
ma.

> Thanks,
> pq
> 


-- 
Martin Stransky
Software Engineer / Red Hat, Inc

