[PATCH] RFC: dma-buf: userspace mmap support

Fri Mar 16 10:24:17 PDT 2012

> From: Rob Clark <rob at ti.com>
> 
> Enable optional userspace access to dma-buf buffers via mmap() on the
> dma-buf file descriptor.  Userspace access to the buffer should be
> bracketed with DMA_BUF_IOCTL_{PREPARE,FINISH}_ACCESS ioctl calls to
> give the exporting driver a chance to deal with cache synchronization
> and such for cached userspace mappings without resorting to page
> faulting tricks.  The reasoning behind this is that, while drm
> drivers tend to have all the mechanisms in place for dealing with
> page faulting tricks, other driver subsystems may not.  And in
> addition, while page faulting tricks make userspace simpler, there
> are some associated overheads.

Speaking for the ARM Mali T6xx driver point of view, this API looks
good for us. Our use-case for mmap is glReadPixels and
glTex[Sub]Image2D on buffers the driver has imported via dma_buf. In
the case of glReadPixels, the finish ioctl isn't strictly necessary
as the CPU won't have written to the buffer and so doesn't need
flushing. As such, we'd get an additional cache flush which isn't
really necessary. But hey, it's glReadPixels - it's supposed to be
slow. :-)

I think requiring the finish ioctl in the API contract is a good
idea, even if the CPU has only done a ro access as it allows future
enhancements*. To "fix" the unnecessary flush in glReadPixels, I
think we'd like to keep the finish but see an "access type"
parameter added to prepare ioctl indicating if the access is ro or
rw to allow the cache flush in finish to be skipped if the access
was ro. As Rebecca says, a debug feature could even be added to
re-map the pages as ro in prepare(ro) to catch naughty accesses. I'd
also go as far as to say the debug feature should completely unmap
the pages after finish too. Though for us, both the access-type
parameter and debug features are "nice to haves" - we can make
progress with the code as it currently stands (assuming exporters
start using the API that is).

Something which also came up when discussing internally is the topic
of mmap APIs of the importing device driver. For example, I believe
DRM has an mmap API on GEM buffer objects. If a new dma_buf import
ioctl was added to GEM (maybe the PRIME patches already add this),
how would GEM's bo mmap API work?

* Future enhancements: The prepare/finish bracketing could be used
as part of a wider synchronization scheme with other devices.
E.g. If another device was writing to the buffer, the prepare ioctl
could block until that device had finished accessing that buffer.
In the same way, another device could be blocked from accessing that
buffer until the client process called finish. We have already
started playing with such a scheme in the T6xx driver stack we're
terming "kernel dependency system". In this scheme each buffer has a
FIFO of "buffer consumers" waiting to access a buffer. The idea
being that a "buffer consumer" is fairly abstract and could be any
device or userspace process participating in the synchronization
scheme. Examples would be GPU jobs, display controller "scan-out"
jobs, etc.

So for example, a userspace application could dispatch a GPU
fragment shading job into the GPU's kernel driver which will write
to a KMS scanout buffer. The application then immediately issues a
drm_mode_crtc_page_flip ioctl on the display controller's DRM driver
to display the soon-to-be-rendered buffer. Inside the kernel, the
GPU driver adds the fragment job to the dma_buf's FIFO. As the FIFO
was empty, dma_buf calls into the GPU kernel driver to tell it it
"owns" access to the buffer and the GPU driver schedules the job to
run on the GPU. Upon receiving the drm_mode_crtc_page_flip ioctl,
the DRM/KMS driver adds a scan-out job to the buffer's FIFO.
However, the FIFO already has the GPU's fragment shading job in it
so nothing happens until the GPU job completes. When the GPU job
completes, the GPU driver calls into dma_buf to mark its job
complete. dma_buf then takes the next job in its FIFO which the KMS
driver's scanout job, calls into the KMS driver to schedule the
pageflip. The result? A buffer gets scanned out as soon as it has
finished being rendered without needing a round-trip to userspace.
Sure, there are easier ways to achieve that goal, but the idea is
that the mechanism can be used to synchronize access across multiple
devices, which makes it useful for lots of other use-cases too.

As I say, we have already implemented something which works as I
describe but where the buffers are abstract resources not linked to
dma_buf. I'd like to discuss the finer points of the mechanisms
further, but if it's looking like there's interest in this approach
we'll start re-writing the code we have to sit on-top of dma_buf
and posting it as RFCs to the various lists. The intention is to get
this to mainline, if mainline wants it. :-)

Personally, what I particularly like about this approach to
synchronization is that it doesn't require any interfaces to be
modified. I think/hope that makes it easier to port existing drivers
and sub-systems to take advantage of it. The buffer itself is the
synchronization object and interfaces already pass buffers around so
don't need modification. There are of course some limitations with
this approach, the main one we can think of being that it can't
really be used for A/V sync. It kinda assumes "jobs" in the FIFO
should be run as soon as the preceding job completes, which isn't
the case when streaming real-time video. Though nothing precludes
more explicit sync objects being used in conjunction with this
approach.

Cheers,

Tom