[PATCH] RFC: dma-buf: userspace mmap support

Mon Mar 19 11:44:45 PDT 2012

> -----Original Message-----
> From: Alan Cox [mailto:alan at lxorguk.ukuu.org.uk]
> Sent: 19 March 2012 16:57
> To: Tom Cooksey
> Cc: 'Rob Clark'; linaro-mm-sig at lists.linaro.org; dri-
> devel at lists.freedesktop.org; linux-media at vger.kernel.org;
> rschultz at google.com; Rob Clark; sumit.semwal at linaro.org;
> patches at linaro.org
> Subject: Re: [PATCH] RFC: dma-buf: userspace mmap support
> 
> > If the API was to also be used for synchronization it would have to
> > include an atomic "prepare multiple" ioctl which blocked until all
> > the buffers listed by the application were available. In the same
> 
> Too slow already. You are now serializing stuff while what we want to
> do really is
> 
> 	nobody_else_gets_buffers_next([list])
> 	on available(buffer)
> 		dispatch_work(buffer)
> 
> so that you can maximise parallelism without allowing deadlocks. If
> you've got a high memory bandwidth and 8+ cores the 'stop everything'
> model isn't great.

Yes, sorry I wasn't clear here. By atomic I meant that a job starts
using all buffers at the same time, once they are available. You are
right, a job waiting for a list of buffers to become available should
not prevent other jobs running or queuing new jobs (eughh). We actually
have the option of using asynchronous call-backs in KDS: A driver lists
all the buffers it needs when adding a job and that job gets added to
the FIFO of each buffer as an atomic operation. However, once the job
is added to all the FIFOs, that atomic operation is complete and another
job can be "queued" up. When a job completes, it is removed from each
buffer's FIFO. At that point, all the "next" jobs in each buffer's FIFO
are evaluated to see if they can run. If they can run, the job's
"start" call-back is called. There's also a synchronous mode of
operation where a blocked thread is "woken up" instead of calling a
call-back function. I would imagine it is this synchronous mode that
would be used for user-space access.
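
To make the queueing model above a bit more concrete, here is a rough
user-space sketch in Python. It is only a model of the behaviour
described, not KDS code: the names Buffer, Job, Scheduler, add_job and
complete are invented for illustration, every job is treated as needing
exclusive access, and the "atomic" insertion is simply assumed (a real
implementation would hold a lock around it).

```python
from collections import deque


class Buffer:
    """A shared buffer with a FIFO of the jobs that want to use it."""

    def __init__(self, name):
        self.name = name
        self.fifo = deque()


class Job:
    """A unit of work that needs all of its buffers before it can start."""

    def __init__(self, name, buffers, start):
        self.name = name
        self.buffers = list(buffers)
        self.start = start  # the "start" call-back
        self.running = False


class Scheduler:
    """Hypothetical model of the asynchronous mode, exclusive access only."""

    def add_job(self, job):
        # Assumed to be atomic: the job lands in every buffer's FIFO
        # before any other job can be queued.
        for buf in job.buffers:
            buf.fifo.append(job)
        self._try_start(job)

    def complete(self, job):
        # Remove the finished job from each buffer's FIFO ...
        for buf in job.buffers:
            buf.fifo.remove(job)
        # ... then evaluate the "next" job on each of those buffers.
        for buf in job.buffers:
            if buf.fifo:
                self._try_start(buf.fifo[0])

    def _try_start(self, job):
        # A job may run once it is at the head of all of its FIFOs.
        if not job.running and all(b.fifo[0] is job for b in job.buffers):
            job.running = True
            job.start(job)


started = []
sched = Scheduler()
a, b = Buffer("a"), Buffer("b")
j1 = Job("j1", [a, b], lambda j: started.append(j.name))
j2 = Job("j2", [b], lambda j: started.append(j.name))
j3 = Job("j3", [a], lambda j: started.append(j.name))

sched.add_job(j1)   # both buffers free, so j1 starts immediately
sched.add_job(j2)   # queued behind j1 on buffer b, does not block add_job
sched.add_job(j3)   # queued behind j1 on buffer a
sched.complete(j1)  # j3 and j2 both become runnable
```

Note that queuing j2 and j3 returns straight away; nothing is
serialized beyond the insertion into the FIFOs.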

> > This might be a good argument for keeping synchronization and cache
> > maintenance separate, though even ignoring synchronization I would
> > think being able to issue cache maintenance operations for multiple
> > buffers in a single ioctl might present some small efficiency gains.
> > However as Rob points out, CPU access is already in slow/legacy
> > territory.
> 
> Dangerous assumption. I do think they should be separate. For one it
> makes the case of synchronization needed but hardware cache management
> much easier to split cleanly. Assuming CPU access is slow/legacy
> reflects a certain model of relatively slow CPU and accelerators
> where falling off the acceleration path is bad. On a higher end
> processor falling off the acceleration path isn't a performance
> matter so much as a power concern.

On some GPU architectures, glReadPixels is a _very_ heavy-weight
operation, so it is very much a performance issue and I think always
will be. However, I think this might be a special case for certain
GPUs: other GPU architectures or device types might be able to
share data with the CPU without such a large impact on performance.
Writing subtitles onto a video frame decoded by a v4l2 hardware
codec seems a good example.

> > KDS we differentiated jobs which needed "exclusive access" to a
> > buffer and jobs which needed "shared access" to a buffer. Multiple
> > jobs could access a buffer at the same time if those jobs all
> 
> Makes sense as it's a reader/writer lock and it reflects MESI/MOESI
> caching and cache policy in some hardware/software assists.

Actually, this got me thinking... Several ARM implementations rely
on CPU/NEON to perform X.Org's 2D operations and those tend to
operate directly on the framebuffer. So in that case, both the CPU
and display controller need to access the same buffer at the same
time, even though one of them is writing to the buffer. This is
the main reason we called it shared/exclusive access in KDS rather
than read-only/read-write access. In such scenarios, you'd still
want to do a CPU cache flush after CPU-based 2D drawing is complete
to make sure the display controller "saw" those changes. So yes,
perhaps there's actually a use-case where synchronization must be
kept separate from cache maintenance? In which case, is it worth
making the proposed prepare/finish API more explicit, in that it is
a CPU cache invalidate and CPU cache flush operation only? Or are
there other things one might want to do in prepare/finish?
Automatic cache domain tracking, for example?
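
As a rough illustration of why shared/exclusive is a different thing
from read-only/read-write, here is a small Python sketch of how one
buffer's FIFO might be evaluated. The rule used (a run of "shared"
entries at the head of the FIFO may proceed together, an "exclusive"
entry runs alone) is my assumption of reader/writer-style behaviour,
and the function name runnable is made up for the example. Cache
maintenance does not appear at all, it stays a separate step.

```python
SHARED, EXCLUSIVE = "shared", "exclusive"


def runnable(fifo):
    """Return the names of the jobs at the head of one buffer's FIFO
    that may use the buffer right now (assumed reader/writer-style rule)."""
    if not fifo:
        return []
    name, mode = fifo[0]
    if mode == EXCLUSIVE:
        # An exclusive job runs on its own.
        return [name]
    ready = []
    for name, mode in fifo:
        if mode != SHARED:
            break  # everything behind an exclusive request has to wait
        ready.append(name)
    return ready


# The CPU writes to the framebuffer while the display controller scans
# it out: both ask for shared access, even though one of them writes.
framebuffer_fifo = [
    ("display scanout", SHARED),
    ("CPU 2D draw", SHARED),
    ("GPU render", EXCLUSIVE),
]
ready_now = runnable(framebuffer_fifo)
```

Here the display controller and the CPU drawing job run together and
only the GPU job waits, which would not be expressible if "write"
implied "exclusive".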

> > display controller will be reading the front buffer, but the GPU
> > might also need to read that front buffer. So perhaps adding
> > "read-only" & "read-write" access flags to prepare could also be
> > interpreted as shared & exclusive accesses, if we went down this
> > route for synchronization that is. :-)
> 
> mmap includes read/write info so probably using that works out. It also
> means that you have the stuff mapped in a way that will bus error or
> segfault anyone who goofs rather than give them the usual 'deep
> weirdness' behaviour you get with mishandling of caching bits.

I think it might be possible to make the case for caching user-space
mappings. In which case, you might want to always mmap read-write
but sometimes do a read operation and sometimes a write. So I think
we'd prefer not to make the read-only/read-write decision at mmap
time.
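
Roughly what I have in mind, as a Python sketch: the buffer is mapped
read-write once and the access type is declared per operation instead.
Everything here is hypothetical. cpu_access stands in for a
prepare/finish pair, the "invalidate" and "flush" entries only record
what a real implementation might do, and an anonymous mapping stands in
for a mapped dma-buf.

```python
import mmap
from contextlib import contextmanager

READ, WRITE = "read", "write"
cache_ops = []  # records the cache maintenance a real driver might perform


@contextmanager
def cpu_access(access):
    """Hypothetical prepare/finish bracket around one CPU access."""
    # prepare: invalidate so the CPU does not see stale cache lines
    cache_ops.append(("invalidate", access))
    try:
        yield
    finally:
        # finish: only a write leaves dirty lines that need flushing
        if access == WRITE:
            cache_ops.append(("flush", access))


# Map once, read-write, and keep the mapping around for later accesses.
buf = mmap.mmap(-1, 4096)

with cpu_access(WRITE):
    buf[0:4] = b"subs"  # e.g. drawing subtitles into a decoded frame

with cpu_access(READ):
    first = bytes(buf[0:4])

buf.close()
```

The mapping itself never changes; only the per-access flag decides
which cache maintenance is needed.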

Cheers,

Tom