implicit drm synchronization wip with dma-buf

Tue Jun 26 10:38:18 PDT 2012

Hey,

Due to inertia, I thought I would take a shot at implicit synchronization as well.
I have just barely enough to make it work for nouveau to synchronize with itself
now using the cpu. Hopefully the general idea is correct but I feel the
implementation wrong.

There are 2 ways to get deadlocks if no proper care is taken to avoid it,
the first being 2 tasks taking each device's lock in a different order,
the second 2 devices waiting on completion of each other before starting
own work. The easiest way to avoid this is to introduce a global
dma_buf_submit_mutex so in cases where synchronization is needed.
This way only 1 submission involving dma-buffer synchronisation can be made
simultaneously. This will make it impossible to deadlock because even if you
take dma mutex->dev a mutex->dev b mutex, and swap a and b, the dmabuf mutex
would prevent this from being done at the same time.

That leaves the real problem of the synchronization itself. I felt that because
the code involved was already sharing dma-buf's, the easiest way to implement it
would be.. another dma-buf. Some hardware might have specific requirements on them,
so I haven't pinned down the exact details yet.

It's a bit of intermingling between drm and dma-buf namespace since it is an early
wip, any comments are welcome though.

This is what I used so far:

#define DRM_PRIME_FENCE_MAX 2
struct drm_prime_fence {
	struct dma_buf *sync_buf;
	uint64_t offset;
	uint32_t value;
	enum {
		/* Nop is to allow preparations in case dma_buf
		 * is different for release, so the call will
		 * never fail at that point.
		 */
		DRM_PRIME_FENCE_NOP = 0,
		DRM_PRIME_FENCE_WAIT_EQ,
//		DRM_PRIME_FENCE_WAIT_GE, /* block while ((int)(cur - expected) < 0); */
		DRM_PRIME_FENCE_SET
	} op;
};

and added to struct dma_buf_ops:
	/* drm_prime_fence_ is written by function to indicate what is needed
	 * to acquire this buffer, up to DRM_PRIME_FENCE_MAX buffers are allowed
	 * sync_acquire returns a negative value on error, otherwise
	 * amounts of fence ops that need to be executed.
	 *
	 * Release is not allowed to fail and merely returns number of
	 * fence ops that needs to be executed after command stream is done.
	 * Abort occurs when there's a failure between acquire and release,
	 * for example because dma-buf's from multiple devices are involved
	 * and the other one failed to acquire.
	 */
	int (*sync_acquire)(struct dma_buf *, struct drm_prime_fence fence[2],
			    unsigned long align, unsigned long release_write);
	int (*sync_release)(struct dma_buf * struct drm_prime_fence fence[2]);
	void (*sync_abort)(struct dma_buf *);

I'm not completely sure about this part yet, align can be seen as minimum
alignment requirement, ideally I would negotiate those earlier but I haven't
found the correct place yet, maybe on attach?
nouveau writes a 16 byte stamp as part of it's semaphore ops
(4 bytes programmable, 4 bytes pad, 8 bytes timestamp) which is why I need
to communicate those requirements somehow. Not all nouveau cards wold support
DRM_PRIME_FENCE_WAIT_GE either.

I think there is a great power in making the sync object itself just another
dma-buf that can be written to and/or read. Especially since all graphics
card have some way to write an arbitrary 4-byte value to an arbitrary location
(even the oldest intel cards have a blitter! :D). I'm hoping for more input
into making the api better for other users too, which is why I'm posting
this as early as I had something working (for some definition of working).

Thoughts?
~Maarten