Internal DRI subsystem locking and contention between connector commits

Thu Oct 6 12:03:56 UTC 2022

I have a DRM master implementing a purpose-built compositor for a dedicated use-case. It drives several different connectors, each on its own vsync cadence (there's no clone mode happening here).

The goal is for commits to each connector to proceed completely independently of whatever is happening on the other connectors. A separate thread issues the DRI ioctls for each connector.

In the compositor, each connector is treated as its own little universe: a disjoint set of CRTCs and planes is earmarked for each connector. One intention is to avoid sharing resources in a way that would introduce implicit synchronization points between the connectors' event loops. So an atomic commit to one connector never uses a resource that has ever been used in a commit to a different connector. This may be relevant to a question about resource-locking contention that I'll ask below.
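For concreteness, the partitioning rule can be expressed as a startup check. This is a minimal sketch, not the compositor's actual code: `struct connector_resources`, the bitmask layout (indices into the CRTC and plane arrays, at most 32 of each) and the helper names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connector resource set. Bit i refers to the i-th CRTC or
 * plane as enumerated by the compositor, not to a KMS object ID. */
struct connector_resources {
    uint32_t connector_id;
    uint32_t crtc_mask;   /* CRTCs earmarked for this connector */
    uint32_t plane_mask;  /* planes earmarked for this connector */
};

/* True if no CRTC or plane is earmarked for more than one connector. */
static bool resources_disjoint(const struct connector_resources *sets, size_t n)
{
    uint32_t crtcs_seen = 0, planes_seen = 0;

    for (size_t i = 0; i < n; i++) {
        if ((crtcs_seen & sets[i].crtc_mask) || (planes_seen & sets[i].plane_mask))
            return false;
        crtcs_seen |= sets[i].crtc_mask;
        planes_seen |= sets[i].plane_mask;
    }
    return true;
}

/* Guard for building a connector's atomic request: only CRTCs from that
 * connector's own set may be referenced. */
static bool may_use_crtc(const struct connector_resources *set, unsigned crtc_idx)
{
    return crtc_idx < 32 && (set->crtc_mask & (UINT32_C(1) << crtc_idx)) != 0;
}

/* Same guard for planes. */
static bool may_use_plane(const struct connector_resources *set, unsigned plane_idx)
{
    return plane_idx < 32 && (set->plane_mask & (UINT32_C(1) << plane_idx)) != 0;
}

/* Run once at startup; aborts (in assert-enabled builds) on any overlap. */
static void check_partition(const struct connector_resources *sets, size_t n)
{
    assert(resources_disjoint(sets, n) && "CRTC/plane shared between connectors");
}
```

Note that this only guarantees userspace never shares objects between connectors; as described in this thread, it does not stop the kernel from serializing commits on a lock that is shared regardless.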

For some time, I've noticed that even test-only atomic commits on connector A sometimes block for many frame times. Analysis with the DRI driver implementor has shown that the atomic commits to A (whether DRM_MODE_ATOMIC_TEST_ONLY or DRM_MODE_ATOMIC_NONBLOCK) get stuck in the ioctl entry code waiting on a DRI mutex.
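One way to quantify such stalls from userspace is to time each commit call and express the result in frame periods. The sketch below assumes a hypothetical `commit_fn` hook; in a real compositor it would wrap something like `drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_TEST_ONLY, NULL)` from libdrm. To stay self-contained it uses a hypothetical `simulated_blocked_commit` that just sleeps, standing in for an ioctl stuck behind another connector's lock.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical hook around the real commit call; returns 0 on success. */
typedef int (*commit_fn)(void *ctx);

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical stand-in for a blocked commit: sleeps *(long *)ctx ns. */
static int simulated_blocked_commit(void *ctx)
{
    long ns = *(const long *)ctx;
    struct timespec ts = { ns / 1000000000L, ns % 1000000000L };
    return nanosleep(&ts, NULL);
}

/* Runs one commit and reports how long the call blocked, in nanoseconds. */
static int timed_commit(commit_fn fn, void *ctx, int64_t *elapsed_ns)
{
    int64_t start = now_ns();
    int ret = fn(ctx);
    *elapsed_ns = now_ns() - start;
    return ret;
}

/* Whole frame periods at refresh_hz consumed by a call of elapsed_ns.
 * A result >= 1 means at least one vblank went by while the call blocked. */
static int64_t frames_spanned(int64_t elapsed_ns, int refresh_hz)
{
    assert(refresh_hz > 0);
    int64_t period_ns = 1000000000LL / refresh_hz;
    return elapsed_ns / period_ns;
}
```

For example, at 60 Hz a 50 ms stall in a test-only commit spans 3 whole frame periods, so a per-frame budget of one commit per vblank cannot be met on that connector.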

It turns out that during these unexpected delays, the DRI driver's commit thread holds that mutex while servicing a commit to connector B, and keeps holding it while it waits for the fences of all framebuffer IDs referenced by the pending connector B scene. So the commit to connector A can't be tested or enqueued until the commit to B has completely finished. The driver author reckons this is unavoidable, because every DRM_IOCTL_MODE_ATOMIC ioctl needs to acquire the same global, singleton DRM connection_mutex in order to query or manipulate the connector.
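The serialization described above can be reproduced in a small userspace model. This is a sketch of the behaviour as reported by the driver author, not kernel code: `model_connection_mutex` stands in for the DRM connection_mutex, the fence wait is modelled as a sleep, and all function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for the single lock every atomic ioctl must take. */
static pthread_mutex_t model_connection_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Lets the demo start A's commit only after B's worker holds the lock. */
static pthread_mutex_t held_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t held_cond = PTHREAD_COND_INITIALIZER;
static bool b_holds_lock = false;

static int64_t mono_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Connector B's commit worker: takes the global lock, then "waits for
 * fences" (a sleep here) before releasing it. */
static void *commit_worker_b(void *arg)
{
    long fence_wait_ns = *(const long *)arg;
    struct timespec ts = { fence_wait_ns / 1000000000L, fence_wait_ns % 1000000000L };

    pthread_mutex_lock(&model_connection_mutex);
    pthread_mutex_lock(&held_lock);
    b_holds_lock = true;
    pthread_cond_signal(&held_cond);
    pthread_mutex_unlock(&held_lock);

    nanosleep(&ts, NULL); /* B's framebuffer fences not yet signalled */
    pthread_mutex_unlock(&model_connection_mutex);
    return NULL;
}

/* Connector A's TEST_ONLY commit: needs the lock only briefly, but must
 * wait for B's worker. Returns how long A was blocked, in ns. */
static int64_t test_only_commit_a(void)
{
    int64_t start = mono_ns();
    pthread_mutex_lock(&model_connection_mutex);
    /* ...validate A's own, disjoint CRTCs and planes here... */
    pthread_mutex_unlock(&model_connection_mutex);
    return mono_ns() - start;
}

/* Runs the scenario once; returns A's blocked time in ns, or -1. */
static int64_t run_contention_demo(long fence_wait_ns)
{
    pthread_t b;

    assert(fence_wait_ns >= 0);
    b_holds_lock = false;
    if (pthread_create(&b, NULL, commit_worker_b, &fence_wait_ns) != 0)
        return -1;

    pthread_mutex_lock(&held_lock);
    while (!b_holds_lock)
        pthread_cond_wait(&held_cond, &held_lock);
    pthread_mutex_unlock(&held_lock);

    int64_t blocked = test_only_commit_a();
    pthread_join(b, NULL);
    return blocked;
}
```

In this model, A's blocked time tracks B's fence wait no matter how disjoint the two connectors' CRTCs and planes are, which matches the symptom described above.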

The result is that it's quite difficult to guarantee a framerate on connector A, because unrelated activity performed on connector B can hold global locks for an unpredictable amount of time.

The first question would be: does this story sound consistent? If so, then a couple more questions follow.

Is this kind of implicit interlocking expected? Is there any way to avoid the pending commits getting serialized like that on the kernel side?

Thanks
-Matt
