[Mesa-dev] [PATCH 03/15] dma-buf: Document dma-buf implicit fencing/resv fencing rules

Wed Jun 23 08:41:33 UTC 2021

Am 22.06.21 um 18:54 schrieb Daniel Vetter:
> Docs for struct dma_resv are fairly clear:
>
> "A reservation object can have attached one exclusive fence (normally
> associated with write operations) or N shared fences (read
> operations)."
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdri.freedesktop.org%2Fdocs%2Fdrm%2Fdriver-api%2Fdma-buf.html%23reservation-objects&data=04%7C01%7Cchristian.koenig%40amd.com%7C42d8c70b62084a846e9508d9359e8629%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637599777264104449%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=rcd4i0VpK3BhwFLzsxd66OhshdJJh3yhRo2MOqlCEBo%3D&reserved=0
>
> Furthermore a review across all of upstream.
>
> First of render drivers and how they set implicit fences:
>
> - nouveau follows this contract, see in validate_fini_no_ticket()
>
> 			nouveau_bo_fence(nvbo, fence, !!b->write_domains);
>
>    and that last boolean controls whether the exclusive or shared fence
>    slot is used.
>
> - radeon follows this contract by setting
>
> 		p->relocs[i].tv.num_shared = !r->write_domain;
>
>    in radeon_cs_parser_relocs(), which ensures that the call to
>    ttm_eu_fence_buffer_objects() in radeon_cs_parser_fini() will do the
>    right thing.
>
> - vmwgfx seems to follow this contract with the shotgun approach of
>    always setting ttm_val_buf->num_shared = 0, which means
>    ttm_eu_fence_buffer_objects() will only use the exclusive slot.
>
> - etnaviv follows this contract, as can be trivially seen by looking
>    at submit_attach_object_fences()
>
> - i915 is a bit a convoluted maze with multiple paths leading to
>    i915_vma_move_to_active(). Which sets the exclusive flag if
>    EXEC_OBJECT_WRITE is set. This can either come as a buffer flag for
>    softpin mode, or through the write_domain when using relocations. It
>    follows this contract.
>
> - lima follows this contract, see lima_gem_submit() which sets the
>    exclusive fence when the LIMA_SUBMIT_BO_WRITE flag is set for that
>    bo
>
> - msm follows this contract, see msm_gpu_submit() which sets the
>    exclusive flag when the MSM_SUBMIT_BO_WRITE is set for that buffer
>
> - panfrost follows this contract with the shotgun approach of just
>    always setting the exclusive fence, see
>    panfrost_attach_object_fences(). Benefits of a single engine I guess
>
> - v3d follows this contract with the same shotgun approach in
>    v3d_attach_fences_and_unlock_reservation(), but it has at least an
>    XXX comment that maybe this should be improved
>
> - v4c uses the same shotgun approach of always setting an exclusive
>    fence, see vc4_update_bo_seqnos()
>
> - vgem also follows this contract, see vgem_fence_attach_ioctl() and
>    the VGEM_FENCE_WRITE. This is used in some igts to validate prime
>    sharing with i915.ko without the need of a 2nd gpu
>
> - vritio follows this contract again with the shotgun approach of
>    always setting an exclusive fence, see virtio_gpu_array_add_fence()
>
> This covers the setting of the exclusive fences when writing.
>
> Synchronizing against the exclusive fence is a lot more tricky, and I
> only spot checked a few:
>
> - i915 does it, with the optional EXEC_OBJECT_ASYNC to skip all
>    implicit dependencies (which is used by vulkan)
>
> - etnaviv does this. Implicit dependencies are collected in
>    submit_fence_sync(), again with an opt-out flag
>    ETNA_SUBMIT_NO_IMPLICIT. These are then picked up in
>    etnaviv_sched_dependency which is the
>    drm_sched_backend_ops->dependency callback.
>
> - v4c seems to not do much here, maybe gets away with it by not having
>    a scheduler and only a single engine. Since all newer broadcom chips than
>    the OG vc4 use v3d for rendering, which follows this contract, the
>    impact of this issue is fairly small.
>
> - v3d does this using the drm_gem_fence_array_add_implicit() helper,
>    which then it's drm_sched_backend_ops->dependency callback
>    v3d_job_dependency() picks up.
>
> - panfrost is nice here and tracks the implicit fences in
>    panfrost_job->implicit_fences, which again the
>    drm_sched_backend_ops->dependency callback panfrost_job_dependency()
>    picks up. It is mildly questionable though since it only picks up
>    exclusive fences in panfrost_acquire_object_fences(), but not buggy
>    in practice because it also always sets the exclusive fence. It
>    should pick up both sets of fences, just in case there's ever going
>    to be a 2nd gpu in a SoC with a mali gpu. Or maybe a mali SoC with a
>    pcie port and a real gpu, which might actually happen eventually. A
>    bug, but easy to fix. Should probably use the
>    drm_gem_fence_array_add_implicit() helper.
>
> - lima is nice an easy, uses drm_gem_fence_array_add_implicit() and
>    the same schema as v3d.
>
> - msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT,
>    but because it doesn't use the drm/scheduler it handles fences from
>    the wrong context with a synchronous dma_fence_wait. See
>    submit_fence_sync() leading to msm_gem_sync_object(). Investing into
>    a scheduler might be a good idea.
>
> - all the remaining drivers are ttm based, where I hope they do
>    appropriately obey implicit fences already. I didn't do the full
>    audit there because a) not follow the contract would confuse ttm
>    quite well and b) reading non-standard scheduler and submit code
>    which isn't based on drm/scheduler is a pain.
>
> Onwards to the display side.
>
> - Any driver using the drm_gem_plane_helper_prepare_fb() helper will
>    correctly. Overwhelmingly most drivers get this right, except a few
>    totally dont. I'll follow up with a patch to make this the default
>    and avoid a bunch of bugs.
>
> - I didn't audit the ttm drivers, but given that dma_resv started
>    there I hope they get this right.
>
> In conclusion this IS the contract, both as documented and
> overwhelmingly implemented, specically as implemented by all render
> drivers except amdgpu.
>
> Amdgpu tried to fix this already in
>
> commit 049aca4363d8af87cab8d53de5401602db3b9999
> Author: Christian König <christian.koenig at amd.com>
> Date:   Wed Sep 19 16:54:35 2018 +0200
>
>      drm/amdgpu: fix using shared fence for exported BOs v2
>
> but this fix falls short on a number of areas:
>
> - It's racy, by the time the buffer is shared it might be too late. To
>    make sure there's definitely never a problem we need to set the
>    fences correctly for any buffer that's potentially exportable.
>
> - It's breaking uapi, dma-buf fds support poll() and differentitiate
>    between, which was introduced in
>
> 	commit 9b495a5887994a6d74d5c261d012083a92b94738
> 	Author: Maarten Lankhorst <maarten.lankhorst at canonical.com>
> 	Date:   Tue Jul 1 12:57:43 2014 +0200
>
> 	    dma-buf: add poll support, v3
>
> - Christian König wants to nack new uapi building further on this
>    dma_resv contract because it breaks amdgpu, quoting
>
>    "Yeah, and that is exactly the reason why I will NAK this uAPI change.
>
>    "This doesn't works for amdgpu at all for the reasons outlined above."
>
>    https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2Ff2eb6751-2f82-9b23-f57e-548de5b729de%40gmail.com%2F&data=04%7C01%7Cchristian.koenig%40amd.com%7C42d8c70b62084a846e9508d9359e8629%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637599777264114436%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=CboicEqZau1%2FlEgEM3w4Ye2Nq6wULjePIehxaMqW3Fg%3D&reserved=0
>
>    Rejecting new development because your own driver is broken and
>    violates established cross driver contracts and uapi is really not
>    how upstream works.
>
> Now this patch will have a severe performance impact on anything that
> runs on multiple engines. So we can't just merge it outright, but need
> a bit a plan:
>
> - amdgpu needs a proper uapi for handling implicit fencing. The funny
>    thing is that to do it correctly, implicit fencing must be treated
>    as a very strange IPC mechanism for transporting fences, where both
>    setting the fence and dependency intercepts must be handled
>    explicitly. Current best practices is a per-bo flag to indicate
>    writes, and a per-bo flag to to skip implicit fencing in the CS
>    ioctl as a new chunk.
>
> - Since amdgpu has been shipping with broken behaviour we need an
>    opt-out flag from the butchered implicit fencing model to enable the
>    proper explicit implicit fencing model.
>
> - for kernel memory fences due to bo moves at least the i915 idea is
>    to use ttm_bo->moving. amdgpu probably needs the same.
>
> - since the current p2p dma-buf interface assumes the kernel memory
>    fence is in the exclusive dma_resv fence slot we need to add a new
>    fence slot for kernel fences, which must never be ignored. Since
>    currently only amdgpu supports this there's no real problem here
>    yet, until amdgpu gains a NO_IMPLICIT CS flag.
>
> - New userspace needs to ship in enough desktop distros so that users
>    wont notice the perf impact. I think we can ignore LTS distros who
>    upgrade their kernels but not their mesa3d snapshot.
>
> - Then when this is all in place we can merge this patch here.
>
> What is not a solution to this problem here is trying to make the
> dma_resv rules in the kernel more clever. The fundamental issue here
> is that the amdgpu CS uapi is the least expressive one across all
> drivers (only equalled by panfrost, which has an actual excuse) by not
> allowing any userspace control over how implicit sync is conducted.
>
> Until this is fixed it's completely pointless to make the kernel more
> clever to improve amdgpu, because all we're doing is papering over
> this uapi design issue. amdgpu needs to attain the status quo
> established by other drivers first, once that's achieved we can tackle
> the remaining issues in a consistent way across drivers.
>
> v2: Bas pointed me at AMDGPU_GEM_CREATE_EXPLICIT_SYNC, which I
> entirely missed.
>
> This is great because it means the amdgpu specific piece for proper
> implicit fence handling exists already, and that since a while. The
> only thing that's now missing is
> - fishing the implicit fences out of a shared object at the right time
> - setting the exclusive implicit fence slot at the right time.
>
> Jason has a patch series to fill that gap with a bunch of generic
> ioctl on the dma-buf fd:
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210520190007.534046-1-jason%40jlekstrand.net%2F&data=04%7C01%7Cchristian.koenig%40amd.com%7C42d8c70b62084a846e9508d9359e8629%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637599777264114436%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=%2FXL98%2F2%2F1GZOICxCTPTzfIvcSm144vPBjUDM29aeyF8%3D&reserved=0
>
> v3: Since Christian has fixed amdgpu now in
>
> commit 8c505bdc9c8b955223b054e34a0be9c3d841cd20 (drm-misc/drm-misc-next)
> Author: Christian König <christian.koenig at amd.com>
> Date:   Wed Jun 9 13:51:36 2021 +0200
>
>      drm/amdgpu: rework dma_resv handling v3
>
> Use the audit covered in this commit message as the excuse to update
> the dma-buf docs around dma_buf.resv usage across drivers.
>
> Since dynamic importers have different rules also hammer these in
> again while we're at it.
>
> Cc: mesa-dev at lists.freedesktop.org
> Cc: Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>
> Cc: Dave Airlie <airlied at gmail.com>
> Cc: Rob Clark <robdclark at chromium.org>
> Cc: Kristian H. Kristensen <hoegsberg at google.com>
> Cc: Michel Dänzer <michel at daenzer.net>
> Cc: Daniel Stone <daniels at collabora.com>
> Cc: Sumit Semwal <sumit.semwal at linaro.org>
> Cc: "Christian König" <christian.koenig at amd.com>
> Cc: Alex Deucher <alexander.deucher at amd.com>
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Cc: Deepak R Varma <mh12gx2825 at gmail.com>
> Cc: Chen Li <chenli at uniontech.com>
> Cc: Kevin Wang <kevin1.wang at amd.com>
> Cc: Dennis Li <Dennis.Li at amd.com>
> Cc: Luben Tuikov <luben.tuikov at amd.com>
> Cc: linaro-mm-sig at lists.linaro.org
> Signed-off-by: Daniel Vetter <daniel.vetter at intel.com>

Reviewed-by: Christian König <christian.koenig at amd.com>

> ---
>   include/linux/dma-buf.h | 39 +++++++++++++++++++++++++++++++++++++++
>   1 file changed, 39 insertions(+)
>
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 6d18b9e448b9..4807cefe81f5 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -388,6 +388,45 @@ struct dma_buf {
>   	 * @resv:
>   	 *
>   	 * Reservation object linked to this dma-buf.
> +	 *
> +	 * IMPLICIT SYNCHRONIZATION RULES:
> +	 *
> +	 * Drivers which support implicit synchronization of buffer access as
> +	 * e.g. exposed in `Implicit Fence Poll Support`_ should follow the
> +	 * below rules.
> +	 *
> +	 * - Drivers should add a shared fence through
> +	 *   dma_resv_add_shared_fence() for anything the userspace API
> +	 *   considers a read access. This highly depends upon the API and
> +	 *   window system: E.g. OpenGL is generally implicitly synchronized on
> +	 *   Linux, but explicitly synchronized on Android. Whereas Vulkan is
> +	 *   generally explicitly synchronized for everything, and window system
> +	 *   buffers have explicit API calls (which then need to make sure the
> +	 *   implicit fences store here in @resv are updated correctly).
> +	 *
> +	 * - Similarly drivers should set the exclusive fence through
> +	 *   dma_resv_add_excl_fence() for anything the userspace API considers
> +	 *   write access.
> +	 *
> +	 * - Drivers may just always set the exclusive fence, since that only
> +	 *   causes unecessarily synchronization, but no correctness issues.
> +	 *
> +	 * - Some drivers only expose a synchronous userspace API with no
> +	 *   pipelining across drivers. These do not set any fences for their
> +	 *   access. An example here is v4l.
> +	 *
> +	 * DYNAMIC IMPORTER RULES:
> +	 *
> +	 * Dynamic importers, see dma_buf_attachment_is_dynamic(), have
> +	 * additional constraints on how they set up fences:
> +	 *
> +	 * - Dynamic importers must obey the exclusive fence and wait for it to
> +	 *   signal before allowing access to the buffer's underlying storage
> +	 *   through.
> +	 *
> +	 * - Dynamic importers should set fences for any access that they can't
> +	 *   disable immediately from their @dma_buf_attach_ops.move_notify
> +	 *   callback.
>   	 */
>   	struct dma_resv *resv;
>