[PATCH 2/2] drm/doc/rfc: i915 new parallel submission uAPI plan

Wed Jun 30 09:11:53 UTC 2021

On 29/06/2021 22:35, Matthew Brost wrote:
> Add entry for i915 new parallel submission uAPI plan.
> 
> v2:
>   (Daniel Vetter):
>    - Expand logical order explaination
>    - Add dummy header
>    - Only allow N BBs in execbuf IOCTL
>    - Configure parallel submission per slot not per gem context
> v3:
>   (Marcin Ślusarz):
>    - Lot's of typos / bad english fixed
>   (Tvrtko Ursulin):
>    - Consistent pseudo code, clean up wording in descriptions
> v4:
>   (Daniel Vetter)
>    - Drop flags
>    - Add kernel doc
>    - Reword a few things / fix typos
>   (Tvrtko)
>    - Reword a few things / fix typos
> v5:
>   (Checkpatch)
>    - Fix typos
>   (Docs)
>    - Fix warning
> 
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Tony Ye <tony.ye at intel.com>
> CC: Carl Zhang <carl.zhang at intel.com>
> Cc: Daniel Vetter <daniel.vetter at intel.com>
> Cc: Jason Ekstrand <jason at jlekstrand.net>
> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> Acked-by: Daniel Vetter <daniel.vetter at ffwll.ch>
> Acked-by: Tony Ye <tony.ye at intel.com>
> ---
>   Documentation/gpu/rfc/i915_parallel_execbuf.h | 122 ++++++++++++++++++
>   Documentation/gpu/rfc/i915_scheduler.rst      |  59 ++++++++-
>   2 files changed, 180 insertions(+), 1 deletion(-)
>   create mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
> 
> diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> new file mode 100644
> index 000000000000..8cbe2c4e0172
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> @@ -0,0 +1,122 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2021 Intel Corporation
> + */
> +
> +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
> +
> +/**
> + * struct drm_i915_context_engines_parallel_submit - Configure engine for
> + * parallel submission.
> + *
> + * Setup a slot in the context engine map to allow multiple BBs to be submitted
> + * in a single execbuf IOCTL. Those BBs will then be scheduled to run on the GPU
> + * in parallel. Multiple hardware contexts are created internally in the i915
> + * run these BBs.

The previous sentence makes little sense.

>  Once a slot is configured for N BBs only N BBs can be
> + * submitted in each execbuf IOCTL and this is implicit behavior e.g. The user
> + * doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL knows how
> + * many BBs there are based on the slot's configuration. The N BBs are the last
> + * N buffer objects or first N if I915_EXEC_BATCH_FIRST is set.
> + *
> + * The default placement behavior is to create implicit bonds between each
> + * context if each context maps to more than 1 physical engine (e.g. context is
> + * a virtual engine). Also we only allow contexts of same engine class and these
> + * contexts must be in logically contiguous order. Examples of the placement
> + * behavior described below. Lastly, the default is to not allow BBs to
> + * preempted mid BB rather insert coordinated preemption on all hardware
> + * contexts between each set of BBs. Flags may be added in the future to change
> + * both of these default behaviors.
> + *
> + * Returns -EINVAL if hardware context placement configuration is invalid or if
> + * the placement configuration isn't supported on the platform / submission
> + * interface.
> + * Returns -ENODEV if extension isn't supported on the platform / submission
> + * interface.
> + *
> + * .. code-block:: none
> + *
> + *	Example 1 pseudo code:
> + *	CS[X] = generic engine of same class, logical instance X
> + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + *	set_engines(INVALID)
> + *	set_parallel(engine_index=0, width=2, num_siblings=1,
> + *		     engines=CS[0],CS[1])
> + *
> + *	Results in the following valid placement:
> + *	CS[0], CS[1]
> + *
> + *	Example 2 pseudo code:
> + *	CS[X] = generic engine of same class, logical instance X
> + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + *	set_engines(INVALID)
> + *	set_parallel(engine_index=0, width=2, num_siblings=2,
> + *		     engines=CS[0],CS[2],CS[1],CS[3])
> + *
> + *	Results in the following valid placements:
> + *	CS[0], CS[1]
> + *	CS[2], CS[3]
> + *
> + *	This can also be thought of as 2 virtual engines described by 2-D array
> + *	in the engines the field with bonds placed between each index of the
> + *	virtual engines. e.g. CS[0] is bonded to CS[1], CS[2] is bonded to
> + *	CS[3].
> + *	VE[0] = CS[0], CS[2]
> + *	VE[1] = CS[1], CS[3]
> + *
> + *	Example 3 pseudo code:
> + *	CS[X] = generic engine of same class, logical instance X
> + *	INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + *	set_engines(INVALID)
> + *	set_parallel(engine_index=0, width=2, num_siblings=2,
> + *		     engines=CS[0],CS[1],CS[1],CS[3])
> + *
> + *	Results in the following valid and invalid placements:
> + *	CS[0], CS[1]
> + *	CS[1], CS[3] - Not logical contiguous, return -EINVAL
> + */
> +struct drm_i915_context_engines_parallel_submit {
> +	/**
> +	 * @base: base user extension.
> +	 */
> +	struct i915_user_extension base;
> +
> +	/**
> +	 * @engine_index: slot for parallel engine
> +	 */
> +	__u16 engine_index;
> +
> +	/**
> +	 * @width: number of contexts per parallel engine
> +	 */
> +	__u16 width;
> +
> +	/**
> +	 * @num_siblings: number of siblings per context
> +	 */
> +	__u16 num_siblings;
> +
> +	/**
> +	 * @mbz16: reserved for future use; must be zero
> +	 */
> +	__u16 mbz16;
> +
> +	/**
> +	 * @flags: all undefined flags must be zero, currently not defined flags
> +	 */
> +	__u64 flags;
> +
> +	/**
> +	 * @mbz64: reserved for future use; must be zero
> +	 */
> +	__u64 mbz64[3];
> +
> +	/**
> +	 * @engines: 2-d array of engine instances to configure parallel engine
> +	 *
> +	 * length = width (i) * num_siblings (j)
> +	 * index = j + i * num_siblings
> +	 */
> +	struct i915_engine_class_instance engines[0];
> +
> +} __packed;
> +
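
To check my reading of the @engines layout, here is a minimal sketch of the
indexing rule (index = j + i * num_siblings) and of the "logically contiguous"
rule used in Examples 1 to 3. The helper names engine_slot() and
placement_is_contiguous() are hypothetical, and I am assuming that
"contiguous" means each context's logical instance is exactly one above the
previous context's. This is not taken from the i915 implementation.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: flat index into engines[] for context i
 * (0 <= i < width) and sibling j (0 <= j < num_siblings).
 */
static unsigned int engine_slot(unsigned int i, unsigned int j,
				unsigned int num_siblings)
{
	assert(j < num_siblings);
	return j + i * num_siblings;
}

/*
 * Hypothetical check: placement j is valid when the logical instances
 * picked for contexts 0..width-1 are consecutive (assumption: ascending
 * by exactly one). logical[] holds one logical instance per engines[] entry.
 */
static bool placement_is_contiguous(const unsigned int *logical,
				    unsigned int width,
				    unsigned int num_siblings,
				    unsigned int j)
{
	for (unsigned int i = 1; i < width; i++) {
		unsigned int prev = logical[engine_slot(i - 1, j, num_siblings)];
		unsigned int cur = logical[engine_slot(i, j, num_siblings)];

		if (cur != prev + 1)
			return false;
	}
	return true;
}
```

With engines=CS[0],CS[2],CS[1],CS[3] (Example 2) both placements pass this
check, while with engines=CS[0],CS[1],CS[1],CS[3] (Example 3) the second
placement (CS[1], CS[3]) fails it, which matches the -EINVAL case described
in the kernel doc.
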
> diff --git a/Documentation/gpu/rfc/i915_scheduler.rst b/Documentation/gpu/rfc/i915_scheduler.rst
> index 7acd386a6b49..cbda75065dad 100644
> --- a/Documentation/gpu/rfc/i915_scheduler.rst
> +++ b/Documentation/gpu/rfc/i915_scheduler.rst
> @@ -88,4 +88,61 @@ Spec references:
>   
>   New parallel submission uAPI
>   ============================
> -Details to come in a following patch.
> +The existing bonding uAPI is completely broken with GuC submission because
> +whether a submission is a single context submit or parallel submit isn't known
> +until execbuf time activated via the I915_SUBMIT_FENCE. To submit multiple
> +contexts in parallel with the GuC the context must be explicitly registered with
> +N contexts and all N contexts must be submitted in a single command to the GuC.
> +The GuC interfaces do not support dynamically changing between N contexts as the
> +bonding uAPI does. Hence the need for a new parallel submission interface. Also
> +the legacy bonding uAPI is quite confusing and not intuitive at all. Furthermore
> +I915_SUBMIT_FENCE is by design a future fence, so not really something we should
> +continue to support.

A proprietary firmware interface should not be considered a good reason 
to retire a Linux uAPI. It is messages like this that make me super 
uneasy about the entire ordeal with the GuC: we would have no say in its 
interface, and Linux would be an afterthought for the firmware developers.

Anyway, I'm sure there are plenty of user-facing reasons why the original 
interface sucks, so please focus on those instead.

> +
> +The new parallel submission uAPI consists of 3 parts:
> +
> +* Export engines logical mapping
> +* A 'set_parallel' extension to configure contexts for parallel
> +  submission
> +* Extend execbuf2 IOCTL to support submitting N BBs in a single IOCTL
> +
> +Export engines logical mapping
> +------------------------------
> +Certain use cases require BBs to be placed on engine instances in logical order
> +(e.g. split-frame on gen11+). The logical mapping of engine instances can change
> +based on fusing.

This sentence is quite confusing. How about: "The logical mapping of 
engine instances can change from one GPU to another, as some engines can 
be fused out at the factory for various reasons"?

The above makes it clear that this would be fixed on a per-GPU basis, 
and what type of fusing we are talking about, making it easier to google 
for.

> Rather than making UMDs be aware of fusing, simply expose the
> +logical mapping with the existing query engine info IOCTL. Also the GuC
> +submission interface currently only supports submitting multiple contexts to
> +engines in logical order which is a new requirement compared to execlists.
> +Lastly, all current platforms have at most 2 engine instances and the logical
> +order is the same as uAPI order. This will change on platforms with more than 2
> +engine instances.
> +
> +A single bit will be added to drm_i915_engine_info.flags indicating that the
> +logical instance has been returned and a new field,
> +drm_i915_engine_info.logical_instance, returns the logical instance.
> +
> +A 'set_parallel' extension to configure contexts for parallel submission
> +------------------------------------------------------------------------
> +The 'set_parallel' extension configures a slot for parallel submission of N BBs.
> +It is a setup step that must be called before using any of the contexts. See
> +I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE or I915_CONTEXT_ENGINES_EXT_BOND for
> +similar existing examples. Once a slot is configured for parallel submission the
> +execbuf2 IOCTL can be called submitting N BBs in a single IOCTL. Initially only
> +supports GuC submission. Execlists supports can be added later if needed.
> +
> +Add I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT and
> +drm_i915_context_engines_parallel_submit to the uAPI to implement this
> +extension.
> +
> +.. kernel-doc:: Documentation/gpu/rfc/i915_parallel_execbuf.h
> +        :functions: drm_i915_context_engines_parallel_submit
> +
> +Extend execbuf2 IOCTL to support submitting N BBs in a single IOCTL
> +-------------------------------------------------------------------
> +Contexts that have been configured with the 'set_parallel' extension can only
> +submit N BBs in a single execbuf2 IOCTL. The BBs are either the last N objects
> +in the drm_i915_gem_exec_object2 list or the first N if I915_EXEC_BATCH_FIRST is
> +set. The number of BBs is implicit based on the slot submitted and how it has
> +been configured by 'set_parallel' or other extensions. No uAPI changes are
> +required to the execbuf2 IOCTL.
> 
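
For the record, this is how I understand the implicit batch selection
described above. It is a minimal sketch only: struct bb_range and
select_batches() are hypothetical names, the batch_first argument stands in
for I915_EXEC_BATCH_FIRST being set in the execbuf flags, and none of this is
taken from the i915 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: range of objects in the exec object list treated as BBs. */
struct bb_range {
	size_t first;	/* index of the first BB in the object list */
	size_t count;	/* number of BBs, i.e. the slot's configured width */
};

/*
 * Hypothetical helper: pick the N batch buffers out of buffer_count
 * objects. N is not passed by the caller of execbuf; it comes from the
 * slot's 'set_parallel' configuration (width).
 */
static struct bb_range select_batches(size_t buffer_count, size_t width,
				      bool batch_first)
{
	struct bb_range r;

	assert(width > 0 && width <= buffer_count);

	r.count = width;
	/* First N objects if batch_first is set, otherwise the last N. */
	r.first = batch_first ? 0 : buffer_count - width;
	return r;
}
```

So with 5 objects and a slot configured for width=2, the BBs would be objects
3 and 4 by default, or objects 0 and 1 with I915_EXEC_BATCH_FIRST.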

Good job! The interface is pretty well described, and setting up ahead 
of time how many parallel contexts will be run should make it easier to 
reason about than the current interface.

That being said, a uAPI should not be designed with only the firmware and 
driver in mind. Would you mind adding a list of use cases where this 
interface would be needed, and a small analysis of the impact of such a 
change?

I mean, userspace was designed with the old interface in mind, and 
forcing it to fix the number of parallel contexts at context creation 
may lead it to submit empty command buffers whenever it needs less 
parallelism. Any recommendations there?

Cheers,
Martin