[Intel-gfx] [RFC PATCH 4/5] drm/i915: Introduce 'set parallel submit' extension

Tue May 11 15:11:44 UTC 2021

On Thu, May 06, 2021 at 10:30:48AM -0700, Matthew Brost wrote:
> i915_drm.h updates for 'set parallel submit' extension.
> 
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Tony Ye <tony.ye at intel.com>
> CC: Carl Zhang <carl.zhang at intel.com>
> Cc: Daniel Vetter <daniel.vetter at intel.com>
> Cc: Jason Ekstrand <jason at jlekstrand.net>
> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> ---
>  include/uapi/drm/i915_drm.h | 126 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 126 insertions(+)
> 
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 26d2e135aa31..0175b12b33b8 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1712,6 +1712,7 @@ struct drm_i915_gem_context_param {
>   * Extensions:
>   *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
>   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)

Hm just relalized, but I don't think this hyperlinsk correctly, and I'm
also not sure this formats very well as a nice list. Using item lists
should look pretty nice like we're doing for the various kms properties,
e.g.

FOO:
  Explain what FOO does

BAR:
  Explain what BAR does. struct bar also automatically generates a link

Please check with make htmldocs and polish this a bit (might need a small
prep patch).

>   */
>  #define I915_CONTEXT_PARAM_ENGINES	0xa
>  
> @@ -1894,9 +1895,134 @@ struct i915_context_param_engines {
>  	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
>  #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
>  #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
>  	struct i915_engine_class_instance engines[0];
>  } __attribute__((packed));
>  
> +/*
> + * i915_context_engines_parallel_submit:
> + *
> + * Setup a gem context to allow multiple BBs to be submitted in a single execbuf
> + * IOCTL. Those BBs will then be scheduled to run on the GPU in parallel.
> + *
> + * All hardware contexts in the engine set are configured for parallel
> + * submission (i.e. once this gem context is configured for parallel submission,
> + * all the hardware contexts, regardless if a BB is available on each individual
> + * context, will be submitted to the GPU in parallel). A user can submit BBs to
> + * subset of the hardware contexts, in a single execbuf IOCTL, but it is not
> + * recommended as it may reserve physical engines with nothing to run on them.
> + * Highly recommended to configure the gem context with N hardware contexts then
> + * always submit N BBs in a single IOCTL.
> + *
> + * Their are two currently defined ways to control the placement of the
> + * hardware contexts on physical engines: default behavior (no flags) and
> + * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added the in the
> + * future as new hardware / use cases arise. Details of how to use this
> + * interface below above the flags.
> + *
> + * Returns -EINVAL if hardware context placement configuration invalid or if the
> + * placement configuration isn't supported on the platform / submission
> + * interface.
> + * Returns -ENODEV if extension isn't supported on the platform / submission
> + * inteface.
> + */
> +struct i915_context_engines_parallel_submit {
> +	struct i915_user_extension base;

Ok this is good, since it makes sure we can't possible use this in
CTX_SETPARAM.

> +
> +/*
> + * Default placement behvavior (currently unsupported):
> + *
> + * Rather than restricting parallel submission to a single class with a
> + * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a mode that
> + * enables parallel submission across multiple engine classes. In this case each
> + * context's logical engine mask indicates where that context can placed. It is
> + * implied in this mode that all contexts have mutual exclusive placement (e.g.
> + * if one context is running CS0 no other contexts can run on CS0).
> + *
> + * Example 1 pseudo code:
> + * CSX[Y] = engine class X, logical instance Y
> + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + * set_engines(INVALID, INVALID)
> + * set_load_balance(engine_index=0, num_siblings=2, engines=CS0[0],CS0[1])
> + * set_load_balance(engine_index=1, num_siblings=2, engines=CS1[0],CS1[1])
> + * set_parallel()
> + *
> + * Results in the following valid placements:
> + * CS0[0], CS1[0]
> + * CS0[0], CS1[1]
> + * CS0[1], CS1[0]
> + * CS0[1], CS1[1]
> + *
> + * Example 2 pseudo code:
> + * CS[X] = generic engine of same class, logical instance X
> + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + * set_engines(INVALID, INVALID)
> + * set_load_balance(engine_index=0, num_siblings=3, engines=CS[0],CS[1],CS[2])
> + * set_load_balance(engine_index=1, num_siblings=3, engines=CS[0],CS[1],CS[2])
> + * set_parallel()
> + *
> + * Results in the following valid placements:
> + * CS[0], CS[1]
> + * CS[0], CS[2]
> + * CS[1], CS[0]
> + * CS[1], CS[2]
> + * CS[2], CS[0]
> + * CS[2], CS[1]
> + *
> + * This enables a use case where all engines are created equally, we don't care
> + * where they are scheduled, we just want a certain number of resources, for
> + * those resources to be scheduled in parallel, and possibly across multiple
> + * engine classes.
> + */
> +
> +/*
> + * I915_PARALLEL_IMPLICT_BONDS - Create implict bonds between each context.
> + * Each context must have the same number sibling and bonds are implictly create
> + * of the siblings.
> + *
> + * All of the below examples are in logical space.
> + *
> + * Example 1 pseudo code:
> + * CS[X] = generic engine of same class, logical instance X
> + * set_engines(CS[0], CS[1])
> + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> + *
> + * Results in the following valid placements:
> + * CS[0], CS[1]
> + *
> + * Example 2 pseudo code:
> + * CS[X] = generic engine of same class, logical instance X
> + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + * set_engines(INVALID, INVALID)
> + * set_load_balance(engine_index=0, num_siblings=2, engines=CS[0],CS[2])
> + * set_load_balance(engine_index=1, num_siblings=2, engines=CS[1],CS[3])
> + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> + *
> + * Results in the following valid placements:
> + * CS[0], CS[1]
> + * CS[2], CS[3]
> + *
> + * This enables a use case where all engines are not equal and certain placement
> + * rules are required (i.e. split-frame requires all contexts to be placed in a
> + * logically contiguous order on the VCS engines on gen11+ platforms). This use
> + * case (logically contiguous placement, within a single engine class) is
> + * supported when using GuC submission. Execlist mode could support all possible
> + * bonding configurations but currently doesn't support this extension.
> + */
> +#define I915_PARALLEL_IMPLICT_BONDS		(1<<0)
> +/*
> + * Do not allow BBs to be preempted mid BB rather insert coordinated preemption
> + * points on all hardware contexts between each set of BBs. An example use case
> + * of this feature is split-frame on gen11+ hardware. When using this feature a
> + * BB must be submitted on each hardware context in the parallel gem context.
> + * The execbuf2 IOCTL enforces the user adheres to policy.
> + */
> +#define I915_PARALLEL_NO_PREEMPT_MID_BATCH	(1<<1)
> +#define I915_PARALLEL_UNKNOWN_FLAGS  (-(I915_PARALLEL_NO_PREEMPT_MID_BATCH << 1))
> +	__u64 flags; /* all undefined flags must be zero */
> +	__u64 mbz64[4]; /* reserved for future use; must be zero */
> +} __attribute__ ((packed));

Ok I'm having some serious questions. This looks way too much like it's
inspired by bonded submission, and given we're tossing bonded submission
we need to make sure we're doing this for good independent reasons and not
just for intertia.

What I expected looking at how media-driver uses bonded submit currently
is:

- We create a parallel submit engine, which occupies a virtual engine
  slot. This parallel virtual engine contains all the information we need,
  i.e. the flags you have above, but also how many engines run in parallel
  and how each of those can be load-balanced. So probably a full NxM
  matrix of physical engines needed.

- Execbuf uses that parallel virtual engine to submit all N batchbuffers
  in one go.

- This means we don't create virtual engines (or physical engine mappings)
  for all the individual pieces in a parallel engine. That's a concept
  from bonded submission, and I think that needs to go.

- More important not having a parallel virtual engine breaks our already
  badly confusing gem ctx api. Ignoring parallel/bonded submit the gem ctx
  is just a container object, which points at a bunch of engines (plus the
  VM and a few other things). Having parallel context something that sits
  at the gem ctx level, and not as an individual engine (of which you can
  have multiple in the same gem ctx) breaks stuff. E.g. right the perf api
  sits at the gem ctx level, so that you can capture all the perf data for
  an entire workload spawning across multiple engines. If a workload now
  needs multiple parallel engines we'd need multiple gem ctx, which breaks
  this.

So what I'd expect we'd have here is roughly:

struct i915_context_engines_parallel_submit {
	struct i915_user_extension base;
	__u64 flags;
	__u32 num_engines; /* N, must match what we submit in the execbuf */
	__u32 num_siblings; /* M, I'm assuming it's ok we require that siblings must match across the entire set of parallel engines */
	struct engine_info[]; /* NxM array of engine infos, pls fill in the right struct name :-) */
};

If we then also require that you always submit the full width of N
batchbuffers then even the execbuf extension doesn't need to exist
anymore, because the virtual parallel engine already contains all the
needed information.

And sure for some backends at least (definitely execlist) we'd need to
create a bunch of additional virtual engines behind that virtual engine.
But they'd be entirely hidden, and not visible to userspace nor the higher
levels.

What am I missing?
-Daniel

>  #define I915_DEFINE_CONTEXT_PARAM_ENGINES(name__, N__) struct { \
>  	__u64 extensions; \
>  	struct i915_engine_class_instance engines[N__]; \
> -- 
> 2.28.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch