[RFC PATCH V2] drm/xe/guc: Use exec queue hints for GT frequency
Upadhyay, Tejas
tejas.upadhyay at intel.com
Fri Jan 10 06:46:03 UTC 2025
> -----Original Message-----
> From: Zeng, Oak <oak.zeng at intel.com>
> Sent: Friday, January 10, 2025 4:30 AM
> To: Belgaumkar, Vinay <vinay.belgaumkar at intel.com>; Upadhyay, Tejas
> <tejas.upadhyay at intel.com>; intel-xe at lists.freedesktop.org
> Cc: dri-devel at lists.freedesktop.org; Nilawar, Badal
> <badal.nilawar at intel.com>; Mrozek, Michal <michal.mrozek at intel.com>;
> Morek, Szymon <szymon.morek at intel.com>; Souza, Jose
> <jose.souza at intel.com>; De Marchi, Lucas <lucas.demarchi at intel.com>
> Subject: RE: [RFC PATCH V2] drm/xe/guc: Use exec queue hints for GT
> frequency
>
>
>
> > -----Original Message-----
> > From: Belgaumkar, Vinay <vinay.belgaumkar at intel.com>
> > Sent: January 9, 2025 5:03 PM
> > To: Zeng, Oak <oak.zeng at intel.com>; Upadhyay, Tejas
> > <tejas.upadhyay at intel.com>; intel-xe at lists.freedesktop.org
> > Cc: dri-devel at lists.freedesktop.org; Nilawar, Badal
> > <badal.nilawar at intel.com>; Mrozek, Michal <michal.mrozek at intel.com>;
> > Morek, Szymon <szymon.morek at intel.com>; Souza, Jose
> > <jose.souza at intel.com>; De Marchi, Lucas <lucas.demarchi at intel.com>
> > Subject: Re: [RFC PATCH V2] drm/xe/guc: Use exec queue hints for GT
> > frequency
> >
> >
> > On 1/9/2025 9:37 AM, Zeng, Oak wrote:
> > >
> > >> -----Original Message-----
> > >> From: dri-devel <dri-devel-bounces at lists.freedesktop.org> On Behalf
> > >> Of Tejas Upadhyay
> > >> Sent: January 9, 2025 7:07 AM
> > >> To: intel-xe at lists.freedesktop.org
> > >> Cc: dri-devel at lists.freedesktop.org; Nilawar, Badal
> > >> <badal.nilawar at intel.com>; Belgaumkar, Vinay
> > >> <vinay.belgaumkar at intel.com>; Mrozek, Michal
> > >> <michal.mrozek at intel.com>; Morek, Szymon <szymon.morek at intel.com>;
> > >> Souza, Jose <jose.souza at intel.com>;
> > >> De Marchi, Lucas <lucas.demarchi at intel.com>; Upadhyay, Tejas
> > >> <tejas.upadhyay at intel.com>
> > >> Subject: [RFC PATCH V2] drm/xe/guc: Use exec queue hints for GT
> > >> frequency
> > >>
> > >> Allow user to provide a low latency hint per exec queue. When set,
> > >> KMD sends a hint to GuC which results in special handling for this
> > >> exec queue. SLPC will ramp the GT frequency aggressively every time
> > >> it switches to this exec queue.
> > >>
> > >> We need to enable the use of SLPC Compute strategy during init, but
> > >> it will apply only to exec queues that set this bit during exec
> > >> queue creation.
> > >>
> > >> Improvement with this approach is as below:
> > >>
> > >> Before,
> > >>
> > >> :~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak --kernel-latency
> > >> Platform: Intel(R) OpenCL Graphics
> > >> Device: Intel(R) Graphics [0xe20b]
> > >> Driver version : 24.52.0 (Linux x64)
> > >> Compute units : 160
> > >> Clock frequency : 2850 MHz
> > >> Kernel launch latency : 283.16 us
> > >>
> > >> After,
> > >>
> > >> :~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak --kernel-latency
> > >> Platform: Intel(R) OpenCL Graphics
> > >> Device: Intel(R) Graphics [0xe20b]
> > >> Driver version : 24.52.0 (Linux x64)
> > >> Compute units : 160
> > >> Clock frequency : 2850 MHz
> > >>
> > >> Kernel launch latency : 63.38 us
> > >>
> > >> UMD will indicate low latency hint with flag as mentioned below,
> > >>
> > >> * struct drm_xe_exec_queue_create exec_queue_create = {
> > >> * .flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT or 0
> > >> * .extensions = 0,
> > >> * .vm_id = vm,
> > >> * .num_bb_per_exec = 1,
> > >> * .num_eng_per_bb = 1,
> > >> * .instances = to_user_pointer(&instance),
> > >> * };
> > >> * ioctl(fd, DRM_IOCTL_XE_EXEC_QUEUE_CREATE,
> > >> &exec_queue_create);
> > >>
> > >> Link to UMD PR : https://github.com/intel/compute-runtime/pull/794
> > >>
> > >> Note: There is an outstanding issue on the GuC side where it is not able
> > >> to switch to max frequency as per the strategy indicated by KMD, so for
> > >> experiment/test results a hardcoding approach was taken and passed to GuC
> > >> as policy. Debugging effort on the GuC side is going on in parallel.
> > >>
> > >> V2:
> > >> - DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT 1 is already planned
> > >>   for another hint (Szymon)
> > >> - Add motivation to description (Lucas)
> > >>
> > >> Cc: dri-devel at lists.freedesktop.org
> > >> Cc: vinay.belgaumkar at intel.com
> > >> Cc: Michal Mrozek <michal.mrozek at intel.com>
> > >> Cc: Szymon Morek <szymon.morek at intel.com>
> > >> Cc: José Roberto de Souza <jose.souza at intel.com>
> > >> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> > >> ---
> > >> drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h | 3 +++
> > >> drivers/gpu/drm/xe/xe_exec_queue.c | 7 ++++---
> > >> drivers/gpu/drm/xe/xe_guc_pc.c | 16 ++++++++++++++++
> > >> drivers/gpu/drm/xe/xe_guc_submit.c | 7 +++++++
> > >> include/uapi/drm/xe_drm.h | 3 ++-
> > >> 5 files changed, 32 insertions(+), 4 deletions(-)
> > >>
> > >> diff --git a/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h
> > >> b/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h
> > >> index 85abe4f09ae2..c50075b8270f 100644
> > >> --- a/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h
> > >> +++ b/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h
> > >> @@ -174,6 +174,9 @@ struct slpc_task_state_data {
> > >> };
> > >> } __packed;
> > >>
> > >> +#define SLPC_EXEC_QUEUE_FREQ_REQ_IS_COMPUTE	REG_BIT(28)
> > >> +#define SLPC_OPTIMIZED_STRATEGY_COMPUTE		REG_BIT(0)
> > >> +
> > >> struct slpc_shared_data_header {
> > >> /* Total size in bytes of this shared buffer. */
> > >> u32 size;
> > >> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c
> > >> b/drivers/gpu/drm/xe/xe_exec_queue.c
> > >> index 8948f50ee58f..7747ba6c4bb8 100644
> > >> --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> > >> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> > >> @@ -553,7 +553,8 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data,
> > >> u32 len;
> > >> int err;
> > >>
> > >> -	if (XE_IOCTL_DBG(xe, args->flags) ||
> > >> +	if (XE_IOCTL_DBG(xe, args->flags &&
> > >> +			 !(args->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT)) ||
> > >> 	    XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
> > >> 		return -EINVAL;
> > >>
> > >> @@ -578,7 +579,7 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data,
> > >>
> > >> 	for_each_tile(tile, xe, id) {
> > >> 		struct xe_exec_queue *new;
> > >> -		u32 flags = EXEC_QUEUE_FLAG_VM;
> > >> +		u32 flags = args->flags | EXEC_QUEUE_FLAG_VM;
> > >
> > > You are mixing internal and external flags here. args->flags is an
> > > external definition. Note that the current value of
> > > DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT conflicts with the internal
> > > definition of:
> > >
> > > #define EXEC_QUEUE_FLAG_PERMANENT BIT(1)
> > >
> > > I think the better way to do it is to define an internal value for this
> > > purpose, such as:
> > >
> > > #define EXEC_QUEUE_FLAG_LOW_LATENCY BIT(5)
> > >
> > > Then write:
> > >
> > > if (args->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT)
> > > 	flags |= EXEC_QUEUE_FLAG_LOW_LATENCY;
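
Makes sense, that avoids the clash with EXEC_QUEUE_FLAG_PERMANENT. A rough,
untested sketch of that mapping (the BIT(5) value is your suggestion and still
needs checking against the free internal EXEC_QUEUE_FLAG_* bits):

	/* xe_exec_queue.h: internal low-latency flag, exact bit value TBD */
	#define EXEC_QUEUE_FLAG_LOW_LATENCY	BIT(5)

	/* xe_exec_queue_create_ioctl(): translate the uAPI hint */
	u32 flags = EXEC_QUEUE_FLAG_VM;

	if (args->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT)
		flags |= EXEC_QUEUE_FLAG_LOW_LATENCY;

Tejas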
> > >
> > >
> > >> 		if (id)
> > >> 			flags |= EXEC_QUEUE_FLAG_BIND_ENGINE_CHILD;
> > >> @@ -626,7 +627,7 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data,
> > >> }
> > >>
> > >> q = xe_exec_queue_create(xe, vm, logical_mask,
> > >> - args->width, hwe, 0,
> > >> + args->width, hwe, args->flags,
> > > Use internal flag also here if you do what I said above
> > >
> > >
> > >> args->extensions);
> > >> up_read(&vm->lock);
> > >> xe_vm_put(vm);
> > >> diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c
> > >> index df7f130fb663..ff0b98ccf1a7 100644
> > >> --- a/drivers/gpu/drm/xe/xe_guc_pc.c
> > >> +++ b/drivers/gpu/drm/xe/xe_guc_pc.c
> > >> @@ -992,6 +992,19 @@ static int pc_init_freqs(struct xe_guc_pc *pc)
> > >> return ret;
> > >> }
> > >>
> > >> +static int xe_guc_pc_set_strategy(struct xe_guc_pc *pc, u32 val)
> > >> +{
> > >> +	int ret = 0;
> > >> +
> > >> +	xe_pm_runtime_get(pc_to_xe(pc));
> > >> +	ret = pc_action_set_param(pc, SLPC_PARAM_STRATEGIES, val);
> > >> +	xe_pm_runtime_put(pc_to_xe(pc));
> > >> +
> > >> +	return ret;
> > >> +}
> > >> +
> > >> /**
> > >> * xe_guc_pc_start - Start GuC's Power Conservation component
> > >> * @pc: Xe_GuC_PC instance
> > >> @@ -1052,6 +1065,9 @@ int xe_guc_pc_start(struct xe_guc_pc *pc)
> > >>
> > >> 	ret = pc_action_setup_gucrc(pc, GUCRC_FIRMWARE_CONTROL);
> > >>
> > >> +	/* Enable SLPC Optimized Strategy for compute */
> > >> +	xe_guc_pc_set_strategy(pc, SLPC_OPTIMIZED_STRATEGY_COMPUTE);
> > >> +
> > >> out:
> > >> xe_force_wake_put(gt_to_fw(gt), fw_ref);
> > >> return ret;
> > >> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c
> > >> b/drivers/gpu/drm/xe/xe_guc_submit.c
> > >> index 9c36329fe857..88a1987ac360 100644
> > >> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > >> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > >> @@ -15,6 +15,7 @@
> > >> #include <drm/drm_managed.h>
> > >>
> > >> #include "abi/guc_actions_abi.h"
> > >> +#include "abi/guc_actions_slpc_abi.h"
> > >> #include "abi/guc_klvs_abi.h"
> > >> #include "regs/xe_lrc_layout.h"
> > >> #include "xe_assert.h"
> > >> @@ -400,6 +401,7 @@ static void __guc_exec_queue_policy_add_##func(struct exec_queue_policy *policy,
> > >> MAKE_EXEC_QUEUE_POLICY_ADD(execution_quantum, EXECUTION_QUANTUM)
> > >> MAKE_EXEC_QUEUE_POLICY_ADD(preemption_timeout, PREEMPTION_TIMEOUT)
> > >> MAKE_EXEC_QUEUE_POLICY_ADD(priority, SCHEDULING_PRIORITY)
> > >> +MAKE_EXEC_QUEUE_POLICY_ADD(slpc_ctx_freq_req, SLPM_GT_FREQUENCY)
> > >> #undef MAKE_EXEC_QUEUE_POLICY_ADD
> > >>
> > >> static const int xe_exec_queue_prio_to_guc[] = {
> > >> @@ -414,14 +416,19 @@ static void init_policies(struct xe_guc *guc, struct xe_exec_queue *q)
> > >> 	struct exec_queue_policy policy;
> > >> 	enum xe_exec_queue_priority prio = q->sched_props.priority;
> > >> 	u32 timeslice_us = q->sched_props.timeslice_us;
> > >> +	u32 slpc_ctx_freq_req = 0;
> > >> 	u32 preempt_timeout_us = q->sched_props.preempt_timeout_us;
> > >>
> > >> 	xe_gt_assert(guc_to_gt(guc), exec_queue_registered(q));
> > >>
> > >> +	if (q->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT)
> > > Use the internal definition here as well.
> > >
> > >> +		slpc_ctx_freq_req |= SLPC_EXEC_QUEUE_FREQ_REQ_IS_COMPUTE;
> > > From the code above, I feel the user hint flag would be better named
> > > something like DRM_XE_EXEC_QUEUE_COMPUTE_HINT.
> > >
> > > I feel this is slightly better than LOW_LATENCY, as low latency is a
> > > result of applying a frequency policy designed for compute tasks.
> > We had this discussion while implementing it on the i915 side. It's
> > better to keep it engine non-specific, so that non-compute (3D)
> > contexts can use it as well.
>
> Ok, then COMPUTE is not the correct word.
>
> I still think LOW_LATENCY is not the right wording here, i.e., during
> submission the result of applying such a policy is low latency, but during
> execution the result is faster compute.
>
> A few other wordings that come to mind are: HIGH_FREQUENCY,
> AGGRESSIVE_FREQUENCY, INTENSIVE_FREQUENCY or VIGOROUS_FREQUENCY.
>
> If LOW_LATENCY was already accepted on the i915 side, maybe keep it.
Sure
Tejas
>
> > >
> > > A related question is, does this new policy affect power consumption?
> > > Usually higher frequency implies higher power.
> > >
> > > Do we only want to keep such a frequency policy during submission time,
> > > or is the same policy intended for the whole execution time?
> > SLPC will switch to aggressive frequency strategies when that
> > particular context is switched in. Other contexts are not affected.
> > There is definitely a power impact since we are aggressively ramping
> > the frequency for this context.
>
> So the policy is applied to both submission and execution for that context.
>
> > >
> > > The answer to the above question would lead to some interface design,
> > > such as whether we need to introduce an interface to disable the
> > > compute frequency policy.
> >
> > It can be disabled on the fly for any context, since this is just
> > another SLPC param. However, I believe the upstream policy is to not
> > mutate the context after it has been created.
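
For reference, toggling it later would essentially mean re-sending the context
policies without the compute bit set, roughly like the hypothetical sketch
below (untested; EXEC_QUEUE_FLAG_LOW_LATENCY is the internal flag name
suggested above, not an existing define):

	/*
	 * Hypothetical helper, not part of this patch: drop the hint and
	 * re-send the scheduling policy KLVs for a registered queue.
	 */
	static void exec_queue_clear_low_latency(struct xe_guc *guc,
						 struct xe_exec_queue *q)
	{
		q->flags &= ~EXEC_QUEUE_FLAG_LOW_LATENCY;
		init_policies(guc, q);
	}

But agreed, we should keep the flag immutable after creation per the upstream
policy.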
>
>
> Ok, as long as we are aware of the power consumption change after we apply
> this policy, and people are not complaining about it, making it immutable is fine.
>
> Oak
>
> >
> > Thanks,
> >
> > Vinay.
> >
> > >
> > >
> > >> +
> > >> 	__guc_exec_queue_policy_start_klv(&policy, q->guc->id);
> > >> 	__guc_exec_queue_policy_add_priority(&policy, xe_exec_queue_prio_to_guc[prio]);
> > >> 	__guc_exec_queue_policy_add_execution_quantum(&policy, timeslice_us);
> > >> 	__guc_exec_queue_policy_add_preemption_timeout(&policy, preempt_timeout_us);
> > >> +	__guc_exec_queue_policy_add_slpc_ctx_freq_req(&policy, slpc_ctx_freq_req);
> > >>
> > >> 	xe_guc_ct_send(&guc->ct, (u32 *)&policy.h2g,
> > >> 		       __guc_exec_queue_policy_action_size(&policy), 0, 0);
> > >> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> > >> index f62689ca861a..bd0150d2200c 100644
> > >> --- a/include/uapi/drm/xe_drm.h
> > >> +++ b/include/uapi/drm/xe_drm.h
> > >> @@ -1097,6 +1097,7 @@ struct drm_xe_vm_bind {
> > >> * .engine_class = DRM_XE_ENGINE_CLASS_RENDER,
> > >> * };
> > >> * struct drm_xe_exec_queue_create exec_queue_create = {
> > >> + * .flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT or 0
> > >> * .extensions = 0,
> > >> * .vm_id = vm,
> > >> * .num_bb_per_exec = 1,
> > >> @@ -1110,7 +1111,6 @@ struct drm_xe_exec_queue_create {
> > >> #define DRM_XE_EXEC_QUEUE_EXTENSION_SET_PROPERTY	0
> > >> #define DRM_XE_EXEC_QUEUE_SET_PROPERTY_PRIORITY		0
> > >> #define DRM_XE_EXEC_QUEUE_SET_PROPERTY_TIMESLICE	1
> > >> -
> > >> 	/** @extensions: Pointer to the first extension struct, if any */
> > >> __u64 extensions;
> > >>
> > >> @@ -1123,6 +1123,7 @@ struct drm_xe_exec_queue_create {
> > >> /** @vm_id: VM to use for this exec queue */
> > >> __u32 vm_id;
> > >>
> > >> +#define DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT	(0x1 << 1)
> > > I wonder why the flag can't start from (0x1 << 0) here.
> > >
> > > Yes, I did see the v2 comment below:
> > >
> > > - DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT 1 is already planned
> > >   for another hint (Szymon)
> > >
> > > but this is regarding the definition of a flag used in exec_queue_create,
> > > and this is the first time we use this flag in *this* uAPI. It shouldn't
> > > conflict with UMD's usage of hint flags in *other* uAPI.
> > >
> > >> /** @flags: MBZ */
> > > Remove this MBZ comment
> > >
> > > Oak
> > >
> > >
> > >> __u32 flags;
> > >>
> > >> --
> > >> 2.34.1