[Intel-gfx] [PATCH v3 4/4] drm/i915: Improve long running OCL w/a for GuC submission

John Harrison john.c.harrison at intel.com
Wed Mar 9 21:16:34 UTC 2022


On 3/8/2022 01:41, Tvrtko Ursulin wrote:
> On 03/03/2022 22:37, John.C.Harrison at Intel.com wrote:
>> From: John Harrison <John.C.Harrison at Intel.com>
>>
>> A workaround was added to the driver to allow OpenCL workloads to run
>> 'forever' by disabling pre-emption on the RCS engine for Gen12.
>> It is not totally unbound as the heartbeat will kick in eventually
>> and cause a reset of the hung engine.
>>
>> However, this does not work well in GuC submission mode. In GuC mode,
>> the pre-emption timeout is how GuC detects hung contexts and triggers
>> a per engine reset. Thus, disabling the timeout means also losing all
>> per engine reset ability. A full GT reset will still occur when the
>> heartbeat finally expires, but that is a much more destructive and
>> undesirable mechanism.
>>
>> The purpose of the workaround is actually to give OpenCL tasks longer
>> to reach a pre-emption point after a pre-emption request has been
>> issued. This is necessary because Gen12 does not support mid-thread
>> pre-emption and OpenCL can have long running threads.
>>
>> So, rather than disabling the timeout completely, just set it to a
>> 'long' value.
>>
>> v2: Review feedback from Tvrtko - must hard code the 'long' value
>> instead of determining it algorithmically. So make it an extra CONFIG
>> definition. Also, remove the execlist centric comment from the
>> existing pre-emption timeout CONFIG option given that it applies to
>> more than just execlists.
>>
>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com> (v1)
>> Acked-by: Michal Mrozek <michal.mrozek at intel.com>
>> ---
>>   drivers/gpu/drm/i915/Kconfig.profile      | 26 +++++++++++++++++++----
>>   drivers/gpu/drm/i915/gt/intel_engine_cs.c |  9 ++++++--
>>   2 files changed, 29 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
>> index 39328567c200..7cc38d25ee5c 100644
>> --- a/drivers/gpu/drm/i915/Kconfig.profile
>> +++ b/drivers/gpu/drm/i915/Kconfig.profile
>> @@ -57,10 +57,28 @@ config DRM_I915_PREEMPT_TIMEOUT
>>       default 640 # milliseconds
>>       help
>>         How long to wait (in milliseconds) for a preemption event to occur
>> -      when submitting a new context via execlists. If the current context
>> -      does not hit an arbitration point and yield to HW before the timer
>> -      expires, the HW will be reset to allow the more important context
>> -      to execute.
>> +      when submitting a new context. If the current context does not hit
>> +      an arbitration point and yield to HW before the timer expires, the
>> +      HW will be reset to allow the more important context to execute.
>> +
>> +      This is adjustable via
>> +      /sys/class/drm/card?/engine/*/preempt_timeout_ms
>> +
>> +      May be 0 to disable the timeout.
>> +
>> +      The compiled in default may get overridden at driver probe time on
>> +      certain platforms and certain engines which will be reflected in the
>> +      sysfs control.
>> +
>> +config DRM_I915_PREEMPT_TIMEOUT_COMPUTE
>> +    int "Preempt timeout for compute engines (ms, jiffy granularity)"
>> +    default 7500 # milliseconds
>> +    help
>> +      How long to wait (in milliseconds) for a preemption event to occur
>> +      when submitting a new context to a compute capable engine. If the
>> +      current context does not hit an arbitration point and yield to HW
>> +      before the timer expires, the HW will be reset to allow the more
>> +      important context to execute.
>>           This is adjustable via
>>         /sys/class/drm/card?/engine/*/preempt_timeout_ms
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> index 4185c7338581..cc0954ad836a 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> @@ -438,9 +438,14 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
>>       engine->props.timeslice_duration_ms =
>>           CONFIG_DRM_I915_TIMESLICE_DURATION;
>> -    /* Override to uninterruptible for OpenCL workloads. */
>> +    /*
>> +     * Mid-thread pre-emption is not available in Gen12. Unfortunately,
>> +     * some OpenCL workloads run quite long threads. That means they get
>> +     * reset due to not pre-empting in a timely manner. So, bump the
>> +     * pre-emption timeout value to be much higher for compute engines.
>> +     */
>>       if (GRAPHICS_VER(i915) == 12 && (engine->flags & I915_ENGINE_HAS_RCS_REG_STATE))
>> -        engine->props.preempt_timeout_ms = 0;
>> +        engine->props.preempt_timeout_ms = CONFIG_DRM_I915_PREEMPT_TIMEOUT_COMPUTE;
>
> I wouldn't go as far as adding a config option since as it is it only 
> applies to Gen12 but Kconfig text says nothing about that. And I am 
> not saying you should add a Gen12 specific config option, that would 
> be weird. So IMO just drop it.
>
You were the one arguing that the driver was illegally overriding the 
user's explicitly chosen settings, including the compile time config 
options. Just having a hardcoded magic number in the driver is the 
absolute worst kind of override there is.

And technically, the config option is not Gen12 specific. It is actually 
compute specific, hence the name. It just so happens that only Gen12 
onwards has compute engines. I can add an extra line to the Kconfig 
description if you want: "NB: compute engines only exist on Gen12 onwards 
but do include the RCS engine on Gen12".
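
To make the point concrete, here is a minimal standalone sketch of the 
override as the patch has it. It is not driver code: the struct, the 
function names and the flag bit are hypothetical stand-ins, the two 
constants just mirror the Kconfig defaults quoted above, and it assumes 
the generic timeout is applied first and then overridden for compute 
capable engines.

```c
#include <assert.h>

/* Stand-ins for the Kconfig defaults quoted in the patch (milliseconds). */
#define PREEMPT_TIMEOUT_MS          640u   /* DRM_I915_PREEMPT_TIMEOUT */
#define PREEMPT_TIMEOUT_COMPUTE_MS  7500u  /* DRM_I915_PREEMPT_TIMEOUT_COMPUTE */

/* Hypothetical flag bit; not the real I915_ENGINE_HAS_RCS_REG_STATE value. */
#define ENGINE_HAS_RCS_REG_STATE    (1u << 0)

/* Hypothetical, heavily reduced model of an engine's properties. */
struct engine_model {
	unsigned int graphics_ver;
	unsigned int flags;
	unsigned int preempt_timeout_ms;
};

/*
 * Apply the generic default, then bump it for compute capable engines on
 * Gen12. The value comes from a named, user-visible setting rather than
 * from a magic number buried in the code.
 */
void engine_model_setup(struct engine_model *e)
{
	e->preempt_timeout_ms = PREEMPT_TIMEOUT_MS;

	if (e->graphics_ver == 12 && (e->flags & ENGINE_HAS_RCS_REG_STATE))
		e->preempt_timeout_ms = PREEMPT_TIMEOUT_COMPUTE_MS;
}

/* Convenience wrapper: the timeout an engine with these traits ends up with. */
unsigned int model_timeout(unsigned int graphics_ver, unsigned int flags)
{
	struct engine_model e = { graphics_ver, flags, 0 };

	engine_model_setup(&e);
	return e.preempt_timeout_ms;
}
```

With these stand-ins, model_timeout(12, ENGINE_HAS_RCS_REG_STATE) yields 
the compute value and every other combination keeps the generic one. 
Swapping the named constant for a literal would not change behaviour, 
only whether the user can see and change the number.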

John.


> Regards,
>
> Tvrtko
>
>>         /* Cap properties according to any system limits */
>>   #define CLAMP_PROP(field) \


