[Intel-gfx] [PATCH 07/11] drm/i915/tdr: Add support for per engine reset recovery
Arun Siluvery
arun.siluvery at linux.intel.com
Wed Jul 27 11:49:13 UTC 2016
On 26/07/2016 22:51, Chris Wilson wrote:
> On Tue, Jul 26, 2016 at 05:40:53PM +0100, Arun Siluvery wrote:
>> This change implements support for per-engine reset as an initial, less
>> intrusive hang recovery option to be attempted before falling back to the
>> legacy full GPU reset recovery mode if necessary. This is only supported
>> from Gen8 onwards.
>>
>> Hangchecker determines which engines are hung and invokes error handler to
>> recover from it. Error handler schedules recovery for each of those engines
>> that are hung. The recovery procedure is as follows,
>> - force engine to idle: this is done by issuing a reset request
>> - identifies the request that caused the hang and it is dropped
>> - reset and re-init engine
>> - restart submissions to the engine
>>
>> If engine reset fails then we fall back to heavy weight full gpu reset
>> which resets all engines and reinitiazes complete state of HW and SW.
>>
>> Possible reasons for failure,
>> - engine not ready for reset
>> - if the HW and SW doesn't agree on the context that caused the hang
>> - reset itself failed for some reason
>>
>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf at intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery at linux.intel.com>
>> ---
>> drivers/gpu/drm/i915/i915_drv.c | 55 +++++++++++++++++++++++++++++++++----
>> drivers/gpu/drm/i915/i915_drv.h | 3 ++
>> drivers/gpu/drm/i915/i915_gem.c | 2 +-
>> drivers/gpu/drm/i915/intel_lrc.h | 4 +++
>> drivers/gpu/drm/i915/intel_uncore.c | 33 ++++++++++++++++++++++
>> 5 files changed, 90 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index 5c20e5d..8151aa9 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1824,17 +1824,60 @@ error:
>> * Returns zero on successful reset or otherwise an error code.
>> *
>> * Procedure is fairly simple:
>> - * - force engine to idle
>> - * - save current state which includes head and current request
>> - * - reset engine
>> - * - restore saved state and resubmit context
>> + * - force engine to idle: this is done by issuing a reset request
>> + * - identifies the request that caused the hang and it is retired
>> + * - reset engine
>> + * - restart submissions to the engine
>> */
>> int i915_reset_engine(struct intel_engine_cs *engine)
>> {
>> + struct drm_i915_private *dev_priv = engine->i915;
>> int ret;
>>
>> - /* FIXME: replace me with engine reset sequence */
>> - ret = -ENODEV;
>> + /* Ensure irq handler finishes or is cancelled. */
>> + tasklet_kill(&engine->irq_tasklet);
>> +
>> + i915_gem_reset_engine_status(engine);
>
> Not yet safe to be used lockless.
>
engine is hung and intel_lrc_irq_handler() also won't be running at this
point and we should've received all seqno updates, is it still not safe?
do I really need struct_mutex for this? only caller is i915_gem_reset()
and it takes it.
>> + /*Take wake lock to prevent power saving mode */
>
> Why else would you be taking it? We know the wakelock prevents the GT
> power well disappearing, the question is why is that important now
> across the entirety of this sequence? That's what you explain in
> comments.
>
yes, not useful. It is carried over from initial patch but I agree I
should've updated this.
>> + intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>> +
>> + ret = intel_request_for_reset(engine);
>> + if (ret) {
>> + DRM_ERROR("Failed to disable %s\n", engine->name);
>> + goto out;
>> + }
>> +
>> + ret = intel_execlists_reset_prepare(engine);
>> + if (ret)
>> + goto enable_engine;
>> +
>> + ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
>> + if (ret) {
>> + DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
>> + goto enable_engine;
>> + }
>> +
>> + ret = engine->init_hw(engine);
>> + if (ret)
>> + goto out;
>> +
>> + /* Restart submissions to the engine after reset */
>> + intel_execlists_restart_submission(engine);
>
> Clumsy. See above as we already have an entry point that can initiate
> the hw.
>
I think this is a separate step from init_hw().
In init_hw() we initialize hw state and keep it ready then we feed
submissions.
>> +enable_engine:
>> + /*
>> + * we only need to enable engine if we cannot prepare engine for
>> + * reset or reset fails. If the reset is successful, engine gets
>> + * enabled automatically so we can skip this step.
>> + */
>> + if (ret)
>
> Then why is the normal path coming through here?
>
>> + intel_clear_reset_request(engine);
>> +
there is no harm in doing this even when reset is successful, hence
included in the normal path. I will remove ret check to avoid confusion.
>> +out:
>> + /* Wake up anything waiting on this engine's queue */
>> + intel_engine_wakeup(engine);
>
> ? Random call with incorrect locking. What do you think you are
> achieving here because it's not what the comment implies.
sorry, I will come back on this one, I need to better understand few
more things.
>
>> + intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>>
>> return ret;
>> }
>> +/*
>> + * On gen8+ a reset request has to be issued via the reset control register
>> + * before a GPU engine can be reset in order to stop the command streamer
>> + * and idle the engine. This replaces the legacy way of stopping an engine
>> + * by writing to the stop ring bit in the MI_MODE register.
>> + */
>> +int intel_request_for_reset(struct intel_engine_cs *engine)
>
> intel_engine_reset_begin()
>
> It would be preferrable if we can avoid using request here as we are
> talking about reseting the hung request elsewhere in this chain.
>
the term request is overloaded but here infact we are actually
requesting the hw to prepare for reset and wait for it to acknowledge.
intel_engine_reset_begin() sounds fine.
>> +{
>> + if (!intel_has_engine_reset(engine->i915)) {
>> + DRM_ERROR("Engine Reset not supported on Gen%d\n",
>> + INTEL_INFO(engine->i915)->gen);
>
> ERROR?
This msg is probably unnecessary as we go for engine reset only when it
is available.
>
>> + return -EINVAL;
> return -ENODEV;
ok.
>
>> + }
>> +
>> + return gen8_request_engine_reset(engine);
>> +}
>
> gen8_engine_reset_begin()
>
>> +/*
>> + * It is possible to back off from a previously issued reset request by simply
>> + * clearing the reset request bit in the reset control register.
>> + */
>> +int intel_clear_reset_request(struct intel_engine_cs *engine)
>
> intel_engine_reset_cancel()
>
>> +{
>> +{
>> + if (!intel_has_engine_reset(engine->i915)) {
>> + DRM_ERROR("Request to clear reset not supported on Gen%d\n",
>> + INTEL_INFO(engine->i915)->gen);
>> + return -EINVAL;
>> + }
>> +
>> + gen8_unrequest_engine_reset(engine);
>
> gen8_engine_reset_cancel()
>
> (only suggestions for avoiding having confusion over requests)
ok, thanks for the renaming suggestions.
regards
Arun
>
>
More information about the Intel-gfx
mailing list