[Intel-gfx] [PATCH 2/3] drm/i915/guc: Add support for reset engine using GuC commands

Chris Wilson chris at chris-wilson.co.uk
Tue Oct 31 10:17:19 UTC 2017


Quoting Michel Thierry (2017-10-31 04:38:30)
> On 30/10/17 14:09, Chris Wilson wrote:
> > Quoting Michel Thierry (2017-10-30 18:56:15)
> >> This patch adds per engine reset and recovery (TDR) support when GuC is
> >> used to submit workloads to GPU.
> >>
> >> In the case of i915 directly submission to ELSP, driver manages hang
> >> detection, recovery and resubmission. With GuC submission these tasks
> >> are shared between driver and GuC. i915 is still responsible for detecting
> >> a hang, and when it does it only requests GuC to reset that Engine. GuC
> >> internally manages acquiring forcewake and idling the engine before actually
> >> resetting it.
> >>
> >> Once the reset is successful, i915 takes over again and handles resubmission.
> >> The scheduler in i915 knows which requests are pending so after resetting
> >> a engine, pending workloads/requests are resubmitted again.
> >>
> >> v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the
> >> non-guc funtion names.
> >>
> >> v3: Removed debug message about engine restarting from which request,
> >> since the new baseline do it regardless of submission mode. (Chris)
> >>
> >> v4: Rebase.
> >>
> >> v5: Do not pass unnecessary reporting flags to the fw (Jeff);
> >> tasklet_schedule(&execlists->irq_tasklet) handles the resubmit; rebase.
> > 
> > In your experience, how did our test coverage fare?
> > 
> > Could you use live_hangcheck effectively? (The drv_selftest would need
> > some hand holding to pass along guc options. But for livetesting we
> > should probably get to the point of being able to load/unload the guc
> > interface so that we cover both execlists and guc.) Did you find
> > "gem_exec_whisper --r hang*", did you try gem_concurrent_all?
> >   
> 
> live_hangcheck runs ok with guc (as long as i915_params.h has guc 
> submission enabled). Do you see a benefit on adding an option in 
> drv_selftest to override the submission mode? I can add it to my list.

I'd rather take the path of loading/unloading guc in the kernel to
ensure we have coverage of both on all platforms where that makes sense.
That way the kernel remains in control of the coverage it deems
necessary.

The alternative would be a for_each_param_value() loop in drv_selftests
which doesn't inspired me with forward/backward testing confidence.

For the time being, knowing that you did check live_hangcheck is enough
for me to cross it off the checklist.

The larger question behind "how did our test coverage fare" is did it
find any bugs during development? If not, we need more tests ;)
 
> You got me in gem_concurrent_all, I forgot to schedule it a few weeks ago.
> 
> >> +/**
> >> + * intel_guc_reset_engine() - ask GuC to reset an engine
> >> + * @engine:    engine to be reset
> >> + */
> >> +int intel_guc_reset_engine(struct intel_engine_cs *engine)
> >> +{
> >> +       struct drm_i915_private *dev_priv = engine->i915;
> >> +       struct intel_guc *guc = &dev_priv->guc;
> >> +       u32 data[7];
> >> +
> >> +       GEM_BUG_ON(!guc->execbuf_client);
> >> +
> >> +       data[0] = INTEL_GUC_ACTION_REQUEST_ENGINE_RESET;
> >> +       data[1] = engine->guc_id;
> >> +       data[2] = 0;
> >> +       data[3] = 0;
> >> +       data[4] = 0;
> >> +       data[5] = guc->execbuf_client->stage_id;
> >> +       data[6] = guc_ggtt_offset(guc->shared_data);
> >> +
> >> +       return intel_guc_send(guc, data, ARRAY_SIZE(data));
> > 
> > Is this a synchronous action? We expect that following the completion of
> > the reset routine, we are ready to reinit the hw. The same rule needs to
> > apply the guc, I think.
> 
> Right now the action is synchronous, the fw won't reply to the action 
> until all the steps are completed. It also is fast enough, I haven't 
> seen it time out (which would be promoted to full reset and reload the 
> fw). But, do you have a crystal ball?

Just REQUEST is a red flag that there may be an asynchronous REPLY/ACK.

Aside: we need to start planning/adding fault_injection tests.
intel_guc_send() seems a good choice for an if(should_fail()).
-Chris


More information about the Intel-gfx mailing list