[Intel-gfx] [PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request

Fri Apr 27 20:20:25 UTC 2018

Quoting Michel Thierry (2018-04-27 21:12:38)
> On 4/27/2018 12:32 PM, Chris Wilson wrote:
> > Previously, we just reset the ring register in the context image such
> > that we could skip over the broken batch and emit the closing
> > breadcrumb. However, on resume the context image and GPU state would be
> > reloaded, which may have been left in an inconsistent state by the
> > reset. The presumption was that at worst it would just cause another
> > reset and skip again until it recovered, however it seems just as likely
> > to cause an unrecoverable hang. Instead of risking loading an incomplete
> > context image, restore it back to the default state.
> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> > Cc: Michał Winiarski <michal.winiarski at intel.com>
> > Cc: Michel Thierry <michel.thierry at intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > ---
> >   drivers/gpu/drm/i915/intel_lrc.c | 24 +++++++++++++++++-------
> >   1 file changed, 17 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index ce23d5116482..422b05290ed6 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -1804,8 +1804,8 @@ static void reset_common_ring(struct intel_engine_cs *engine,
> >                             struct i915_request *request)
> >   {
> >       struct intel_engine_execlists * const execlists = &engine->execlists;
> > -     struct intel_context *ce;
> >       unsigned long flags;
> > +     u32 *regs;
> >   
> >       GEM_TRACE("%s request global=%x, current=%d\n",
> >                 engine->name, request ? request->global_seqno : 0,
> > @@ -1855,14 +1855,24 @@ static void reset_common_ring(struct intel_engine_cs *engine,
> >        * future request will be after userspace has had the opportunity
> >        * to recreate its own state.
> >        */
> > -     ce = &request->ctx->engine[engine->id];
> > -     execlists_init_reg_state(ce->lrc_reg_state,
> > -                              request->ctx, engine, ce->ring);
> > +     regs = request->ctx->engine[engine->id].lrc_reg_state;
> > +     if (engine->default_state) {
> > +             void *defaults;
> > +
> > +             defaults = i915_gem_object_pin_map(engine->default_state,
> > +                                                I915_MAP_WB);
> > +             if (!IS_ERR(defaults)) {
> > +                     memcpy(regs,
> > +                            defaults + LRC_HEADER_PAGES * PAGE_SIZE,
> > +                            engine->context_size);
> Hi,
> 
> The context_size is taking into count the PP_HWSP page, do we also need 
> to rewrite the PP_HSWP? (or just the logical state).
> 
> Also regs is already pointing to the start of the logical state
> (vaddr + LRC_STATE_PN * PAGE_SIZE).

Yeah, I was aiming for just the register state, and had a nice little
off by one in comparing the macros.

> So if we want to overwrite from the PP_HWSP, then regs is not the right 
> offset, or if we only want to change the logical state then it should be 
> from 'defaults +  LRC_STATE_PN * PAGE_SIZE'.

Right, I don't think we need to scrub the HWSP, just the register state.
The context is lost at this point, and what I want to protect is the
read of the image following the reset. Afaik, we don't issue any reads
from PPHWSP.
-Chris