[Intel-gfx] [RFC 03/11] drm/i915: Add reset stats entry point for per-engine reset.

Daniel Vetter daniel at ffwll.ch
Tue Jun 16 08:55:34 PDT 2015


On Tue, Jun 16, 2015 at 02:54:49PM +0100, Chris Wilson wrote:
> On Tue, Jun 16, 2015 at 03:48:09PM +0200, Daniel Vetter wrote:
> > On Mon, Jun 08, 2015 at 06:33:59PM +0100, Chris Wilson wrote:
> > > On Mon, Jun 08, 2015 at 06:03:21PM +0100, Tomas Elf wrote:
> > > > In preparation for per-engine reset, add a way of setting context reset stats.
> > > > 
> > > > OPEN QUESTIONS:
> > > > 1. How do we deal with get_reset_stats and the GL robustness interface when
> > > > introducing per-engine resets?
> > > > 
> > > > 	a. Do we set context that cause per-engine resets as guilty? If so, how
> > > > 	does this affect context banning?
> > > 
> > > Yes. If the reset works quicker, then we can set a higher threshold for
> > > DoS detection, but we do still need DoS detection.
> > >  
> > > > 	b. Do we extend the publicly available reset stats to also contain
> > > > 	per-engine reset statistics? If so, would this break the ABI?
> > > 
> > > No. The get_reset_stats is targeted at the GL API and describing it in
> > > terms of whether my context is guilty or has been affected. That is
> > > orthogonal to whether the reset was on a single ring or the entire GPU -
> > > the question is how broad we want the "affected" to be. Ideally a
> > > per-context reset wouldn't necessarily impact others, except for the
> > > surfaces shared between them...
> > 
> > GL computes sharing sets itself; the kernel only tells it whether a given
> > context has been victimized, i.e. one of its batches was not properly
> > executed due to a reset after a hang.
> 
> So you don't think we should delete all pending requests that depend
> upon state from the hung request?

Tbh I haven't fully thought through what happens with partial resets.
Looking into the future with hardware faulting/svm it's clear that soonish
the kernel won't even be in a position to know dependencies. And userspace
already needs to take any kind of texture sharing into account when
computing certain arb_robustness values.
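To make the userspace side concrete, here is a hedged sketch of how a driver might fold the per-context counters the kernel reports into the single status value that glGetGraphicsResetStatusARB() hands to the application. The status codes are the real GL_ARB_robustness enums; the helper function and its counter arguments are hypothetical, assuming the kernel distinguishes batches that were executing at reset time from batches that were merely queued and discarded:

```c
#include <stdint.h>

/* Status codes defined by the GL_ARB_robustness extension spec. */
#define GL_NO_ERROR                   0x0000
#define GL_GUILTY_CONTEXT_RESET_ARB   0x8253
#define GL_INNOCENT_CONTEXT_RESET_ARB 0x8254

/*
 * Hypothetical helper: map the kernel's per-context view of a reset
 * (how many of our batches were active vs. merely pending when the
 * hang was handled) onto the GL reset status for this context.
 */
static uint32_t fold_reset_status(uint32_t batch_active,
                                  uint32_t batch_pending)
{
    if (batch_active)   /* one of our batches was executing: we are guilty */
        return GL_GUILTY_CONTEXT_RESET_ARB;
    if (batch_pending)  /* our queued work was discarded: innocent victim */
        return GL_INNOCENT_CONTEXT_RESET_ARB;
    return GL_NO_ERROR; /* this context was untouched by the reset */
}
```

Note this only answers "was my context affected"; widening "affected" to cover texture sharing across contexts is, as above, left for userspace to compute from its own sharing sets.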

Given that I'm leaning towards a lean implementation in the kernel of only
marking the actual victim batches/contexts and simply continuing to
execute everything else. That carries some risk of ending up in continual
resets if a bit of corruption causes all follow-up batches to fail, but
that's something we need to be able to handle (using a full-blown reset
where we throw away all the batches) anyway. And eventually even
escalating to refusing gpu accesses to repeat offenders.
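For reference, the existing per-GPU query this thread is discussing extending is the i915 reset-stats ioctl. A sketch of its layout, assuming the upstream i915_drm.h definition of the time (the comments on each field are my reading of the thread, not authoritative):

```c
#include <stdint.h>

/*
 * Sketch of the existing i915 reset-stats uAPI (i915_drm.h),
 * queried per context via DRM_IOCTL_I915_GET_RESET_STATS.
 */
struct drm_i915_reset_stats {
    uint32_t ctx_id;        /* in: context to query */
    uint32_t flags;         /* in: currently unused, must be zero */
    uint32_t reset_count;   /* out: global GPU reset counter (privileged) */
    uint32_t batch_active;  /* out: our batches executing when a hang hit */
    uint32_t batch_pending; /* out: our batches queued but discarded */
    uint32_t pad;
};
```

The open ABI question above is whether per-engine resets get folded into these same counters or exposed as new fields; keeping the existing fields' meaning ("was my context guilty/affected") would avoid breaking the GL robustness users.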

But definitely something we need to decide upon, and something which needs
to be carefully tested with nasty igts for all corner cases. And
preferably also at least some basic multi-context testcases on top of
mesa/libva robustness.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
