[Intel-gfx] [RFC 02/11] drm/i915: Introduce uevent for full GPU reset.

Daniel Vetter daniel at ffwll.ch
Wed Jun 17 04:49:15 PDT 2015


On Tue, Jun 16, 2015 at 08:33:13PM +0100, Chris Wilson wrote:
> On Tue, Jun 16, 2015 at 06:32:22PM +0100, Tomas Elf wrote:
> > On 16/06/2015 17:55, Chris Wilson wrote:
> > >On Tue, Jun 16, 2015 at 04:43:55PM +0100, Tomas Elf wrote:
> > >>On 16/06/2015 14:43, Daniel Vetter wrote:
> > >>>On Mon, Jun 08, 2015 at 06:03:20PM +0100, Tomas Elf wrote:
> > >>>>The TDR ULT used to validate this patch series requires a special uevent for
> > >>>>full GPU resets in order to distinguish between different kinds of resets.
> > >>>>
> > >>>>Signed-off-by: Tomas Elf <tomas.elf at intel.com>
> > >>>
> > >>>Why duplicate the uevent we send out from i915_reset_and_wakeup? At least
> > >>>I can't spot what this gets us in addition to the existing one.
> > >>>-Daniel
> > >>
> > >>Look at this line:
> > >>>>+		reset_event[0] = kasprintf(GFP_KERNEL, "%s", "GPU RESET=0");
> > >>
> > >>It doesn't exist in reset_and_wakeup (specifically, the "GPU
> > >>RESET=0" part). It's a uevent that fires at the time of the actual
> > >>GPU reset (the GDRST register write). In the subsequent TDR commit we
> > >>add another one at the point of the actual engine reset, which also
> > >>includes information about exactly which engine was reset.
> > >>
> > >>The uevents in reset_and_wakeup only tell the user that an error has
> > >>been detected and that some kind of reset has happened; these new
> > >>uevents specify exactly what kind of reset that was. This
> > >>particular one is not very meaningful on its own, since there is
> > >>only one supported form of reset at this point, but once we add
> > >>engine reset support it becomes useful to be able to tell the types
> > >>of resets apart (GPU reset, RCS engine reset, VCS engine reset,
> > >>VCS2 engine reset, BCS engine reset, VECS engine reset).
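> > >>
> > >>For reference, a minimal sketch of how such a uevent gets emitted
> > >>(assuming dev is the struct drm_device; the kdev kobject path here
> > >>mirrors what reset_and_wakeup already uses):
> > >>
> > >>	char *reset_event[2];
> > >>
> > >>	/* Single KEY=value pair plus NULL terminator in the env array. */
> > >>	reset_event[0] = kasprintf(GFP_KERNEL, "%s", "GPU RESET=0");
> > >>	reset_event[1] = NULL;
> > >>
> > >>	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE,
> > >>			   reset_event);
> > >>
> > >>	kfree(reset_event[0]);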
> > >>
> > >>Does that make sense?
> > >
> > >The ultimate question is how do you envisage these uevents being used?
> > >
> > >At present, we have abrtd listening out for when to grab
> > >/sys/drm/cardX/error, and maybe for GL robustness (though I would
> > >imagine that if they thought such events useful, we would have had
> > >demands for a DRM event on the guilty/victim fd).
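> > >
> > >For illustration, a minimal sketch of that kind of listener (libudev;
> > >the "RESET" key is what the existing reset_and_wakeup uevent carries,
> > >error handling trimmed):
> > >
> > >	#include <libudev.h>
> > >	#include <poll.h>
> > >	#include <stdio.h>
> > >
> > >	int main(void)
> > >	{
> > >		struct udev *udev = udev_new();
> > >		struct udev_monitor *mon =
> > >			udev_monitor_new_from_netlink(udev, "udev");
> > >
> > >		/* Only wake up for drm subsystem events. */
> > >		udev_monitor_filter_add_match_subsystem_devtype(mon, "drm",
> > >								NULL);
> > >		udev_monitor_enable_receiving(mon);
> > >
> > >		struct pollfd pfd = {
> > >			.fd = udev_monitor_get_fd(mon),
> > >			.events = POLLIN,
> > >		};
> > >		for (;;) {
> > >			if (poll(&pfd, 1, -1) <= 0)
> > >				continue;
> > >
> > >			struct udev_device *dev =
> > >				udev_monitor_receive_device(mon);
> > >			if (!dev)
> > >				continue;
> > >
> > >			/* The uevent's KEY=value pairs show up as
> > >			 * device properties. */
> > >			const char *v =
> > >				udev_device_get_property_value(dev, "RESET");
> > >			if (v)
> > >				printf("reset uevent: RESET=%s\n", v);
> > >			udev_device_unref(dev);
> > >		}
> > >	}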
> > >
> > >Does it really make sense to send uevents for each of hang,
> > >partial-reset, and full-reset?
> > >-Chris
> > >
> > 
> > The reason we have such a detailed set of uevents is primarily for
> > testing purposes. Our internal VPG tests check for these uevents to
> > make sure that the expected recovery mode is actually being used.
> > Which makes sense, because the TDR driver code contains reset
> > promotion logic to decide what recovery mode to use, and if that
> > logic somehow gets broken the driver might go with the wrong
> > recovery mode. Thus, it's worth testing, and therefore those uevents
> > need to be there. Of course, I guess the argument "our internal VPG
> > tests do this" might not hold water since the tests haven't been
> > upstreamed? If that's the case then I don't have any opinion
> > about which uevent goes where, and we could go with whatever set of
> > uevents you prefer.
> 
> uevents aren't free, so if we can find a way to get coverage without
> waking up the buses, I'd be happy (though we shouldn't need that many
> reset uevents; it is mainly the principle of trying to use the right
> tool for the job and not just the first one that comes to hand). Also,
> uevents are by their very nature userspace ABI, so we should plan
> carefully how we expect them to be used.

Yeah, uevents are ABI with full backwards-compat guarantees; I don't want
to add ABI just for testing. What we've done for the arb robustness
testcases is batches with canary writes, to observe whether all the right
ones have been cancelled/killed depending upon where the hanging batch is.

Imo that's also more robust testing - we test against behavioural
expectations; tight coupling of the tests to the current kernel
implementation only causes long-term maintenance headaches.

E.g. a testcase that expects an engine reset:
- Submit a dummy workload on ring A followed by a canary write.
- Submit a hanging batch on ring B.
- Wait for the reset.
- Check that the canary write in ring A happened.

Testcase which expects a full-blown GPU reset:
- Same steps, but check that the canary write in ring A didn't happen
  (a full reset kills the pending work on all rings); see the sketch
  below.
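
A rough sketch of both variants (the submit/wait helpers are
hypothetical placeholders for the usual igt batch plumbing, not
existing API):

	#include <assert.h>
	#include <stdbool.h>
	#include <stdint.h>

	/* Hypothetical helpers standing in for the igt plumbing. */
	extern void submit_dummy_batch(int fd, int ring);
	extern uint32_t submit_canary_write(int fd, int ring);
	extern void submit_hanging_batch(int fd, int ring);
	extern void wait_for_reset(int fd);
	extern bool canary_landed(int fd, uint32_t handle);

	static void test_reset_scope(int fd, int ring_a, int ring_b,
				     bool expect_engine_reset)
	{
		/* Dummy workload followed by a canary write on ring A. */
		submit_dummy_batch(fd, ring_a);
		uint32_t canary = submit_canary_write(fd, ring_a);

		/* Hanging batch on ring B kicks off hang detection. */
		submit_hanging_batch(fd, ring_b);
		wait_for_reset(fd);

		if (expect_engine_reset)
			/* Per-engine reset only kills ring B. */
			assert(canary_landed(fd, canary));
		else
			/* Full GPU reset nukes work on all rings. */
			assert(!canary_landed(fd, canary));
	}

The same function covers both testcases; only the expectation flips.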

Vary as needed and correlate with the reset stats the kernel reports.
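
E.g. a minimal reset-stats query (this is the get-reset-stats ioctl
that GL robustness uses; sketch only, ctx_id 0 being the default
context):

	#include <drm/i915_drm.h>
	#include <xf86drm.h>
	#include <stdio.h>

	static void print_reset_stats(int fd)
	{
		struct drm_i915_reset_stats stats = { .ctx_id = 0 };

		if (drmIoctl(fd, DRM_IOCTL_I915_GET_RESET_STATS, &stats))
			return;

		/* batch_active/batch_pending count the hangs where this
		 * context was guilty/innocent respectively. */
		printf("resets=%u active=%u pending=%u\n",
		       stats.reset_count, stats.batch_active,
		       stats.batch_pending);
	}
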
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

