[Intel-gfx] [RFC 02/11] drm/i915: Introduce uevent for full GPU reset.

Chris Wilson chris at chris-wilson.co.uk
Wed Jun 17 05:51:10 PDT 2015


On Wed, Jun 17, 2015 at 01:49:15PM +0200, Daniel Vetter wrote:
> On Tue, Jun 16, 2015 at 08:33:13PM +0100, Chris Wilson wrote:
> > On Tue, Jun 16, 2015 at 06:32:22PM +0100, Tomas Elf wrote:
> > > On 16/06/2015 17:55, Chris Wilson wrote:
> > > >On Tue, Jun 16, 2015 at 04:43:55PM +0100, Tomas Elf wrote:
> > > >>On 16/06/2015 14:43, Daniel Vetter wrote:
> > > >>>On Mon, Jun 08, 2015 at 06:03:20PM +0100, Tomas Elf wrote:
> > > >>>>The TDR ULT used to validate this patch series requires a special uevent for
> > > >>>>full GPU resets in order to distinguish between different kinds of resets.
> > > >>>>
> > > >>>>Signed-off-by: Tomas Elf <tomas.elf at intel.com>
> > > >>>
> > > >>>Why duplicate the uevent we send out from i915_reset_and_wakeup? At least
> > > >>>I can't spot what this gets us in addition to the existing one.
> > > >>>-Daniel
> > > >>
> > > >>Look at this line:
> > > >>>>+		reset_event[0] = kasprintf(GFP_KERNEL, "%s", "GPU RESET=0");
> > > >>
> > > >>It doesn't exist in reset_and_wakeup (specifically, the "GPU
> > > >>RESET=0" part). It's a uevent that happens at the time of the actual
> > > >>GPU reset (GDRST register write). In the subsequent TDR commit we
> > > >>add another one at the point of the actual engine reset, which also
> > > >>includes information about exactly which engine was reset.
> > > >>
> > > >>The uevents in reset_and_wakeup only tell the user that an error has
> > > >>been detected and that some kind of reset has happened; these new
> > > >>uevents specify exactly what kind of reset has happened. This
> > > >>particular one on its own is not very meaningful since there is
> > > >>only one supported form of reset at this point, but once we add
> > > >>engine reset support it's useful to be able to discern the types of
> > > >>resets from each other (GPU reset, RCS engine reset, VCS engine
> > > >>reset, VCS2 engine reset, BCS engine reset, VECS engine reset).
> > > >>
> > > >>Does that make sense?
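
For reference, the plumbing described above boils down to something like
the sketch below. This is a hand-written illustration rather than code
from the patch: the helper name is made up, and the kobject is assumed to
be the same one i915_reset_and_wakeup already sends its uevents on.

static void i915_send_reset_uevent(struct drm_device *dev, const char *type)
{
	char *reset_event[2];

	/* e.g. type = "GPU RESET=0" for a full GPU reset */
	reset_event[0] = kasprintf(GFP_KERNEL, "%s", type);
	reset_event[1] = NULL;	/* env list must be NULL-terminated */

	if (reset_event[0]) {
		kobject_uevent_env(&dev->primary->kdev->kobj,
				   KOBJ_CHANGE, reset_event);
		kfree(reset_event[0]);
	}
}

Userspace would then see "GPU RESET=0" (or an engine-specific string) in
the uevent environment and could tell the reset types apart.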
> > > >
> > > >The ultimate question is how do you envisage these uevents being used?
> > > >
> > > >At present, we have abrtd listening out for when to grab the
> > > >/sys/drm/cardX/error and maybe for GL robustness (though I would imagine
> > > >if they thought such events useful we would have had demands for a DRM
> > > >event on the guilty/victim fd).
> > > >
> > > >Does it really make sense to send uevents for each of hang, partial-reset,
> > > >and full-reset?
> > > >-Chris
> > > >
> > > 
> > > The reason we have such a detailed set of uevents is primarily for
> > > testing purposes. Our internal VPG tests check for these uevents to
> > > make sure that the expected recovery mode is actually being used.
> > > That makes sense, because the TDR driver code contains reset
> > > promotion logic to decide which recovery mode to use, and if that
> > > logic somehow gets broken the driver might go with the wrong
> > > recovery mode. Thus, it's worth testing, and those uevents
> > > need to be there. Of course, I guess the argument "our internal VPG
> > > tests do this" might not hold water since the tests haven't been
> > > upstreamed? If that's the case then I guess I don't have any opinion
> > > about what uevent goes where and we could go with whatever set of
> > > uevents you prefer.
> > 
> > uevents aren't free, so if we can find a way to get coverage without
> > waking up the buses, I'd be happy (though we shouldn't need that many
> > reset uevents; it is mainly the principle of trying to use the right
> > tool for the job and not just the first one that comes to hand). Also,
> > uevents are by their very nature userspace ABI, so we should plan
> > carefully how we expect them to be used.
> 
> Yeah, uevents are ABI with full backwards-compat guarantees; I don't want
> to add ABI just for testing. What we've done for the arb robustness
> testcases is batches with canary writes to observe whether all the right
> ones have been cancelled/killed depending upon where the hanging batch is.
> 
> Imo that's also more robust testing - we test against behavioural
> expectations; tight coupling of the tests with the current kernel
> implementation only causes long-term maintenance headaches.
> 
> E.g. a testcase that expects an engine reset:
> - Submit a dummy workload on ring A followed by a canary write.
> - Submit hanging batch on ring B.
> - Wait for reset.
> - Check that canary write in ring A happened.
> 
> A testcase which expects a full-blown GPU reset runs the same steps, except:
> - Check that the canary write in ring A did not happen.
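
In rough C, the shape of those two tests might look like the sketch below.
All the helpers (submit_dummy_and_canary(), submit_hang(), wait_for_reset(),
canary_landed()) and the ring ids are placeholders made up for illustration,
not actual igt or kernel API:

#include <assert.h>
#include <stdbool.h>

enum ring { RING_A, RING_B };	/* illustrative engine ids */

/* Hypothetical helpers, prototypes only. */
extern void submit_dummy_and_canary(int fd, enum ring r); /* dummy batch + canary write */
extern void submit_hang(int fd, enum ring r);	/* batch guaranteed to hang */
extern void wait_for_reset(int fd);		/* block until recovery completes */
extern bool canary_landed(int fd, enum ring r);	/* did the canary write retire? */

static void test_engine_reset(int fd)
{
	submit_dummy_and_canary(fd, RING_A);
	submit_hang(fd, RING_B);
	wait_for_reset(fd);
	/* A per-engine reset only nukes ring B; ring A's work survives. */
	assert(canary_landed(fd, RING_A));
}

static void test_full_gpu_reset(int fd)
{
	submit_dummy_and_canary(fd, RING_A);
	submit_hang(fd, RING_B);
	wait_for_reset(fd);
	/* A full GPU reset cancels pending work on all rings, so the
	 * canary on ring A never lands (but see the caveat below about
	 * restarting rings with pending requests). */
	assert(!canary_landed(fd, RING_A));
}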

Note that this presumes that we don't attempt to restart the rings with
pending requests (which we could, and at one point were working on).

For more examples of the former, see gem_concurrent_all,
gem_reloc_vs_gpu, gem_evict_everything. These are coarse correctness
tests of different parts of the API and not the targeted TDR coverage
tests.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

