[Intel-gfx] [PATCH] [RFC] drm/i915: Generate a hang error code

Wed Feb 5 17:18:30 CET 2014

On Wed, Feb 05, 2014 at 04:03:45PM +0000, Jesse Barnes wrote:
> On Wed, 5 Feb 2014 16:15:02 +0100
> Daniel Vetter <daniel at ffwll.ch> wrote:
> 
> > On Wed, Feb 05, 2014 at 02:59:08PM +0000, Jesse Barnes wrote:
> > > On Tue,  4 Feb 2014 12:18:55 +0000
> > > Ben Widawsky <benjamin.widawsky at intel.com> wrote:
> > > 
> > > > We get a large number of bugs which have a, "hey I have that too"
> > > > because they see a GPU hang in dmesg. While two machines of the same
> > > > model having a GPU hang is indeed a coincidence, it is far from enough
> > > > evidence to suggest they are the same.
> > > > 
> > > > In order to reduce this effect, and hopefully get people to file new bug
> > > > reports, clearly the error message itself has been insufficient (see ref
> > > > at the bottom for a new bug report with this characteristic).
> > > > 
> > > > The algorithm is purposely pretty naive. I don't think we need much in
> > > > order to avoid the problem I am trying to solve, and keeping it naive
> > > > gives us some ability to make a decent test case.
> > > 
> > > I like the direction of this.  If we can get some basic info into the
> > > dmesg part of things (the only part regular users will actually look
> > > at) we can probably avoid some of the "me too" action we see on general
> > > GPU hangs.  Having PID, comm, and some sort of hang signature are all
> > > good steps in that direction imo.
> > 
> > tbh I don't see much value in regular users trying to triage gpu hang. If
> > they're not damn sure that they have a dupe (which means same platform,
> > versions of the software stack and crashing games) I much prefer if they
> > just send in a duplicate bug for us to triage.
> > 
> > With the mis-design of bugzilla it's much harder to untangle a wrong
> > me-too than mark something as duplicate. And especially long-running bugs
> > are a royal pain if there's too much wrong me-too noise in there.
> > 
> > Not a comment on the patch itself, just a general comment wrt avoiding
> > me-too gpu hang reports.
> 
> So you're saying the GPU error decode tool should create a bug template
> for people so we don't get the "me too" reports?
> 
> What I see above is that it's really important to avoid the "me too"
> stuff, and to do it in such a way that false positives are minimized
> (e.g. the IPEHR bit Ubuntu used to use).  So I guess I don't see what's
> unconvincing here.  Today we have no way of differentiating w/o digging
> in to the error record, which users definitely won't do, and this patch
> seems like it could only help with that... so count me confused.

We have a full paragraph explaining to users exactly what they need to do.
They still me-too and fail to attach the error state. I don't see how adding
even more helps, since it never really did.

Anyway, patch merged since meh. I'd still like to see the same information
dumped into the error state though.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch