[Intel-gfx] [PATCH v3 00/16] arb robustness enablers v3

Thu Apr 25 01:27:15 CEST 2013

On Thu, Apr 04, 2013 at 06:32:32PM +0300, Mika Kuoppala wrote:
> Hi,
> 
> I have addressed all the feedback I got from v2 series.
> 
> Most notably the interface has changed.
> 
> Now those users who don't care about the exact counts can use batch_pending
> and batch_active like they would be flags (see i-g-t testcase),
> assuming they destroy the context after hang is dealt with.
> For those users who want more detailed heuristics before deciding some action,
> they can use batch_pending and batch_active as counts.
> 
> Non process context related statistics are only available
> if CAP_SYS_ADMIN is hold.
> 
> Whole series can be found in:
> https://github.com/mkuoppal/linux/commits/arb-robustness
> 
> Test case for i-g-t is in:
> https://github.com/mkuoppal/intel-gpu-tools/commits/arb-robustness
> 
> Thanks for your feedback,
> -Mika

This started as a short email, but grew rather fast... sorry.

I've commented on some stuff up through 5, and I think maybe that was
lost a bit in the noise. So let me restate it here, and once we get
moving on those first 5, I'll pick the review back up.

Patch #1 looks functionally correct, but I don't really think we should
merge it (see comments in ml). It is however:
Reviewed-by: Ben Widawsky <ben at bwidawsk.net>

Patch #2 looks functionally correct, though here too, I don't think it's
worth merging.
Reviewed-by: Ben Widawsky <ben at bwidawsk.net>

I see no reason for, patch #2, and only reasons against patch #1 I've
given my $.02, and I'll leave it up to our maintainer to decide whether
he wants to merge.

Getting patch 3 upstream sooner is really important for enabling PPGTT,
and I think patch 3 is good to go. Patch #3 +
http://cgit.freedesktop.org/~bwidawsk/drm-intel/commit/?h=ppgtt-ctx&id=35da04612650fb78b21811a13f88ad23ea1f64be
is regression testable via the existing context tests, and gives me what
I need for PPGTT (ie. they're already in my series).

Patch #3 is already SOB me, and is good to go.

Patch 4, and 5 change the way in which we detect hangs, which is scary
because we haven't really touched that much in quite a while. I'd like to
see that as just patches by themselves. The idea is solid, I feel, but
I'd like it to sit on it a bit to make sure all GPU clients can deal
with that. As a side note on patch 4, and 5, I'll point out that I think
you really want the GPU watchdog, which we can't use because the blitter
doesn't have one however at some point I believe it is added (maybe the
product that added it has even been released now) - so you should just
keep that in mind. (Shameless plug:
http://lists.freedesktop.org/archives/intel-gfx/2012-July/019012.html)

Assuming nobody screams about their GPU client which had a legitimate 3
seconds of work queued only to be reset, 4, and 5 are:
Reviewed-by: Ben Widawsky <ben at bwidawsk.net>

One thing you could do on 5 to make me a little less skittish is leave
the code in for acthd, and instdone and simply use it for bug triage.
For instance, ACTHD/instdone moving is likely a very different type of hang than
ACTHD/instdone not moving.

In short, here's what I'd recommend:
===================================
Submit 4, and 5 as a separate mini series explaining the benefits of tracking
hangs by seqno only, and what it intends to accomplish. Take my
suggestion about keeping ACTHD or not. Throw that onto dinq soon and
make sure nobody yells. Asign all hangs for the

Merge 1-3 (or drop 1 & 2, and rework 3 if someone agrees about 1 & 2).
Add the link above with the patch from Chris for good measure to catch
regressions.

Also while we're at it,
>   drm/i915: detect hang using per ring hangcheck_score
This patch introduces a warning:
'i915_hangcheck_hung' defined but not used

[snip]
-- 
Ben Widawsky, Intel Open Source Technology Center