[Intel-gfx] [RFC 03/11] drm/i915: Add reset stats entry point for per-engine reset.

Dave Gordon david.s.gordon at intel.com
Thu Jun 18 04:12:00 PDT 2015


On 16/06/15 14:54, Chris Wilson wrote:
> On Tue, Jun 16, 2015 at 03:48:09PM +0200, Daniel Vetter wrote:
>> On Mon, Jun 08, 2015 at 06:33:59PM +0100, Chris Wilson wrote:
>>> On Mon, Jun 08, 2015 at 06:03:21PM +0100, Tomas Elf wrote:
>>>> In preparation for per-engine reset, add a way of setting context reset stats.
>>>>
>>>> OPEN QUESTIONS:
>>>> 1. How do we deal with get_reset_stats and the GL robustness interface when
>>>> introducing per-engine resets?
>>>>
>>>> 	a. Do we mark contexts that cause per-engine resets as guilty? If so, how
>>>> 	does this affect context banning?
>>>
>>> Yes. If the reset works quicker, then we can set a higher threshold for
>>> DoS detection, but we still do need DoS detection?
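
Just to make that trade-off concrete: what gets tuned here is essentially a
hangs-per-period threshold per context. The sketch below is purely
illustrative -- none of these names exist in the driver -- but it shows the
knob that a cheaper per-engine reset would let us relax:

/* Illustration only -- not the actual i915 banning code.  The point is
 * that the DoS check reduces to "too many hangs from one context within
 * a period"; a quicker per-engine reset would let MAX_HANGS_PER_PERIOD
 * (or the period itself) be more generous.
 */
#include <stdbool.h>

#define BAN_PERIOD_SECS		60
#define MAX_HANGS_PER_PERIOD	5

struct hang_stats {
	unsigned long period_start;	/* seconds, e.g. from a monotonic clock */
	unsigned int hangs_in_period;
};

static bool context_should_be_banned(struct hang_stats *hs, unsigned long now)
{
	if (now - hs->period_start > BAN_PERIOD_SECS) {
		hs->period_start = now;
		hs->hangs_in_period = 0;
	}
	return ++hs->hangs_in_period > MAX_HANGS_PER_PERIOD;
}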
>>>  
>>>> 	b. Do we extend the publicly available reset stats to also contain
>>>> 	per-engine reset statistics? If so, would this break the ABI?
>>>
>>> No. The get_reset_stats is targeted at the GL API, describing it in
>>> terms of whether my context is guilty or has been affected. That is
>>> orthogonal to whether the reset was on a single ring or the entire GPU -
>>> the question is how broad we want the "affected" to be. Ideally a
>>> per-context reset wouldn't necessarily impact others, except for the
>>> surfaces shared between them...
>>
>> GL computes sharing sets itself; the kernel only tells it whether a given
>> context has been victimized, i.e. one of its batches was not properly
>> executed due to reset after a hang.
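
(For reference, that per-context view is all the existing ioctl exposes. A
rough userspace sketch of the query follows -- the struct, the ioctl number
and the batch_active/batch_pending fields are the current uAPI, while the
enum and the function wrapping them are just mine:)

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <i915_drm.h>

enum reset_status { RESET_NONE, RESET_GUILTY, RESET_VICTIM };

/* What the GL robustness path boils down to on the userspace side:
 * ask for this context's reset stats and report guilty vs. victim. */
static enum reset_status context_reset_status(int fd, uint32_t ctx_id)
{
	struct drm_i915_reset_stats stats;

	memset(&stats, 0, sizeof(stats));
	stats.ctx_id = ctx_id;

	if (drmIoctl(fd, DRM_IOCTL_I915_GET_RESET_STATS, &stats))
		return RESET_NONE;		/* query failed; report nothing */

	if (stats.batch_active)			/* a batch from this ctx was executing */
		return RESET_GUILTY;
	if (stats.batch_pending)		/* a batch from this ctx was skipped */
		return RESET_VICTIM;
	return RESET_NONE;
}

Note that nothing in there says whether the reset was a single ring or the
whole GPU, which is exactly the point above.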
> 
> So you don't think we should delete all pending requests that depend
> upon state from the hung request?
> -Chris

John Harrison & I discussed this yesterday; he's against doing so (even
though the scheduler is ideally placed to do it, if that were actually
the preferred policy). The primary argument (as I see it) is that you
actually don't and can't know the nature of an apparent dependency
between batches that share a buffer object. There are at least three cases:

1. "tightly-coupled": the dependent batch is going to rely on data
produced by the earlier batch. In this case, GIGO applies and the
results will be undefined, possibly including a further hang. Subsequent
batches presumably belong to the same or a closely-related
(co-operating) task, and killing them might be a reasonable strategy here.

2. "loosely-coupled": the dependent batch is going to access the data,
but not in any way that depends on the content (for example, blitting a
rectangle into a composition buffer). The result will be wrong, but only
in a limited way (e.g. the window belonging to the faulty application will
appear corrupted). The dependent batches may well belong to unrelated
system tasks (e.g. X or SurfaceFlinger) and killing them is probably not
justified.

3. "uncoupled": the dependent batch wants the /buffer/, not the data in
it (most likely a framebuffer or similar object). Any incorrect data in
the buffer is irrelevant. Killing off subsequent batches would be wrong.

Buffer access mode (readonly, read/write, writeonly) might allow us to
distinguish these somewhat, but probably not enough to help make the
right decision. So the default must be *not* to kill off dependants
automatically, but if the failure does propagate in such a way as to
cause further consequent hangs, then the context-banning mechanism
should eventually catch and block all the downstream effects.
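
To make that concrete, the most the scheduler could realistically attempt
from the information it actually has (the shared object and the requested
access modes) is something like the sketch below -- all of these names are
made up, none of this exists in the driver -- and the DEP_UNKNOWN case is
exactly why the conservative default wins:

/* Hypothetical sketch only: classifying a dependent batch by the access
 * modes on the shared object.  "producer" is the hung batch, "consumer"
 * is the batch that depends on it.
 */
enum dep_coupling {
	DEP_UNCOUPLED,	/* case 3: wants the buffer, not the data */
	DEP_LOOSE,	/* case 2: reads the data but result is cosmetic */
	DEP_TIGHT,	/* case 1: computes on the data; GIGO */
	DEP_UNKNOWN,	/* cannot tell cases 1 and 2 apart */
};

static enum dep_coupling classify_dependency(int producer_writes,
					     int consumer_reads,
					     int consumer_writes)
{
	if (!consumer_reads && consumer_writes)
		return DEP_UNCOUPLED;	/* write-only: incoming data irrelevant */
	if (producer_writes && consumer_reads)
		return DEP_UNKNOWN;	/* access mode can't distinguish a compute
					 * consumer from a compositor blit */
	return DEP_LOOSE;
}

/* Hence the conservative policy: never kill the dependants here, and let
 * the existing context-ban mechanism mop up any repeat offenders. */
static int should_kill_dependants(enum dep_coupling c)
{
	(void)c;
	return 0;
}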

.Dave.

