[Intel-gfx] [PATCH 09/13] drm/i915: add reset_state for hw_contexts

Thu Feb 28 02:15:08 CET 2013

On 02/27/2013 01:13 AM, Chris Wilson wrote:
> On Tue, Feb 26, 2013 at 05:47:12PM -0800, Ian Romanick wrote:
>> On 02/26/2013 03:05 AM, Mika Kuoppala wrote:
>>> For arb-robustness, every context needs to have it's own
>>> reset state tracking. Default context will be handled in a identical
>>> way as the no-context case in further down in the patch set.
>>> For no-context case, the reset state will be stored in
>>> the file_priv part.
>>>
>>> v2: handle default context inside get_reset_state
>>
>> This isn't the interface we want.  I already sent you the patches
>> for Mesa, and you seem to have completely ignored them.  Moreover,
>> this interface provides no mechanism to query for its existence
>> (other than relying on the kernel version), and no method to
>> deprecate it.
>
> It's existence, and the existence of any successor, is the only test you
> need to check for the interface. Testing the flag field up front also
> lets you know if individual flags are known prior to use.
>
>> Based on e-mail discussions, I think danvet agrees with me here.
>>
>> Putting guilty / innocent counting in kernel puts policy decisions
>> in the kernel that belong with the user space API that implements
>> them. Putting these choices in the kernel significantly decreases
>> how "future proof" the interface is.  Since any kernel/user
>> interface has to be kept forever, this is just asking for
>> maintenance trouble down the road.
>
> I think you have the policy standpoint reversed.

Can you elaborate?  I've been out of the kernel loop for a long time, 
but I always thought policy and mechanism were supposed to be separated. 
  In the case of drivers with a user-mode component, the mechanism was 
in the kernel (where else could it be?), and policy decisions were in 
user-mode.  This is often at odds with keeping the interfaces between 
kernel and user thin, so that may be where my misunderstanding is.

>> Also, it's difficult (for me) to tell from the rest of the series
>> whether or not a context that was not affected by a reset (had no
>> batches in flight at all) can still observe that a reset occurred.
>
> Yes, the context can observe that the global reset increased, its did
> not.

See below for a couple reasons why I think this may be a mistake.

>>  From the GL point of view, once a context has observed a reset, it's
>> garbage.  Knowing that it saw 1 reset or 43,000,000 resets is the
>> same.  Since it's a count, GL will have to query the values before
>> any rendering happens or it won't know whether the value 6 means
>> there was a reset or not.
>>
>> The right interface is a set of flags indicating the state the
>> context was in when it observed a reset.  As this patch series
>> stands, Mesa will not use this interface.
>
> This interface conforms to your specifications and interface that you
> last posted to the list and were revised on list. There are two
> substantive differences; there is *less* policy in the kernel than
> before, and the breadcrumb of last reset state was overlooked.
>
> In order to make the reset status work in a multi-threaded environment
> the kernel exists in, we have to assume that there will be multiple
> actors on the reset status, and they will be comparing the state that
> they are interested in. The set of flags that GL understands is
> deducible from this interface.

We had some discussion on the list about a proposed interface last year. 
  We had a really good discussion, and both you and Daniel provided some 
really valuable feedback.  The biggest piece of feedback that both of 
you provided was to simplify.  I took that advice to heart.  The 
original interface was clunky and over complicated.

If memory serves, that discussion was on the order of 8 to 10 months 
ago.  Quite a lot has happened since then.  The biggest thing that 
happened is I started implementing the interface we discussed in the 
kernel, libdrm, and Mesa.  In that process I found a number of problems 
with even the simplified interface.  Back in September I used this new 
data to simplify the interface even further.  This allowed both the
kernel and the user-mode code to be much less complex.

Right about that same time the whole Mesa team got super busy.  First we 
had OpenGL 3.1, then we had OpenGL ES 3.0.  During that time all of my 
ARB_robustness work languished.  I didn't want to send another round of 
not-fully-baked ideas (or patches) to the list, so they just sat.  And 
sat.  And sat. :(

At FOSDEM earlier this month, I discussed this with Jesse.  I told him 
that there was approximately epsilon probability that I would get back 
to this work.  As a result, I was going to have to throw it over the 
wall to the kernel team.  At no point did he mention that anyone was 
already working on this.

About a week ago Daniel pulled me into a discussion with Mika about the 
ARB_robustness work.  I dug out my old code, paged back in the memory of 
the work from off-line storage, and provided a bunch of feedback to Mika 
and Daniel.  After I sent around my explanation of why I abandoned the 
counting interface and circulated my work, there was no response from 
Mika or anyone on that team.  I had a brief discussion with Daniel on 
IRC, and he seemed to agree with my sentiments and like the direction of 
my code.

Then this patch series appeared on the list, and here we are.

There are two requirements in the ARB_robustness extension (and 
additional layered extensions) that I believe cause problems with all of 
the proposed counting interfaces.

1. GL contexts that are not affected by a reset (i.e., didn't lose any 
objects, didn't lose any rendering, etc.) should not observe it.  This 
is the big one that all the browser vendors want.  If you have 5 tabs 
open, a reset caused by a malicious WebGL app in one tab shouldn't, if 
possible, force them to rebuild all of their state for all the other tabs.

2. After a GL context observes a reset, it is garbage.  It must be 
destroyed and a new one created.  This means we only need to know that a 
reset of any variety affected the context.

In addition, I'm concerned about having one GL context be able to 
observe, in any manner, resets that did not affect it.  Given that 
people have figured out how to get encryption keys by observing cache 
timings, it's not impossible for a malicious program to use information 
about reset counts to do something.  Leaving a potential gap like this 
in an extension that's used to improved security seems... like we're 
just asking for a Phoronix article.  We don't have any requirement to 
expose this information, so I don't think we should.

We also don't have any requirement to expose this functionality on 
anything pre-Sandy Bridge.  We may want, at some point, to expose it on 
Ironlake.  Based on that, it seems reasonable to tie the availability of 
reset notifications to hardware contexts.  Since we won't test it, Mesa 
certainly isn't going to expose it on 8xx or 915.

Based on this, a simplified interface became obvious:  for each hardware 
context, track a mask of kinds of resets that have affected that 
context.  I defined three bits in the mask:

1. The hw context had a batch executing when the reset occurred.

2. The hw context had a batch pending when the reset occurred. 
Hopefully in the future we could make most occurrences of this go away 
by re-queuing the batch.  It may also be possible to do this in 
user-mode, but that may be more difficult.

3. "Other."

There's also the potential to add other bits to communicate information 
about lost buffers, textures, etc.  This could be used to create a 
layered extension that lets applications transfer data from the dead 
context to the new context.  Browsers may not be too interested in this, 
but I could imagine compositors or other kinds of applications 
benefiting.  We may never implement that, but it's easier to communicate 
this through a set of flags than through a set of counts.

The guilty / innocent counts in this patch series lose information.  We 
can probably reconstruct the current flags from the counts, but we would 
have to do some refactoring (or additional counting) to get other flags 
in the future.  All of this would make the kernel code more complex. 
Why reconstruct the data you want when you could just track that data in 
the first place?

Since the functionality in not available on all hardware, I also added a 
query to determine the availability.  This has the added benefit that, 
should this interface prove to be insufficient in the future, we could 
deprecate it by having the query always return false.  All software 
involved has to be able the handle the case where the interface is not 
available, so I don't think this should run awry of the rules about 
kernel/user interface breakage.  Please correct me if I'm wrong.

The Mesa patches are in my robustness2 branch:

http://cgit.freedesktop.org/~idr/mesa/log/?h=robustness2

As Mika pointed out, the patch that makes use of the new kernel 
interface is:

http://cgit.freedesktop.org/~idr/mesa/commit/?h=robustness2&id=ef97f4092c703afd49b3516b15dfbfcd27d4bc97

In addition,

http://cgit.freedesktop.org/~idr/mesa/commit/?h=robustness2&id=6f73cbeecb90b0a08b0f016e19a16f6f50a1a99c

changes to brw_context.c to use intel_get_boolean to determine the 
availability of the kernel query.  I'd really like to only expose 
__DRI2_ROBUSTNESS when the kernel query exists, but I don't see a way to 
do that with the DRI extension mechanism.

The libdrm patches are in master of:

http://cgit.freedesktop.org/~idr/drm

I had attacked the problem from the opposite end from Mika.  My initial 
implementation always reported "other" because a lot of other work 
needed to happen to differentiate the other cases.  It's good to see 
that work progressing.  My really, really, really incomplete kernel 
patch is at:

http://people.freedesktop.org/~idr/robust.patch

This patch was against some kernel from early September 2012.  It may 
not apply or build at this point.  I believe I was able to observe a 
reset by having an app run a shader with an infinite loop and poll 
glGetGraphicsResetStatusARB.  I couldn't find that code, so I may be 
misremembering.  With the exception of the I915_PARAM_HAS_RESET_QUERY 
handling, nearly all of this patch is probably useless at this point.