[Intel-gfx] [PATCH 09/13] drm/i915: add reset_state for hw_contexts
Ian Romanick
idr at freedesktop.org
Thu Feb 28 02:15:08 CET 2013
On 02/27/2013 01:13 AM, Chris Wilson wrote:
> On Tue, Feb 26, 2013 at 05:47:12PM -0800, Ian Romanick wrote:
>> On 02/26/2013 03:05 AM, Mika Kuoppala wrote:
>>> For arb-robustness, every context needs to have it's own
>>> reset state tracking. Default context will be handled in a identical
>>> way as the no-context case in further down in the patch set.
>>> For no-context case, the reset state will be stored in
>>> the file_priv part.
>>>
>>> v2: handle default context inside get_reset_state
>>
>> This isn't the interface we want. I already sent you the patches
>> for Mesa, and you seem to have completely ignored them. Moreover,
>> this interface provides no mechanism to query for its existence
>> (other than relying on the kernel version), and no method to
>> deprecate it.
>
> It's existence, and the existence of any successor, is the only test you
> need to check for the interface. Testing the flag field up front also
> lets you know if individual flags are known prior to use.
>
>> Based on e-mail discussions, I think danvet agrees with me here.
>>
>> Putting guilty / innocent counting in kernel puts policy decisions
>> in the kernel that belong with the user space API that implements
>> them. Putting these choices in the kernel significantly decreases
>> how "future proof" the interface is. Since any kernel/user
>> interface has to be kept forever, this is just asking for
>> maintenance trouble down the road.
>
> I think you have the policy standpoint reversed.
Can you elaborate? I've been out of the kernel loop for a long time,
but I always thought policy and mechanism were supposed to be separated.
In the case of drivers with a user-mode component, the mechanism was
in the kernel (where else could it be?), and policy decisions were in
user-mode. This is often at odds with keeping the interfaces between
kernel and user thin, so that may be where my misunderstanding is.
>> Also, it's difficult (for me) to tell from the rest of the series
>> whether or not a context that was not affected by a reset (had no
>> batches in flight at all) can still observe that a reset occurred.
>
> Yes, the context can observe that the global reset increased, its did
> not.
See below for a couple reasons why I think this may be a mistake.
>> From the GL point of view, once a context has observed a reset, it's
>> garbage. Knowing that it saw 1 reset or 43,000,000 resets is the
>> same. Since it's a count, GL will have to query the values before
>> any rendering happens or it won't know whether the value 6 means
>> there was a reset or not.
>>
>> The right interface is a set of flags indicating the state the
>> context was in when it observed a reset. As this patch series
>> stands, Mesa will not use this interface.
>
> This interface conforms to your specifications and interface that you
> last posted to the list and were revised on list. There are two
> substantive differences; there is *less* policy in the kernel than
> before, and the breadcrumb of last reset state was overlooked.
>
> In order to make the reset status work in a multi-threaded environment
> the kernel exists in, we have to assume that there will be multiple
> actors on the reset status, and they will be comparing the state that
> they are interested in. The set of flags that GL understands is
> deducible from this interface.
We had some discussion on the list about a proposed interface last year.
We had a really good discussion, and both you and Daniel provided some
really valuable feedback. The biggest piece of feedback that both of
you provided was to simplify. I took that advice to heart. The
original interface was clunky and over complicated.
If memory serves, that discussion was on the order of 8 to 10 months
ago. Quite a lot has happened since then. The biggest thing that
happened is I started implementing the interface we discussed in the
kernel, libdrm, and Mesa. In that process I found a number of problems
with even the simplified interface. Back in September I used this new
data to simplify the interface even further. This allowed both the
kernel and the user-mode code to be much less complex.
Right about that same time the whole Mesa team got super busy. First we
had OpenGL 3.1, then we had OpenGL ES 3.0. During that time all of my
ARB_robustness work languished. I didn't want to send another round of
not-fully-baked ideas (or patches) to the list, so they just sat. And
sat. And sat. :(
At FOSDEM earlier this month, I discussed this with Jesse. I told him
that there was approximately epsilon probability that I would get back
to this work. As a result, I was going to have to throw it over the
wall to the kernel team. At no point did he mention that anyone was
already working on this.
About a week ago Daniel pulled me into a discussion with Mika about the
ARB_robustness work. I dug out my old code, paged back in the memory of
the work from off-line storage, and provided a bunch of feedback to Mika
and Daniel. After I sent around my explanation of why I abandoned the
counting interface and circulated my work, there was no response from
Mika or anyone on that team. I had a brief discussion with Daniel on
IRC, and he seemed to agree with my sentiments and like the direction of
my code.
Then this patch series appeared on the list, and here we are.
There are two requirements in the ARB_robustness extension (and
additional layered extensions) that I believe cause problems with all of
the proposed counting interfaces.
1. GL contexts that are not affected by a reset (i.e., didn't lose any
objects, didn't lose any rendering, etc.) should not observe it. This
is the big one that all the browser vendors want. If you have 5 tabs
open, a reset caused by a malicious WebGL app in one tab shouldn't, if
possible, force them to rebuild all of their state for all the other tabs.
2. After a GL context observes a reset, it is garbage. It must be
destroyed and a new one created. This means we only need to know that a
reset of any variety affected the context.
In addition, I'm concerned about having one GL context be able to
observe, in any manner, resets that did not affect it. Given that
people have figured out how to get encryption keys by observing cache
timings, it's not impossible for a malicious program to use information
about reset counts to do something. Leaving a potential gap like this
in an extension that's used to improved security seems... like we're
just asking for a Phoronix article. We don't have any requirement to
expose this information, so I don't think we should.
We also don't have any requirement to expose this functionality on
anything pre-Sandy Bridge. We may want, at some point, to expose it on
Ironlake. Based on that, it seems reasonable to tie the availability of
reset notifications to hardware contexts. Since we won't test it, Mesa
certainly isn't going to expose it on 8xx or 915.
Based on this, a simplified interface became obvious: for each hardware
context, track a mask of kinds of resets that have affected that
context. I defined three bits in the mask:
1. The hw context had a batch executing when the reset occurred.
2. The hw context had a batch pending when the reset occurred.
Hopefully in the future we could make most occurrences of this go away
by re-queuing the batch. It may also be possible to do this in
user-mode, but that may be more difficult.
3. "Other."
There's also the potential to add other bits to communicate information
about lost buffers, textures, etc. This could be used to create a
layered extension that lets applications transfer data from the dead
context to the new context. Browsers may not be too interested in this,
but I could imagine compositors or other kinds of applications
benefiting. We may never implement that, but it's easier to communicate
this through a set of flags than through a set of counts.
The guilty / innocent counts in this patch series lose information. We
can probably reconstruct the current flags from the counts, but we would
have to do some refactoring (or additional counting) to get other flags
in the future. All of this would make the kernel code more complex.
Why reconstruct the data you want when you could just track that data in
the first place?
Since the functionality in not available on all hardware, I also added a
query to determine the availability. This has the added benefit that,
should this interface prove to be insufficient in the future, we could
deprecate it by having the query always return false. All software
involved has to be able the handle the case where the interface is not
available, so I don't think this should run awry of the rules about
kernel/user interface breakage. Please correct me if I'm wrong.
The Mesa patches are in my robustness2 branch:
http://cgit.freedesktop.org/~idr/mesa/log/?h=robustness2
As Mika pointed out, the patch that makes use of the new kernel
interface is:
http://cgit.freedesktop.org/~idr/mesa/commit/?h=robustness2&id=ef97f4092c703afd49b3516b15dfbfcd27d4bc97
In addition,
http://cgit.freedesktop.org/~idr/mesa/commit/?h=robustness2&id=6f73cbeecb90b0a08b0f016e19a16f6f50a1a99c
changes to brw_context.c to use intel_get_boolean to determine the
availability of the kernel query. I'd really like to only expose
__DRI2_ROBUSTNESS when the kernel query exists, but I don't see a way to
do that with the DRI extension mechanism.
The libdrm patches are in master of:
http://cgit.freedesktop.org/~idr/drm
I had attacked the problem from the opposite end from Mika. My initial
implementation always reported "other" because a lot of other work
needed to happen to differentiate the other cases. It's good to see
that work progressing. My really, really, really incomplete kernel
patch is at:
http://people.freedesktop.org/~idr/robust.patch
This patch was against some kernel from early September 2012. It may
not apply or build at this point. I believe I was able to observe a
reset by having an app run a shader with an infinite loop and poll
glGetGraphicsResetStatusARB. I couldn't find that code, so I may be
misremembering. With the exception of the I915_PARAM_HAS_RESET_QUERY
handling, nearly all of this patch is probably useless at this point.
More information about the Intel-gfx
mailing list