[Intel-gfx] [RFC] GPU reset notification interface

Ben Widawsky ben at bwidawsk.net
Thu Jul 19 05:47:22 CEST 2012


On Wed, 18 Jul 2012 18:55:08 +0200
Daniel Vetter <daniel at ffwll.ch> wrote:

> On Wed, Jul 18, 2012 at 09:23:46AM -0700, Ian Romanick wrote:
> > On 07/18/2012 02:20 AM, Daniel Vetter wrote:
> > >- The "all contexts in a share group need to receive a reset notification"
> > >   wording is irking me a bit because we currently only track all the
> > >   actively used things. So if another context in that share group isn't
> > >   affected (i.e. doesn't even use any of the potentially corrupted bos),
> > >   is the idea that the kernel grows the required tracking, or should
> > >   userspace always ask the reset state for all contexts and do the math
> > >   itself?
> > 
> > There are a couple reasons that all contexts in a share group need
> > to get the reset notification.  Consider an application with two
> > contexts, A and B.  Context A is a worker context that does a bunch
> > of render-to-texture operations, and context B will eventually
> > consume those textures.  If context A receives a reset, context B,
> > even if it hasn't done any rendering in five minutes, has lost data.
> > 
> > The kernel should never have any knowledge about GL share groups.
> > This is where my simplifying assumption (that all contexts in a
> > share group share an address space and an fd) comes in handy.  If
> > context A queries the reset status from the kernel first, it can
> > reach over and poke the reset status of context B (in the gl_context
> > structure).  Likewise, if context B queries the kernel first, it can
> > see that another kernel context in its GL context share group got
> > reset.
> > 
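To make the "reach over and poke" step concrete, here is a minimal sketch of that propagation on the userspace side. The structures and field names are made up for illustration (this is not actual Mesa code), and whether the other contexts report INNOCENT or UNKNOWN is a policy choice:

```c
#include <stdint.h>
#include <GL/gl.h>
#include <GL/glext.h>

struct gl_context;

/* Hypothetical driver-side bookkeeping: every context in a share group
 * shares one address space / fd, so any of them can mark the others. */
struct gl_share_group {
	struct gl_context **contexts;
	int num_contexts;
};

struct gl_context {
	struct gl_share_group *share;
	GLenum reset_status;	/* GL_NO_ERROR or a *_CONTEXT_RESET_ARB value */
};

/* Called once any one context learns from the kernel that a reset hit it. */
static void
propagate_reset_to_share_group(struct gl_context *victim, GLenum status)
{
	struct gl_share_group *share = victim->share;

	for (int i = 0; i < share->num_contexts; i++) {
		struct gl_context *ctx = share->contexts[i];

		if (ctx == victim) {
			ctx->reset_status = status;
		} else if (ctx->reset_status == GL_NO_ERROR) {
			/* Not blamed itself, but shared objects (e.g. the
			 * render-to-texture results above) may be corrupt. */
			ctx->reset_status = GL_INNOCENT_CONTEXT_RESET_ARB;
		}
	}
}
```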
> > >- The above essentially links in with how we assign guilt and figure
> > >   out who's affected. Especially when there's no hw context around, e.g.
> > >   on the blitter or bsd rings, or when an object is used by another
> > >   process and gets corrupted there. Does the spec make any guarantees for
> > >   such a case? Or should we just blame an unknown_reset for all the
> > >   contexts that belong to the fd that submitted the bad batch.
> > 
> > That sounds right.  If we can't assess innocence or guilt, we should
> > say guilt is unknown.  There are some GL commands that generate
> > blits, but I don't think there's anything that can get commands on
> > the BSD ring.  That's just for media, right?
> > 
> > >As an idea for the above two issues, what about the kernel also keeping
> > >per-fd reset statistics, which would also be returned with this ioctl?
> > >We'd then restrict the meaning of the ctx fields to only mean "and the
> > >context was actually active". Userspace could then wrap all the per-fd
> > >hang reports into reset_unknown for arb_robustness, but I think this would
> > >be a bit more flexible for other userspace.
> > 
> > Ah, right.  So the DDX or libva could detect resets that affect
> > them. That's reasonable.
> > 
> > >struct drm_context_reset_counts {
> > >	__u32 ctx_id;
> > >
> > >	/**
> > >	 * Index of the most recent reset where this context was
> > >	 * guilty.  Zero if none.
> > >	 */
> > >	__u32 ctx_guilty;
> > >
> > >	/**
> > >	 * Index of the most recent reset where this context was active,
> > >	 * not guilty.  Zero if none.
> > >	 */
> > >	__u32 ctx_not_guilty;
> > >
> > >	/**
> > >	 * Index of the most recent reset where this context was active,
> > >	 * but guilt was unknown.  Zero if none.
> > >	 */
> > >	__u32 ctx_unknown_guilt;
> > >
> > >	/**
> > >	 * Index of the most recent reset where any batchbuffer submitted
> > >	 * through this fd was guilty.  Zero if none.
> > >	 */
> > >	__u32 fd_guilty;
> > >
> > >	/**
> > >	 * Index of the most recent reset where any batchbuffer submitted
> > >	 * through this fd was not guilty, but affected.  Zero if none.
> > >	 */
> > >	__u32 fd_not_guilty;
> > >
> > >	/**
> > >	 * Index of the most recent reset where any batchbuffer submitted
> > >	 * through this fd was affected, but no guilt for the hang could
> > >	 * be assigned.  Zero if none.
> > >	 */
> > >	__u32 fd_unknown_guilt;
> > 
> > Since these three fields are per-fd, shouldn't they go in the
> > proposed drm_reset_counts structure instead?  If we do that, it
> > might be better to split into two queries.  One for the per-fd
> > information, and one for the detailed per-context information.  If
> > we expect the common case to be no-reset, user space could
> 
> Oh, I've missed a bit that your ioctl would read out the state for all
> contexts on a given fd. My idea was that this is all the data that the
> ioctl returns (and it would only fill out the ctx fields if you spec a
> nonzero context). Userspace would then only need to check whether the
> context has a new reset number somewhere (= context suffered from a gpu
> reset while it was actively used) and then also check whether the fd
> suffered from a reset (= something in that gl share group is potentially
> corrupted), mapping it to either victim_reset or unknown_reset (depending
> upon what exactly the spec wants).
> 
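A rough sketch of that userspace mapping, assuming the struct proposed above and that the driver keeps a copy of the counts it last acted on (everything here is illustrative, not real code):

```c
#include <GL/gl.h>
#include <GL/glext.h>

/* 'cur' is freshly read from the kernel, 'last' is what the driver has
 * already reported to the application; drm_context_reset_counts is the
 * struct proposed earlier in this thread. */
static GLenum
map_reset_status(const struct drm_context_reset_counts *cur,
		 const struct drm_context_reset_counts *last)
{
	/* The context itself was active during a reset. */
	if (cur->ctx_guilty != last->ctx_guilty)
		return GL_GUILTY_CONTEXT_RESET_ARB;
	if (cur->ctx_not_guilty != last->ctx_not_guilty)
		return GL_INNOCENT_CONTEXT_RESET_ARB;
	if (cur->ctx_unknown_guilt != last->ctx_unknown_guilt)
		return GL_UNKNOWN_CONTEXT_RESET_ARB;

	/* The context wasn't active, but something else on this fd was:
	 * objects in the share group are potentially corrupted. */
	if (cur->fd_guilty != last->fd_guilty ||
	    cur->fd_not_guilty != last->fd_not_guilty ||
	    cur->fd_unknown_guilt != last->fd_unknown_guilt)
		return GL_UNKNOWN_CONTEXT_RESET_ARB;

	return GL_NO_ERROR;
}
```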
> Less complexity in userspace, and imo not too much fuss for the kernel to
> do the book-keeping: Instead of only blaming contexts, we also blame
> file_privs. That would be a generally useful feature to e.g. cut off
> certain fds selectively from the gpu because they hang it too fast - i.e.
> to avoid the dreaded global "hanging too fast" when developing drivers ;-)

I don't think it's bad to know the fd, but I think it will help us
overall if we switch to a more abstract context notion and try really
hard to blame that. The best way to do that is to enforce it in the API.
If a gpu client isn't using a context, then they get the shitty answer (I
don't know wtf happened) from this IOCTL. A bit more on it below.
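Purely as a sketch of the book-keeping I mean (none of these structures
exist as written): blame the context whenever we have one, and only fall
back to fd-level "unknown" when we don't:

```c
#include <stdint.h>

/* Made-up types standing in for the driver's hang-recovery state. */
struct fake_reset_counts { uint32_t ctx_guilty, fd_unknown_guilt; };
struct fake_context   { struct fake_reset_counts counts; };
struct fake_file_priv { struct fake_reset_counts counts; };

static void
record_hang(struct fake_context *guilty_ctx,
	    struct fake_file_priv *guilty_file, uint32_t reset_index)
{
	if (guilty_ctx) {
		/* We can pin the hang on a specific hw context. */
		guilty_ctx->counts.ctx_guilty = reset_index;
	} else {
		/* No context was in use (e.g. a blitter/BSD submission):
		 * all we can report is an unknown reset on the fd. */
		guilty_file->counts.fd_unknown_guilt = reset_index;
	}
}
```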

> 
> > >};
> > >
> > >The above could also be important to know when a hw context is toast and
> > >should not be used any more.
> > >
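For reference, the calling convention on the userspace side could look roughly like this; the ioctl request number is hypothetical since nothing has been merged, and only the struct layout comes from this thread:

```c
#include <stdint.h>
#include <string.h>
#include <xf86drm.h>

/* Hypothetical request number for the proposed interface. */
#define DRM_IOCTL_I915_GET_RESET_COUNTS \
	DRM_IOWR(DRM_COMMAND_BASE + 0x2f, struct drm_context_reset_counts)

static int
query_reset_counts(int fd, uint32_t ctx_id,
		   struct drm_context_reset_counts *counts)
{
	memset(counts, 0, sizeof(*counts));
	counts->ctx_id = ctx_id;	/* 0: only the per-fd fields get filled */

	/* drmIoctl() restarts the call on EINTR/EAGAIN. */
	return drmIoctl(fd, DRM_IOCTL_I915_GET_RESET_COUNTS, counts);
}
```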
> > >- I like Chris' suggestion of using our seqno breadcrumbs for this. If you
> > >   also add a small extension to execbuf to return the seqno of the new
> > >   batch, you should also have all the tools in place to implement your
> > >   extension to notify userspace up to which glFence things have completed.
> > >   Note though that returning the seqno in execbuffer is only correct once
> > >   we've eliminated the flushing_list.
> > 
> > It's possible, but I need to finish working out that idea (see
> > below). I think the context only needs one seqno, not one per
> > guiltiness level.  "This is the last seqno that was retired before
> > this context lost some data."
> 
> The idea was to use the existing seqno as the reset counter in your ioctl.
> It might wrap around, but you need to emit quite a few commands to get to
> that point ;-) So we could use that right away.
> 
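Wraparound can be handled the same way the driver already compares seqnos, e.g. (a sketch mirroring the existing i915_seqno_passed() helper):

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "has seqno a passed seqno b?", in the spirit of the driver's
 * i915_seqno_passed(); correct while the two values are < 2^31 apart. */
static inline bool
seqno_passed(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}
```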
> > That may still leave the context in a weird state.  Think about this
> > timeline:
> > 
> >     Application      GPU
> >     draw A
> >                      submit A
> >     draw B
> >                      submit B
> >                      drawing A completes
> >                      reset (B is lost)
> >     draw C
> > 	             submit C
> >                      drawing C completes
> >     query reset
> > 
> > If / when we implement this feature, the kernel may need to drop any
> > work submitted between a reset and an ack of the reset.  Dunno.
> 
> Well, we could simply add a robust_context flag, set at create time. If
> such a context suffers from a gpu reset (while it's active) the kernel
> could refuse to queue up more batches with -EIO. Userspace could then
> inquire the kernel with your ioctl about what's going on. Similarly we
> could add such a flag to the fd, to catch everyone else not using hw
> contexts. There we need a way to clear that flag though, so that you can
> keep on using all the bo textures that survived (whereas hw contexts are
> pretty much worthless on their own after reset, so we could demand that
> they be destroyed).
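For concreteness, the refusal described above could look something like this at submission time; the flag and field names are made up, nothing like this exists in the driver yet:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define CONTEXT_ROBUST	(1 << 0)	/* hypothetical create-time flag */

struct fake_hw_context {
	uint32_t flags;
	bool reset_since_create;	/* set when a reset hit this context */
};

/* Called at the top of execbuffer: robust contexts that were active
 * during a reset get -EIO until userspace queries the reset state and
 * recreates the context. */
static int
check_robust_context(const struct fake_hw_context *ctx)
{
	if ((ctx->flags & CONTEXT_ROBUST) && ctx->reset_since_create)
		return -EIO;
	return 0;
}
```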

First off, I don't think hw contexts are worthless after a reset. The
unaffected contexts would still be plenty valuable. Even the hung
context itself is still valid; just the state from the hung batch won't
be present. I also think you need to stop considering userspace that
isn't using HW contexts. Not because contexts are awesome, but because
there is no downside to using them. (I would have loved to have an
abstract context for all rings, but you shot that down pretty quickly
earlier, even though I think it was originally your idea.) I also think
there is never a case where we want to prevent submissions for an fd
when we can simply prevent submissions for a context. The
chrome/firefox cases you mention below are a perfect example.

In short, I don't understand why we wouldn't want a context over an fd;
it's really not a lot more work to track.

> 
> > >- Last but not least, how often is userspace expected to call this ioctl?
> > >   Around once per batchbuffer submission or way more/less?
> > 
> > I don't know.  This is a fairly new extension, and there are few
> > users.  As far as I'm aware, only Chrome and Firefox use it.  I can
> > find out some details from them.  My guess is somewhere between once
> > per frame and once per draw call.
> 
> Cheers, Daniel

-- 
Ben Widawsky, Intel Open Source Technology Center


