<div dir="ltr">On 11 September 2013 07:50, Mika Kuoppala <span dir="ltr"><<a href="mailto:mika.kuoppala@linux.intel.com" target="_blank">mika.kuoppala@linux.intel.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"> <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">Paul Berry <<a href="mailto:stereotype441@gmail.com">stereotype441@gmail.com</a>> writes:<br> <br> > On 10 September 2013 06:16, Mika Kuoppala <<a href="mailto:mika.kuoppala@linux.intel.com">mika.kuoppala@linux.intel.com</a>>wrote:<br> ><br> >> Current policy is to ban a context if it manages to hang<br> >> the gpu in a certain time window. Paul Berry asked if a<br> >> stricter policy could be available for use cases where<br> >> the application doesn't know if the rendering command stream<br> >> sent to gpu is valid or not.<br> >><br> >> Provide an option, a flag at context creation time, to let<br> >> userspace set a stricter policy for handling gpu hangs for<br> >> this context. If a context with this flag set ever hangs the gpu,<br> >> it will be permanently banned from accessing the GPU.<br> >> All subsequent batch submissions will return -EIO.<br> >><br> >> Requested-by: Paul Berry <<a href="mailto:stereotype441@gmail.com">stereotype441@gmail.com</a>><br> >> Cc: Paul Berry <<a href="mailto:stereotype441@gmail.com">stereotype441@gmail.com</a>><br> >> Cc: Ben Widawsky <<a href="mailto:ben@bwidawsk.net">ben@bwidawsk.net</a>><br> >> Signed-off-by: Mika Kuoppala <<a href="mailto:mika.kuoppala@intel.com">mika.kuoppala@intel.com</a>><br> >><br> ><br> > (Cc-ing Ian since this impacts ARB_robustness, which he's been working on).<br> ><br> > To clarify my reasons for requesting this feature, it's not necessarily for<br> > use cases where the application doesn't know if the rendering command<br> > stream is valid. 
Rather, it's for any case where there is the risk of a<br> > GPU hang (this might happen even if the command stream is valid, for<br> > example because of an infinite loop in a WebGL shader). Since the user<br> > mode application (Mesa in my example) assumes that each batch buffer runs<br> > to completion before the next batch buffer runs, it frequently includes<br> > commands in batch buffer N that rely on state established by commands in<br> > batch buffer N-1. If batch buffer N-1 was interrupted due to a GPU hang,<br> > then some of its state updates may not have completed, resulting in a<br> > sizeable risk that batch buffer N (and a potentially unlimited number of<br> > subsequent batches) will produce a GPU hang also. The only reliable way to<br> > recover from this situation is for Mesa to send a new batch buffer that<br> > sets up the GPU state from scratch rather than relying on state established<br> > in previous batch buffers.<br> <br> </div></div>Thanks for the clarification. I have updated the commit message.<br> <div class="im">><br> > Since Mesa doesn't wait for batch buffer N-1 to complete before submitting<br> > batch buffer N, once a GPU hang occurs the kernel must regard any<br> > subsequent buffers as suspect, until it receives some notification from<br> > Mesa that the next batch is going to set up the GPU state from scratch.<br> > When we met in June, we decided that the notification mechanism would be<br> > for Mesa to stop using the context that caused the GPU hang, and create a<br> > new context. The first batch buffer sent to the new context would (of<br> > necessity) set up the GPU state from scratch. 
 Consequently, all the kernel<br> > needs to do to implement the new policy is to permanently ban any context<br> > involved in a GPU hang.<br> <br> </div>Involved as in guilty of the hang, or should every context that had batches pending be banned?<br> <br> We could also add an I915_CONTEXT_BAN_ON_PENDING flag; with it, all contexts<br> that were affected would get -EIO on the next batch submission after the hang.<br> <div class="im"><br> > Question, since I'm not terribly familiar with the kernel code: is it<br> > possible for the ring buffer to contain batches belonging to multiple<br> > contexts at a time?<br> <br> </div>Yes.<br> <div class="im"><br> > If so, then what happens if a GPU hang occurs? For<br> > instance, let's say that the ring contains batch A1 from context A followed<br> > by batch B2 from context B. What happens if a GPU hang occurs while<br> > executing batch A1? Ideally the kernel would consider only context A to<br> > have been involved in the GPU hang, and automatically re-submit batch B2 so<br> > that context B is not affected by the hang. Less ideally, but still ok,<br> > would be for the kernel to consider both contexts A and B to be involved in<br> > the GPU hang, and apply both contexts' banning policies. If, however, the<br> > kernel considered only context A to be involved in the GPU hang, but failed<br> > to re-submit batch B2, then that would risk future GPU hangs from context<br> > B, since a future batch B3 from context B would likely rely on state that<br> > should have been established by batch B2.<br> ><br> <br> </div>This patch will only ban the offending context. Other contexts<br> will lose the batches that were pending as the request queue will be<br> cleared on reset following the hang. 
 As things are now, the kernel won't<br> re-submit anything by itself.<br></blockquote><div><br></div><div>Thanks for the clarification.<br><br></div><div>The important thing from Mesa's point of view is to make sure that batch N submitted to context C will only be executed if batch N-1 has run to completion. We would like this invariant to hold even if other contexts cause GPU hangs. Under the current state of affairs, where a hang on context A can cause a batch belonging to context B to be lost, we would need the I915_CONTEXT_BAN_ON_PENDING flag in order to achieve that invariant. But if the kernel were ever changed so that it automatically re-submitted pending batches upon recovery from a GPU hang* (a change I would advocate), then we wouldn't need the I915_CONTEXT_BAN_ON_PENDING flag anymore, and in fact setting it would be counterproductive.<br> <br></div><div>(*Of course, in order to avoid cascading GPU hangs, the kernel should only re-submit pending batches from contexts other than the offending context)<br><br></div><div>So I would be in favor of adding an I915_CONTEXT_BAN_ON_PENDING flag, but I'd suggest renaming it to something like I915_CONTEXT_BAN_ON_BATCH_LOSS. That way, if we later add the ability for the kernel to re-submit pending batches upon recovery from a GPU hang, then it will be clear that I915_CONTEXT_BAN_ON_BATCH_LOSS doesn't apply to the contexts that had their batches automatically re-submitted. <br> </div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br> I have also been working on an ioctl (get_reset_stats, for arb robustness<br> extension) which allows an application to sort out which contexts were<br> affected by a hang. 
 Here is the planned ioctl for arb robustness<br> extension:<br> <a href="https://github.com/mkuoppal/linux/commit/698a413472edaec78852b8ca9849961cbdc40d78" target="_blank">https://github.com/mkuoppal/linux/commit/698a413472edaec78852b8ca9849961cbdc40d78</a><br> <br> This then allows applications to detect which contexts need to resubmit<br> their state, and also reports whether the context had a batch<br> active or pending when the GPU hang happened.<br></blockquote><div><br></div><div>That ioctl seems reasonable to me. My only comment is that we might want to consider renaming the "batch_pending" field in drm_i915_reset_stats to "batch_loss", for reasons similar to those I stated above.</div> </div></div></div>
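<div>To make the proposed semantics concrete, here is a minimal sketch of the banning rules discussed above, written as a standalone model and not as i915 code. Every name in it (ctx_model, ctx_role, ban_policy, ctx_note_reset, ctx_submit, MODEL_EIO) is hypothetical, and it assumes that the batch-loss policy also implies banning the guilty context. The existing time-window policy is not modelled.</div>

```c
/* Hypothetical model of the policies in this thread; not the i915 implementation. */

#define MODEL_EIO 5 /* assumption: stands in for EIO, which is 5 on Linux */

/* What a context was doing when the GPU hung. */
enum ctx_role {
    CTX_IDLE,    /* nothing queued at the time of the hang */
    CTX_PENDING, /* had batches queued behind the hanging batch */
    CTX_GUILTY   /* its batch was executing when the hang occurred */
};

enum ban_policy {
    BAN_DEFAULT,      /* existing time-window policy, not modelled here */
    BAN_ON_HANG,      /* ban permanently if this context hung the GPU */
    BAN_ON_BATCH_LOSS /* additionally ban if a queued batch was dropped */
};

struct ctx_model {
    enum ban_policy policy;
    int banned;
};

/*
 * Record the outcome of one GPU reset for a context.
 * batches_resubmitted is nonzero only for a hypothetical future kernel
 * that replays the pending batches of innocent contexts after the reset.
 */
static void ctx_note_reset(struct ctx_model *ctx, enum ctx_role role,
                           int batches_resubmitted)
{
    if (role == CTX_GUILTY) {
        if (ctx->policy != BAN_DEFAULT)
            ctx->banned = 1;
        return;
    }
    if (role == CTX_PENDING) {
        if (ctx->policy == BAN_ON_BATCH_LOSS) {
            /* Only a batch that was really lost breaks the N-1/N invariant. */
            if (!batches_resubmitted)
                ctx->banned = 1;
        }
    }
}

/* Submitting to a banned context fails; otherwise the batch is accepted. */
static int ctx_submit(const struct ctx_model *ctx)
{
    return ctx->banned ? -MODEL_EIO : 0;
}
```

<div>In this model a context with the batch-loss policy that was merely pending is banned only when its batch was dropped, and is left alone if the batch was replayed. That is the distinction the BAN_ON_BATCH_LOSS name is meant to capture.</div>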