<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<body>
<div dir="auto">
<div dir="auto"><span style="font-size: 12pt;">On May 19, 2021 12:16:15 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:</span></div><div id="aqm-original" style="color: black;">
<div><br></div>
<blockquote type="cite" class="gmail_quote" style="margin: 0 0 0 0.75ex; border-left: 1px solid #808080; padding-left: 0.75ex;">
<div dir="auto">On Wed, May 19, 2021 at 5:06 PM Jason Ekstrand <jason@jlekstrand.net> wrote:</div>
<blockquote type="cite" class="gmail_quote" style="margin: 0 0 0 0.75ex; border-left: 1px solid #0099CC; padding-left: 0.75ex;">
<div dir="auto"><br></div>
<div dir="auto">Once we no longer rely on error propagation, I think there's a lot we</div>
<div dir="auto">can rip out.</div>
</blockquote>
<div dir="auto"><br></div>
<div dir="auto">I honestly did not find that much ... what did you uncover?</div></blockquote></div><div dir="auto"><br></div><div dir="auto">When I was digging through this earlier today, I think I convinced myself that the cmdparser is the only user of the entire fence error architecture. I may have missed something, though.</div><div dir="auto"><br></div><div dir="auto">--Jason</div><div dir="auto"><br></div><div id="aqm-original" style="color: black;" dir="auto"><blockquote type="cite" class="gmail_quote" style="margin: 0 0 0 0.75ex; border-left: 1px solid #808080; padding-left: 0.75ex;"><div dir="auto"></div>
<div dir="auto">-Daniel</div>
<div dir="auto"><br></div>
<blockquote type="cite" class="gmail_quote" style="margin: 0 0 0 0.75ex; border-left: 1px solid #0099CC; padding-left: 0.75ex;">
<div dir="auto"><br></div>
<div dir="auto">--Jason</div>
<div dir="auto"><br></div>
<div dir="auto">On Wed, May 19, 2021 at 5:15 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:</div>
<blockquote type="cite" class="gmail_quote" style="margin: 0 0 0 0.75ex; border-left: 1px solid #9933CC; padding-left: 0.75ex;">
<div dir="auto"><br></div>
<div dir="auto">From: Jason Ekstrand <jason@jlekstrand.net></div>
<div dir="auto"><br></div>
<div dir="auto">This reverts commit 9e31c1fe45d555a948ff66f1f0e3fe1f83ca63f7. Ever</div>
<div dir="auto">since that commit, we've been having issues where a hang in one client</div>
<div dir="auto">can propagate to another. In particular, a hang in an app can propagate</div>
<div dir="auto">to the X server which causes the whole desktop to lock up.</div>
<div dir="auto"><br></div>
<div dir="auto">Error propagation along fences sound like a good idea, but as your bug</div>
<div dir="auto">shows, surprising consequences, since propagating errors across security</div>
<div dir="auto">boundaries is not a good thing.</div>
<div dir="auto"><br></div>
<div dir="auto">What we do have is track the hangs on the ctx, and report information to</div>
<div dir="auto">userspace using RESET_STATS. That's how arb_robustness works. Also, if my</div>
<div dir="auto">understanding is still correct, the EIO from execbuf is when your context</div>
<div dir="auto">is banned (because not recoverable or too many hangs). And in all these</div>
<div dir="auto">cases it's up to userspace to figure out what is all impacted and should</div>
<div dir="auto">be reported to the application, that's not on the kernel to guess and</div>
<div dir="auto">automatically propagate.</div>
<div dir="auto"><br></div>
<div dir="auto">What's more, we're also building more features on top of ctx error</div>
<div dir="auto">reporting with RESET_STATS ioctl: Encrypted buffers use the same, and the</div>
<div dir="auto">userspace fence wait also relies on that mechanism. So it is the path</div>
<div dir="auto">going forward for reporting gpu hangs and resets to userspace.</div>
<div dir="auto"><br></div>
<div dir="auto">So all together that's why I think we should just bury this idea again as</div>
<div dir="auto">not quite the direction we want to go to, hence why I think the revert is</div>
<div dir="auto">the right option here.Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com></div>
<div dir="auto"><br></div>
<div dir="auto">v2: Augment commit message. Also restore Jason's sob that I</div>
<div dir="auto">accidentally lost.</div>
<div dir="auto"><br></div>
<div dir="auto">Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com> (v1)</div>
<div dir="auto">Reported-by: Marcin Slusarz <marcin.slusarz@intel.com></div>
<div dir="auto">Cc: <stable@vger.kernel.org> # v5.6+</div>
<div dir="auto">Cc: Jason Ekstrand <jason.ekstrand@intel.com></div>
<div dir="auto">Cc: Marcin Slusarz <marcin.slusarz@intel.com></div>
<div dir="auto">Cc: Jon Bloomfield <jon.bloomfield@intel.com></div>
<div dir="auto">Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3080</div>
<div dir="auto">Fixes: 9e31c1fe45d5 ("drm/i915: Propagate errors on awaiting already signaled fences")</div>
<div dir="auto">Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch></div>
<div dir="auto">---</div>
<div dir="auto">drivers/gpu/drm/i915/i915_request.c | 8 ++------</div>
<div dir="auto">1 file changed, 2 insertions(+), 6 deletions(-)</div>
<div dir="auto"><br></div>
<div dir="auto">diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c</div>
<div dir="auto">index 970d8f4986bb..b796197c0772 100644</div>
<div dir="auto">--- a/drivers/gpu/drm/i915/i915_request.c</div>
<div dir="auto">+++ b/drivers/gpu/drm/i915/i915_request.c</div>
<div dir="auto">@@ -1426,10 +1426,8 @@ i915_request_await_execution(struct i915_request *rq,</div>
<div dir="auto"><br></div>
<div dir="auto"> do {</div>
<div dir="auto"> fence = *child++;</div>
<div dir="auto">- if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {</div>
<div dir="auto">- i915_sw_fence_set_error_once(&rq->submit, fence->error);</div>
<div dir="auto">+ if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))</div>
<div dir="auto"> continue;</div>
<div dir="auto">- }</div>
<div dir="auto"><br></div>
<div dir="auto"> if (fence->context == rq->fence.context)</div>
<div dir="auto"> continue;</div>
<div dir="auto">@@ -1527,10 +1525,8 @@ i915_request_await_dma_fence(struct i915_request *rq, struct dma_fence *fence)</div>
<div dir="auto"><br></div>
<div dir="auto"> do {</div>
<div dir="auto"> fence = *child++;</div>
<div dir="auto">- if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {</div>
<div dir="auto">- i915_sw_fence_set_error_once(&rq->submit, fence->error);</div>
<div dir="auto">+ if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))</div>
<div dir="auto"> continue;</div>
<div dir="auto">- }</div>
<div dir="auto"><br></div>
<div dir="auto"> /*</div>
<div dir="auto"> * Requests on the same timeline are explicitly ordered, along</div>
<div dir="auto">--</div>
<div dir="auto">2.31.0</div>
<div dir="auto"><br></div>
</blockquote>
</blockquote>
<div dir="auto"><br></div>
<div dir="auto"><br></div>
<div dir="auto"><br></div>
<div dir="auto">-- </div>
<div dir="auto">Daniel Vetter</div>
<div dir="auto">Software Engineer, Intel Corporation</div>
<div dir="auto">http://blog.ffwll.ch</div>
</blockquote>
</div><div dir="auto"><br></div>
</div></body>
</html>