[PATCH] drm/radeon: Avoid double gpu reset by adding a timeout on IB ring tests.
matthew at mjdsystems.ca
Mon Feb 8 14:51:03 UTC 2016
On Wednesday, February 3, 2016 1:24:49 PM EST Christian König wrote:
> Am 31.01.2016 um 19:50 schrieb Matthew Dawson:
> > On Sunday, January 24, 2016 10:49:00 AM EST Christian König wrote:
> >> Am 24.01.2016 um 07:18 schrieb Matthew Dawson:
> >>> When the radeon driver resets a gpu, it attempts to test whether all the
> >>> rings can successfully handle an IB. If these rings fail to respond,
> >>> the
> >>> process will wait forever. Another gpu reset can't happen at this
> >>> point,
> >>> as the current reset holds a lock required to do so. Instead, make all
> >>> the IB tests run with a timeout, so the system can attempt to recover
> >>> in this case.
> >>> While this doesn't fix the underlying issue with card resets failing, it
> >>> gives the system a higher chance of recovering. These timeouts have
> >>> been
> >>> confirmed to help both a Tathi and Hawaii card recover after a gpu
> >>> reset.
> >>> I haven't been able to test this on other cards, but everything
> >>> compiles.
> >>> I wasn't sure what to use for a timeout value, so for now I've used
> >>> radeon_lockup_timeout. When that is 0, the timeout functionality in
> >>> this
> >>> patch is also disabled.
> >>> This also adds a new function, radeon_fence_wait_timeout, that behaves
> >>> like
> >>> radeon_fence_wait except allowing a different timeout value to be passed
> >>> to
> >>> radeon_fence_wait_seq_timeout.
> >>> Signed-off-by: Matthew Dawson <matthew at mjdsystems.ca>
> >> Really good idea, but please don't use radeon_lockup_timeout for this
> >> cause the 10 seconds default of that is way to long. Instead just use a
> >> fixed, let's say 100ms timeout defined in radeon.h.
> > I originally tried 100ms, but I found that to be too short (my system
> > would
> > just lockup completely (no network even)). I found 1s is long enough, and
> > isn't so long to be annoying. Would that be ok?
> Yeah, 1s works as well. But that 100ms shouldn't be enough sounds like
> another bug to me.
Probably, but it has a tendency to crash my entire system, so I'm not sure how
to debug it further. Netconsole maybe? I probably won't be able to dig into
this due to time since 1s does work.
> >> Also don't define our own radeon_fence_wait_timeout, just use
> >> fence_wait_timeout(&fence->base, RADEON_IB_TEST_TIMEOUT) for this.
> > I switched everything over to this, however the ib test always times out
> > after a gpu reset (on startup, everything is normal). I'm not sure why
> > fence_wait_timeout isn't being signaled while
> > radeon_fence_wait_seq_timeout
> > is. I'm suspicious that either radeon_fence_enable_signaling is not doing
> > something necessary to get the fence completion to actual signal
> > radeon_fence_default_wait, or because we hold the exclusive lock during a
> > reset radeon_fence_check_lockup never runs and thus never triggers the
> > fence or enables some irq lines.
> > Is it worth digging through the interactions here to get dma-buf fences to
> > work correctly during a lockup, or would it be better to just keep
> > radeon_fence_wait_timeout? If option 1 is preferred, I'm happy to learn
> > but I need some help learning how it is supposed to work.
> In this case feel free to add the radeon_fence_wait_timeout. It's
> probably a good idea to get rid of the exclusive lock sooner or later in
> radeon as well, but that really is a long term project.
> If you're really bored I can give you tons of input where to start with
That sounds like a good project to learn more about the driver, so I'd be
interested. I don't have a lot of time right now to play with it though. I
do want to finish off the tf2 crasher first as well. Once I have time I'll
poke the list to get started.
Thanks for the reviews!
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 819 bytes
Desc: This is a digitally signed message part.
More information about the dri-devel