Fixing system lockup after GPU reset on SI (Radeon HD 7970 GHz edition)
Matthew Dawson
matthew at mjdsystems.ca
Sat Jan 16 23:10:06 PST 2016
Hi all,
I'm trying to work through this bug:
https://bugs.freedesktop.org/show_bug.cgi?id=93649 . The main symptom is that
the graphics lock up: some process tries to reset the gpu while a reset is
already in progress, and the two deadlock. The system still works over ssh,
just the graphics get stuck.
I'm trying to fix the kernel side of this first, so my gpu can reliably reset
when the game triggers the gpu lockup, after which I'll try tracking down the
mesa issue which causes the lockup in the first place. I've done some
preliminary investigation, but I'm running out of ideas, as public
documentation on some of the AMD hardware is currently not available.
As far as I can tell, when the radeon module tries to reset the GPU it will
always fail to bring up the VCE (which I haven't looked at yet, as it doesn't
seem to be involved with this issue) and the UVD. The VCE failure is caught
early, and so the kernel module just ignores the whole thing. However, the
UVD claims to initialize properly. But when the kernel module tries to run a
test IB on the UVD ring, it stalls forever. Note: before any issues, the UVD
works on my GPU, tested with a random media file and vlc.
I poked IRC some time ago, where Dave Airlie suggested that UVD is really
unhappy with being reset, and to try disabling that as a test. Nothing I
tried yielded any improvement. I also noticed that the SMC (I assume that is
some sort of power manager? I didn't find anything on it besides the source
code) fails to initialize after a reset, with the error:
[drm:si_dpm_set_power_state [radeon]] *ERROR* si_set_sw_state failed
I'm wondering if this might be causing the issue instead, as the source code
fiddles with the UVD after this error. Not knowing more, I can't say for
sure.
Details on testing done:
For the UVD, I tried forcing it to be completely reset by setting the
appropriate bit in SRBM_SOFT_RESET, but that still caused the failure to
happen in the same place.
Based on the advice from IRC, I tried disabling large parts of the UVD startup
and shutdown code, to avoid disabling any part of the UVD. Some of the
initialization code also disables parts of the UVD, which is why I disabled
that too. There was no change. Note that the initial start was never changed,
and vlc was always able to play a video using it.
Suspecting the SMC, I captured the return code from the message sent in
si_set_sw_state. It always returns 0x0, which doesn't have a name in the
source code. I guess this means a timeout, from looking at the code. I have
no idea where to look further, as I couldn't find any documentation. If there
is any I missed, I'd be happy to take a look and see what is going on. I also
captured traces of every command sent to the SMC, if that would help. I
haven't checked them much, other than to note they differ from those at boot.
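To illustrate why a return value of 0x0 reads like a timeout, here is a
minimal user-space sketch. It assumes the send routine polls a response
register until it becomes non-zero and then returns whatever it last read.
The register is simulated with a callback, and every name here
(fake_send_msg, RESULT_OK, RESULT_FAILED, the poll budget) is a hypothetical
stand-in, not the radeon driver's actual code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes (assumed values). 0x00 deliberately has no
 * name, mirroring the observation that 0x0 is unnamed in the source. */
#define RESULT_OK     0x01u
#define RESULT_FAILED 0xFFu

/* Simulated response register: returns the value the "SMC" would expose
 * on the given poll iteration. */
typedef uint32_t (*read_resp_fn)(unsigned poll);

/* Poll the response register up to max_polls times. If it never becomes
 * non-zero, the last value read (0) is returned, so 0x0 means "no answer
 * within the poll budget". */
static uint32_t fake_send_msg(read_resp_fn read_resp, unsigned max_polls)
{
    uint32_t resp = 0;

    assert(read_resp != NULL);
    for (unsigned i = 0; i < max_polls; i++) {
        resp = read_resp(i);
        if (resp != 0)
            break;
    }
    return resp;
}

/* A simulated SMC that answers OK on the third poll. */
static uint32_t resp_after_three(unsigned poll)
{
    return poll >= 2 ? RESULT_OK : 0;
}

/* A simulated SMC that never answers. */
static uint32_t resp_never(unsigned poll)
{
    (void)poll;
    return 0;
}
```

With resp_never, fake_send_msg returns 0x0 once its polls are exhausted,
which would match what I see from si_set_sw_state if the same pattern applies
there.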
Also, is there a bit in either GRBM_SOFT_RESET or SRBM_SOFT_RESET to reset the
SMC? I'm just curious if that might help.
I've been using vlc playing a movie while forcing a gpu reset through debugfs
to speed up testing, as it quickly and reliably causes this issue. I can also
reproduce this with TF2 reliably, it just takes 30-60 minutes to test. For
fixes that looked promising, I'd use TF2 to confirm that vlc's use of the UVD
wasn't causing a different reset failure from the one TF2 triggers.
Any help in debugging this issue would be greatly appreciated. Any
documentation I can review to better understand the GPU would be helpful. I
already checked the documentation linked to from the fdo wiki, but it didn't
mention this part.
One last thing, I can partially work around the hang by allowing the ib test
of the UVD to time out. I've used a long timeout (20 seconds) for testing.
Would a patch limiting this wait be accepted? It might allow users who run
into this to recover (sometimes TF2 will recover thanks to that workaround and
continue playing. Sometimes the system still locks up due to other issues,
but those don't seem to be hardware errors, so I'd rather work on them later).
Right now I add a timeout to every call to radeon_fence_wait, but if that
isn't a good idea I could add another similar function
(radeon_fence_wait_timeout?) that takes a timeout, and update the ring tests
appropriately.
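As a rough sketch of the shape such a helper could take, here is a user-space
model, not kernel code. The fake_fence type, the tick-based clock, and the
return convention (remaining ticks on success, 0 on timeout) are all
assumptions for illustration. radeon_fence_wait_timeout does not exist yet,
and a real version would sleep on the fence's wait queue rather than poll:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a fence: it signals once the clock reaches
 * signal_at. A hung ring is modelled by has_deadline == false. */
struct fake_fence {
    bool has_deadline;
    unsigned long signal_at;
};

static bool fake_fence_signaled(const struct fake_fence *f, unsigned long now)
{
    return f->has_deadline && now >= f->signal_at;
}

/* Wait up to `timeout` ticks. Returns the remaining ticks (at least 1)
 * if the fence signaled, or 0 if the wait timed out. */
static long fake_fence_wait_timeout(const struct fake_fence *f,
                                    unsigned long timeout)
{
    assert(f != NULL);
    for (unsigned long now = 0; ; now++) {
        if (fake_fence_signaled(f, now)) {
            unsigned long left = timeout - now;
            return left ? (long)left : 1;
        }
        if (now == timeout)
            return 0;
    }
}

/* A ring test built on it: reports failure instead of stalling forever. */
static int fake_ib_test(const struct fake_fence *f, unsigned long timeout)
{
    return fake_fence_wait_timeout(f, timeout) > 0 ? 0 : -1;
}
```

The point of a separate function is that only the ring tests would opt in to
the timeout, while existing radeon_fence_wait callers keep their current
behaviour.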
Thanks for reading my wall of text,
--
Matthew