Fixing system lockup after GPU reset on SI (Radeon HD 7970 GHz edition)

Sat Jan 16 23:10:06 PST 2016

Hi all,

I'm trying to work through this bug:
https://bugs.freedesktop.org/show_bug.cgi?id=93649
The main symptom is that the system locks up: some process tries to reset the 
gpu while a reset is already in progress, and the two deadlock.  The system 
still works over ssh, just the graphics get stuck.

I'm trying to fix the kernel side of this first, so my gpu can reliably reset 
when the game triggers the gpu lockup, after which I'll try tracking down the 
mesa issue which causes the lockup in the first place.  I've started some 
preliminary investigating, but I'm running out of ideas as public 
documentation on some of the AMD hardware is currently not available.

As far as I can tell, when the radeon module tries to reset the GPU it will 
always fail to bring up the VCE (which I haven't looked at yet, as it doesn't 
seem to be involved with this issue) and the UVD.  The VCE failure is caught 
early, and so the kernel module just ignores the whole thing.  However, the 
UVD claims to initialize properly.  But when the kernel module tries to run a 
test IB on the UVD ring, it stalls forever.  Note: before any issues, the UVD 
works on my GPU, tested with a random media file and vlc.

I poked IRC some time ago, where Dave Airlie suggested that UVD is really 
unhappy with being reset, and to try disabling that as a test.  Nothing I 
tried yielded any improvement.  I also noticed that the SMC (I assume that is 
some sort of power manager?  I didn't find anything on it besides the source 
code) fails to initialize after a reset, with the error:
[drm:si_dpm_set_power_state [radeon]] *ERROR* si_set_sw_state failed
I'm wondering if this might be causing the issue instead, as the source code 
fiddles with the UVD after this error.  Not knowing more, I can't say for 
sure.

Details on testing done:
For the UVD, I tried forcing it to be completely reset by setting the 
appropriate bit in SRBM_SOFT_RESET, but that still caused the failure to 
happen in the same place.
Based on the advice from IRC, I tried disabling large parts of the UVD startup 
and shutdown code, to avoid disabling anything.  Some of the initialization 
process also disables parts of the UVD, which is why it was disabled as well.  
There was no change.  Note the initial start was never changed, and vlc was 
always able to play a video using it.
Suspecting the SMC, I've got the return code from the message sent in 
si_set_sw_state.  It always returns 0x0, which doesn't have a name in the 
source code.  I guess this means a timeout, from looking at the code.  I have 
no idea where to look further, as I couldn't find any documentation.  If there 
is any I missed, I'd be happy to take a look and see what is going on.  I also 
captured traces of every command sent to the SMC, if that would help.  I 
haven't checked them much, other than to note they are different from those 
on boot.  
Also, is there a bit in either GRBM_SOFT_RESET or SRBM_SOFT_RESET to reset the 
SMC?  I'm just curious if that might help.
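
A minimal userspace sketch of why a 0x0 result would point to a timeout, 
assuming the send routine polls a response register until it reads non-zero 
and then returns whatever it last read.  This is not the driver code: 
mock_smc, mock_read_resp and send_msg_poll are hypothetical names, and the 
result values 0x01 (OK) and 0xFF (failed) are assumptions made for 
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed result codes; the point is only that every named code is non-zero. */
#define RESULT_OK      0x01u
#define RESULT_FAILED  0xFFu

/* Hypothetical stand-in for the SMC mailbox: the response register reads
 * non-zero once more than `ready_after` polls have happened, or never if
 * `ready_after` is negative (a hung SMC). */
struct mock_smc {
    int ready_after;
    uint32_t resp_value;
    int polls;
};

uint32_t mock_read_resp(struct mock_smc *smc)
{
    smc->polls++;
    if (smc->ready_after >= 0 && smc->polls > smc->ready_after)
        return smc->resp_value;
    return 0;
}

/* Poll at most max_polls times and return the last value read.
 * If the register never changes, the caller sees 0, which matches
 * none of the named result codes. */
uint32_t send_msg_poll(struct mock_smc *smc, int max_polls)
{
    uint32_t resp = 0;

    assert(max_polls > 0);
    for (int i = 0; i < max_polls; i++) {
        resp = mock_read_resp(smc);
        if (resp != 0)
            break;
    }
    return resp;
}
```

With a mock that answers after three polls, send_msg_poll returns RESULT_OK; 
with one that never answers, it returns 0 once max_polls is used up, so under 
these assumptions 0 can only come from the polling loop running out.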

I've been using vlc playing a movie while forcing a gpu reset through debugfs 
to speed up testing, as it quickly and reliably causes this issue.  I can also 
reproduce this with TF2 reliably; it just takes 30-60 minutes to test.  For 
solutions I was hopeful about, I'd use TF2 to confirm that vlc using the UVD 
wasn't causing a failure on reset different from the TF2 one.

Any help in debugging this issue would be greatly appreciated.  Any 
documentation I can review to better understand the GPU would be helpful.  I 
already checked the documentation linked to from the fdo wiki, but it didn't 
mention this part.

One last thing: I can partially work around the hang by allowing the ib test 
of the UVD to time out.  I've used a long timeout (20 seconds) for testing.  
Would a patch limiting this be accepted?  It might allow users who run into 
this to recover (sometimes TF2 will recover thanks to that workaround, and 
continue playing.  Sometimes the system still locks up due to other issues, 
but those don't seem to be hardware errors so I'd rather work on that later).  
Right now I add a timeout to every call to radeon_fence_wait, but if that 
isn't a good idea I could add another similar function 
(radeon_fence_wait_timeout?) that takes a timeout, and update the ring tests 
appropriately.
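
The idea of a bounded fence wait can be sketched in userspace as follows.  
This is only an illustration under stated assumptions, not a proposed patch: 
mock_ring, mock_fence, mock_fence_wait_timeout and mock_ib_test are all 
hypothetical, and time is modelled as abstract ticks rather than jiffies.  
The return convention (0 on timeout, remaining time of at least 1 on success) 
is chosen to resemble the kernel's wait_event_timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mock ring: completed_seq advances by `step` per tick.
 * step == 0 models a stalled ring that never signals anything. */
struct mock_ring {
    uint64_t completed_seq;
    uint64_t step;
};

/* Hypothetical mock fence: signaled once the ring reaches `seq`. */
struct mock_fence {
    struct mock_ring *ring;
    uint64_t seq;
};

bool mock_fence_signaled(const struct mock_fence *f)
{
    return f->ring->completed_seq >= f->seq;
}

/* Wait up to timeout_ticks for the fence.
 * Returns the remaining ticks (at least 1) on success, 0 on timeout. */
long mock_fence_wait_timeout(struct mock_fence *f, long timeout_ticks)
{
    assert(timeout_ticks > 0);
    while (timeout_ticks > 0) {
        if (mock_fence_signaled(f))
            return timeout_ticks;
        f->ring->completed_seq += f->ring->step; /* one tick of mock progress */
        timeout_ticks--;
    }
    /* Signaled exactly as the timeout expired still counts as success. */
    return mock_fence_signaled(f) ? 1 : 0;
}

/* A ring test built on the bounded wait fails cleanly instead of hanging. */
int mock_ib_test(struct mock_fence *f, long timeout_ticks)
{
    return mock_fence_wait_timeout(f, timeout_ticks) > 0 ? 0 : -1;
}
```

Keeping the unbounded wait untouched and adding a separate bounded variant 
would limit the behaviour change to callers that opt in, such as the ring 
tests.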

Thanks for reading my wall of text,
-- 
Matthew