<html> <head> <base href="https://bugs.freedesktop.org/"> </head> <body> <div> <a class="bz_bug_link bz_status_NEW " title="NEW - Rebinding AMDGPU causes initialization errors [R9 290 / 4.10 kernel]" href="https://bugs.freedesktop.org/show_bug.cgi?id=101946#c12">Comment # 12</a> on <a class="bz_bug_link bz_status_NEW " title="NEW - Rebinding AMDGPU causes initialization errors [R9 290 / 4.10 kernel]" href="https://bugs.freedesktop.org/show_bug.cgi?id=101946">bug 101946</a> from <a class="email" href="mailto:beanow@oscp.info" title="Robin <beanow@oscp.info>"> Robin</a> <pre>Created <a href="attachment.cgi?id=133103" name="attach_133103" title="case2-rescan-amd.sh">attachment 133103</a> <a href="attachment.cgi?id=133103&action=edit" title="case2-rescan-amd.sh">[details]</a>
case2-rescan-amd.sh

In an attempt to make a second test case I've created a new script that
produced some noteworthy results.

Rather than bind/unbind, this approach uses rmmod and modprobe, removing the
PCI device and rescanning to switch drivers.

Please excuse how poorly written and contrived this "hotswapping" test case is.
I'll try isolating what causes the differences from the first test case in some
mutations next, but wanted to share the intermediate results as-is first.

Some details about this test.
The starting point is the same as the other test case: TTY, with vfio-pci
taking the card first.

In order, it will:
1. rmmod the current driver.
2. remove one PCI subdevice (either VGA or Audio)
3. modprobe the new driver.
4. perform a PCI rescan.

It does this in a loop, switching back and forth between amdgpu and vfio-pci.
Another difference is that snd_hda_intel is in use elsewhere, so it does not
get an rmmod and, because of this, will not switch back to vfio-pci.

---

As for results, on 4.10 there was no change.
From the 2nd binding onward, the driver fails to init with this error:
> [ 160.013733] [drm:ci_dpm_enable [amdgpu]] *ERROR* ci_start_dpm failed
> [ 160.014134] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block &lt;amdgpu_powerplay&gt; failed -22

For 4.13rc2, drm-next-4.14-wip and drm-next-4.14-wip with patch 3 it's a
different story.

They have an irregular pattern of errors every loop.
The first error crops up on either the 2nd or 3rd time through. Typically this is:

> [ 211.818341] [drm:cik_sdma_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 9 test failed (0xCAFEDEAD)
> [ 211.818725] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block &lt;cik_sdma&gt; failed -22

After that first error, the following error can appear as well.

> [ 247.626839] [drm:gfx_v7_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 1 test failed (scratch(0xC040)=0xCAFEDEAD)

And instead of ring 9, ring 10 may fail.

> [ 356.686092] [drm:cik_sdma_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 10 test failed (0xCAFEDEAD)
> [ 356.686580] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block &lt;cik_sdma&gt; failed -22

They seem to happen randomly in the following combinations:
A. Ring 1 fails.
B. Ring 9 or 10 fails.
C. Ring 1 + Ring 9 or 10 fails.

Most importantly, though, the hw_init error occurs only if ring 9 or 10 fails
(combinations B or C).
If it's just a ring 1 failure (A), the driver will successfully init the GPU.

Also, the drm-next-4.14-wip with patch 3 kernel hits combination A, and thus a
successful init, a lot more often than the other two.

---

So my suspicion is that this difference could be due to:
- Repeatedly rmmodding and modprobing being part of the loop now.
- The rescanning method vs bind/unbind.
- The different treatment of the Audio component.
- The different access of vfio-pci to the Audio component.

So I will make several variations on the test scripts to try and narrow this
down.</pre> </div> <hr> You are receiving this mail because: <ul> <li>You are the assignee for the bug.</li> </ul> </body> </html>