[Bug 101946] Rebinding AMDGPU causes initialization errors [R9 290 / 4.10 kernel]

Fri Jul 28 17:26:43 UTC 2017

https://bugs.freedesktop.org/show_bug.cgi?id=101946

--- Comment #12 from Robin <beanow at oscp.info> ---
Created attachment 133103
  --> https://bugs.freedesktop.org/attachment.cgi?id=133103&action=edit
case2-rescan-amd.sh

In an attempt to make a second test case I've created a new script that
produced some noteworthy results.

Rather than bind/unbind, this approach uses rmmod,modprobe, removing the pci
device and rescanning to switch drivers.

Please excuse how poorly written and contrived the test case for "hotswapping"
proposes, I'll try isolating what causes the differences with the first test
case in some mutations next, but wanted to share the intermediate results as-is
first.

Some details about this test.

The starting point is the same as the other test case, TTY and vfio-pci taking
the card first. In order it will:

1. rmmod the current driver.
2. remove one pci subdevice (either VGA or Audio)
3. modprobe the new driver.
4. perform a pci rescan.

It will do this in a loop switching between amdgpu and vfio-pci again.

Another difference is that snd_hda_intel is in use elsewhere, it does not get
an rmmod and will not switch back to vfio-pci because of this.

---

As for results, on 4.10 there was no change.
>From the 2nd binding onward this error will fail to init the driver.
> [  160.013733] [drm:ci_dpm_enable [amdgpu]] *ERROR* ci_start_dpm failed
> [  160.014134] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <amdgpu_powerplay> failed -22

For 4.13rc2, drm-next-4.14-wip and drm-next-4.14-wip with patch 3 it's a
different story.

They have an irregular pattern of errors every loop.
Either the 2nd or 3rd time the first error crops up. Typically this is:
> [  211.818341] [drm:cik_sdma_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 9 test failed (0xCAFEDEAD)
> [  211.818725] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <cik_sdma> failed -22

After that first error, additionally the following error can appear as well.
> [  247.626839] [drm:gfx_v7_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 1 test failed (scratch(0xC040)=0xCAFEDEAD)

And instead of ring 9, ring 10 may fail.
> [  356.686092] [drm:cik_sdma_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 10 test failed (0xCAFEDEAD)
> [  356.686580] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <cik_sdma> failed -22

They seem to randomly happen in the following combinations:

A. Ring 1 fails.
B. Ring 9 or 10 fails.
C. Ring 1 + Ring 9 or 10 fails.

Most importantly though. Only if 9 or 10 fail (B or C combinations) will the
hw_init error occur. If it's just a ring 1 failure (A) the driver will
successfully init the GPU.

Also, the drm-next-4.14-wip with patch 3 kernel will have this A combination
and successful init a lot more often that the other two.

---

So my suspicion is that this difference could be due to:
- Repeatedly rmmodding and modprobing being part of the loop now.
- The rescanning method vs bind/unbind.
- The different treatment of the Audio component.
- The different access of vfio-pci to the Audio component.

So I will make several variations on the test scripts to try and narrow this
down.

-- 
You are receiving this mail because:
You are the assignee for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/dri-devel/attachments/20170728/87c19dd7/attachment.html>