<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <br>
    <br>
    <div class="moz-cite-prefix">On 09/05/22 18:23, Bjorn Helgaas wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CABhMZUW4=XUOwFAE74nebnZcKBp5pwktWufHNBpB79t3iUeQ3A@mail.gmail.com">
      <pre class="moz-quote-pre" wrap="">On Sun, May 8, 2022 at 3:29 PM <a class="moz-txt-link-rfc2396E" href="mailto:bugzilla-daemon@kernel.org">&lt;bugzilla-daemon@kernel.org&gt;</a> wrote:
</pre>
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap=""><a class="moz-txt-link-freetext" href="https://bugzilla.kernel.org/show_bug.cgi?id=215958">https://bugzilla.kernel.org/show_bug.cgi?id=215958</a>

            Bug ID: 215958
           Summary: thunderbolt3 egpu cannot disconnect cleanly
           Product: Drivers
           Version: 2.5
    Kernel Version: 5.17.0-1003-oem #3-Ubuntu SMP PREEMPT
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: PCI
          Assignee: <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:drivers_pci@kernel-bugs.osdl.org">drivers_pci@kernel-bugs.osdl.org</a>
          Reporter: <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:r087r70@yahoo.it">r087r70@yahoo.it</a>
        Regression: No
</pre>
      </blockquote>
      <pre class="moz-quote-pre" wrap="">I assume this is not a regression, right?  If it is a regression, what
previous kernel worked correctly?</pre>
    </blockquote>
    <br>
    No, it's not a regression as far as I know, but I haven't tested
    every kernel version, only 5.15 and 5.17.<br>
    <br>
    <blockquote type="cite"
cite="mid:CABhMZUW4=XUOwFAE74nebnZcKBp5pwktWufHNBpB79t3iUeQ3A@mail.gmail.com">
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">I have an external egpu (Radeon 6600 RX) connected through thunderbolt3 to my
Thinkpad X1 carbon 6th Gen.. When I disconnect the thunderbolt3 cable I get the
following error in dmesg:

[21874.194994] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for
index:18 param:0x00000005 message:TransferTableSmu2Dram?
...
...
[21879.544226] amdgpu 0000:0c:00.0: amdgpu: Failed to disable smu features.
[21879.544230] amdgpu 0000:0c:00.0: amdgpu: Fail to disable dpm features!
[21879.544238] [drm] free PSP TMR buffer
</pre>
      </blockquote>
      <pre class="moz-quote-pre" wrap="">The above looks like what amdgpu would see when the GPU is no longer
accessible (writes are dropped and reads return 0xffffffff).  It's
possible amdgpu could notice this and shut down more gracefully, but I
don't think it's the main problem here and it probably wouldn't force
you to reboot.</pre>
    </blockquote>
    <br>
    Actually, in this state I cannot run `modprobe -r amdgpu`:<br>
    <br>
    <font face="monospace">modprobe: FATAL: Module amdgpu is in use.</font><br>
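    <br>
    In case it helps to see what is keeping the module busy, here is a
    minimal Python sketch that reads the module reference count and the
    list of dependent modules from sysfs. The helper name module_usage
    and its sysfs_root parameter are made up for illustration, and it
    assumes the kernel exposes /sys/module/amdgpu/refcnt and
    /sys/module/amdgpu/holders, which is only the case when module
    unloading is enabled:<br>
    <pre>
```python
from pathlib import Path


def module_usage(name, sysfs_root="/sys/module"):
    """Return (refcount, holders) for a loaded module, or None if not loaded.

    refcount is None when the refcnt file is missing (assumption: this
    means the kernel was built without module unloading support).
    """
    mod_dir = Path(sysfs_root) / name
    if not mod_dir.is_dir():
        return None

    # Number of references currently held on the module.
    refcnt_file = mod_dir / "refcnt"
    refcnt = None
    if refcnt_file.is_file():
        refcnt = int(refcnt_file.read_text().strip())

    # Other modules that depend on this one.
    holders_dir = mod_dir / "holders"
    holders = []
    if holders_dir.is_dir():
        holders = sorted(entry.name for entry in holders_dir.iterdir())

    return refcnt, holders


if __name__ == "__main__":
    print(module_usage("amdgpu"))
```
</pre>
    A non-zero count with an empty holders list would suggest that the
    reference comes from something other than another module, for
    example a process that still has the device open, but that is only
    a guess on my side.<br>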
    <br>
    <br>
    <br>
    <blockquote type="cite"
cite="mid:CABhMZUW4=XUOwFAE74nebnZcKBp5pwktWufHNBpB79t3iUeQ3A@mail.gmail.com">
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">[21880.455935] i915 0000:00:02.0: vgaarb: changed VGA decodes:
olddecodes=none,decodes=io+mem:owns=io+mem
[21880.456218] pci 0000:0c:00.0: Removing from iommu group 14
...
...
[21880.457311] pci_bus 0000:09: busn_res: [bus 09-3a] is released
[21880.457543] pci 0000:08:00.0: Removing from iommu group 14
</pre>
      </blockquote>
      <pre class="moz-quote-pre" wrap="">This looks like removing 0c:00.0 (the GPU) and two switches leading to
it (probably part of the Thunderbolt topology), so to be expected.

</pre>
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">[21880.457847] pci_bus 0000:06: Allocating resources
[21880.457888] pcieport 0000:06:02.0: bridge window [io  0x1000-0x0fff] to [bus
3b] add_size 1000
...
...
[21880.457947] pcieport 0000:06:02.0: BAR 13: failed to assign [io  size
0x1000]
</pre>
      </blockquote>
      <pre class="moz-quote-pre" wrap="">I'm not sure why we're allocating resources as part of the removal.
The hierarchies under 06:02.0 (to [bus 3b]) and 06:04.0 (to [bus
3c-6f]) seem to be siblings of the hierarchy you just removed (my
guess is that was 06:01.0 to [bus 08-3a]).  But again, shouldn't
require a reboot.

</pre>
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">upon reconnection of the cable I get:

[22192.753261] input: HDA ATI HDMI HDMI/DP,pcm=3 as
/devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input98
[22192.753738] input: HDA ATI HDMI HDMI/DP,pcm=7 as
/devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input99
[22192.753952] input: HDA ATI HDMI HDMI/DP,pcm=8 as
/devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input100
[22192.755234] input: HDA ATI HDMI HDMI/DP,pcm=9 as
/devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input101
[22192.763885] input: HDA ATI HDMI HDMI/DP,pcm=10 as
/devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input102
[22192.975773] thunderbolt 0-1: new device found, vendor=0x127 device=0x1
[22192.975786] thunderbolt 0-1: Razer Core X

but the egpu no longer appears in `xrandr --listproviders`. Full reboot is
needed.
</pre>
      </blockquote>
      <pre class="moz-quote-pre" wrap="">Can you please build with CONFIG_DYNAMIC_DEBUG=y, boot with
'dyndbg="file pciehp* +p"', and attach the complete dmesg log to the
bugzilla?  Also please attach the complete "sudo lspci -vv" output
(before the unplug and after the replug)?
</pre>
    </blockquote>
    <br>
    Ironically, I rebooted to get the lspci output, and now I can
    no longer reproduce the above state. What I see now is that after
    attaching the egpu, it is enabled *without* needing to restart
    the X server, while after detaching it the X server is restarted and
    the card is released correctly, although the amdgpu driver stays
    loaded. In this case I can `modprobe -r amdgpu` without problems. I
    could connect and disconnect many times without issues. Attached is
    the lspci output after a fresh boot, after egpu connection, and
    after disconnection.<br>
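    <br>
    To compare the saved lspci outputs, something like the following
    Python sketch could be used. It is only an illustration: the
    function name diff_snapshots and the file names passed on the
    command line are hypothetical, and it simply produces a unified
    diff of two text files with the standard difflib module:<br>
    <pre>
```python
import difflib
import sys


def diff_snapshots(before, after, before_name="before", after_name="after"):
    """Return the unified-diff lines between two saved lspci outputs."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=before_name,
            tofile=after_name,
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Usage (hypothetical file names): script.py lspci-before.txt lspci-after.txt
    if len(sys.argv) != 3:
        sys.exit("expected two file names")
    with open(sys.argv[1]) as first, open(sys.argv[2]) as second:
        lines = diff_snapshots(first.read(), second.read(),
                               sys.argv[1], sys.argv[2])
    print("\n".join(lines))
```
</pre>
    An empty result means the two snapshots are identical.<br>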
    <br>
    I will test more over the next few days.<br>
    <br>
    Thank you,<br>
    Roberto<br>
    <br>
  </body>
</html>