Couple of issues with amdgpu on my WX4100

Mon Jan 4 11:34:34 UTC 2021

Hi Maxim,

I can't help with the display-related issues. Probably the best approach to 
get those fixed would be to open a bug report for them on the FDO tracker.

But I'm the one who implemented the resizable BAR support, and your 
analysis of the problem sounds about right to me.

The reason why this works on Linux is most likely because we restore the 
BAR size on resume (and maybe during initial boot as well).

See this patch for reference:

commit d3252ace0bc652a1a244455556b6a549f969bf99
Author: Christian König <ckoenig.leichtzumerken at gmail.com>
Date:   Fri Jun 29 19:54:55 2018 -0500

     PCI: Restore resized BAR state on resume

     Resize BARs after resume to the expected size again.

     BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199959
     Fixes: d6895ad39f3b ("drm/amdgpu: resize VRAM BAR for CPU access v6")
     Fixes: 276b738deb5b ("PCI: Add resizable BAR infrastructure")
     Signed-off-by: Christian König <christian.koenig at amd.com>
     Signed-off-by: Bjorn Helgaas <bhelgaas at google.com>
     CC: stable at vger.kernel.org      # v4.15+

It should be trivial to add this to the reset module as well, most 
likely in a completely vendor-independent way: I'm not sure what a bus 
reset does to this configuration, so restoring it every time should 
be the most defensive approach.
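
As a rough illustration of what "restoring" means here, the following is a
userspace Python sketch that mimics the logic of the patch above on an
in-memory copy of PCI config space. It is neither the kernel code nor
anything the reset module currently contains. The helper names
(find_ext_cap, rebar_sizes, restore_rebar) are made up for this example, and
the register offsets and masks are written from memory of
include/uapi/linux/pci_regs.h around the time of the patch, so please
double-check them against the header you build with.

```python
import struct

# Assumed to match pci_regs.h of that era; newer headers may use a wider size mask.
PCI_EXT_CAP_START = 0x100            # extended capabilities start here
PCI_EXT_CAP_ID_REBAR = 0x15          # Resizable BAR extended capability ID
PCI_REBAR_CTRL = 8                   # offset of the first control register
PCI_REBAR_CTRL_BAR_IDX = 0x00000007  # which BAR this control register describes
PCI_REBAR_CTRL_NBAR_MASK = 0x000000E0
PCI_REBAR_CTRL_NBAR_SHIFT = 5
PCI_REBAR_CTRL_BAR_SIZE = 0x00001F00 # encoded size: 0 = 1 MiB, 1 = 2 MiB, ...
PCI_REBAR_CTRL_BAR_SHIFT = 8


def read32(cfg, off):
    """Read a little-endian dword from a config space buffer."""
    return struct.unpack_from("<I", cfg, off)[0]


def write32(cfg, off, val):
    """Write a little-endian dword into a config space buffer."""
    struct.pack_into("<I", cfg, off, val)


def find_ext_cap(cfg, cap_id):
    """Walk the extended capability list; return the offset of cap_id, or 0."""
    pos = PCI_EXT_CAP_START
    for _ in range((4096 - 256) // 8):   # bound the walk against broken lists
        if pos < PCI_EXT_CAP_START or pos + 4 > len(cfg):
            break
        hdr = read32(cfg, pos)
        if hdr in (0, 0xFFFFFFFF):
            break
        if hdr & 0xFFFF == cap_id:
            return pos
        pos = (hdr >> 20) & 0xFFC        # next-capability pointer
    return 0


def rebar_sizes(cfg):
    """Return {bar_index: current_size_in_bytes} as programmed in the device."""
    pos = find_ext_cap(cfg, PCI_EXT_CAP_ID_REBAR)
    if not pos:
        return {}
    ctrl = read32(cfg, pos + PCI_REBAR_CTRL)
    nbars = (ctrl & PCI_REBAR_CTRL_NBAR_MASK) >> PCI_REBAR_CTRL_NBAR_SHIFT
    sizes = {}
    for i in range(nbars):
        ctrl = read32(cfg, pos + PCI_REBAR_CTRL + 8 * i)
        enc = (ctrl & PCI_REBAR_CTRL_BAR_SIZE) >> PCI_REBAR_CTRL_BAR_SHIFT
        sizes[ctrl & PCI_REBAR_CTRL_BAR_IDX] = 1 << (enc + 20)
    return sizes


def restore_rebar(cfg, expected):
    """Rewrite each size field from expected = {bar_index: size_in_bytes}."""
    pos = find_ext_cap(cfg, PCI_EXT_CAP_ID_REBAR)
    if not pos:
        return
    ctrl = read32(cfg, pos + PCI_REBAR_CTRL)
    nbars = (ctrl & PCI_REBAR_CTRL_NBAR_MASK) >> PCI_REBAR_CTRL_NBAR_SHIFT
    for i in range(nbars):
        off = pos + PCI_REBAR_CTRL + 8 * i
        ctrl = read32(cfg, off)
        bar = ctrl & PCI_REBAR_CTRL_BAR_IDX
        if bar not in expected:
            continue                      # no cached size known for this BAR
        enc = (expected[bar].bit_length() - 1) - 20   # ilog2(size) - 20
        ctrl &= ~PCI_REBAR_CTRL_BAR_SIZE
        ctrl |= enc << PCI_REBAR_CTRL_BAR_SHIFT
        write32(cfg, off, ctrl)
```

On a real system the buffer would come from the device's sysfs config file
(reading past the first 256 bytes typically needs root), and the expected
sizes would be the ones the kernel has cached for the BARs. Comparing
rebar_sizes() before and after a reset against those cached sizes would be
one way to check whether the mismatch described in the report is really
what happens.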

Let me know if you have any more questions on this.

Regards,
Christian.

On 02.01.21 at 23:42, Maxim Levitsky wrote:
> Hi!
>
> I have been using this card for about a year, and I would first like to say
> thanks for the open source driver that you made for it, for Big Navi,
> and for Threadripper, which brought the fun back to computing.
>
> I bought this card primarily to use as a host GPU in the VFIO-enabled multi-seat
> system I am building, and recently I was able (with a minor issue I managed to
> solve, more about it later) to pass the GPU to both Linux and Windows guests
> mostly flawlessly.
>   
> I do have experience in kernel development and debugging, so I am willing
> to test patches, etc. Any help is welcome!
>   
> So these are the issues:
>   
> 1. (The biggest issue) The amdgpu driver often crashes when plugging in an input.
>
> I tested this now on purpose with 'amdgpu.dc=1' by slowly plugging and unplugging
> an input connector while I wait for the output to stabilize between each cycle,
> and still the issue reproduced after a dozen (or so) tries.
> (It only happens when I plug the connector, and never happens when I unplug it)
>
> Then I unloaded the amdgpu driver and loaded it again with dc=0.
> This does sort of work but takes a lot of time. The dmesg output is attached
> (amdgpu_dc1_plug_bug.txt)
>   
> I did try to increase the number of tries in dm_helpers_read_local_edid, to
> something silly like 1000, but no luck.
>   
> I also tried to remove the code below the
> 'Abort detection for non-DP connectors if we have no EDID'
> Also no luck.
>
>
> This bug pretty much makes it impossible to use the card daily as is
> since I do connect/disconnect monitors often, especially due to VFIO usage.
>   
> 2. I found out that running without the new DC framework (amdgpu.dc=0) solves
> issue 1 completely (but costs HDMI sound - HDMI sound only works with amdgpu.dc=1)
>
> I have been using this card like that for at least half a year and haven't had
> a single connector plug/unplug related crash.
>
> Issue 2 however is that in this mode (I haven't tried to reproduce this
> with amdgpu.dc=1 yet), sometimes when I unbind the amdgpu driver
> the amdgpu complains about a leaked connector and crashes a bit later on.
> I haven't yet tracked the combination of things needed
> to trigger this, but it did happen to me about 3 times already.
>   
> I did put a WARN_ON(1) in __drm_connector_put_safe, to see which
> caller triggers the delayed work that frees the connector when it is
> too late.
>
> I attached a backtrace with the above WARN_ON and the crash (connector_leak_bug.txt).
> I also attached, for reference, the 'amdgpu_unbind' script that I use to unbind
> the amdgpu driver.
>   
> 3. When doing VFIO passthrough of this card, I found out that it doesn't
> suffer that much from the reset bug. As long as I shut down the guest
> in a clean manner, I can start it again. The vendor_reset module, however,
> makes the reset work even when I shut down the guest right in the middle
> of a running 3D app, and I tested this many times.
>   
> _However_ this only works if I never load the amdgpu Linux driver.
> Otherwise a Windows guest still boots, but all 3D apps in it crash very early.
>
> I tried both the stock drivers that Windows auto-installs and the latest AMD
> workstation drivers from the AMD site.
>
> Linux guests do work.
>   
> I found out that the amdgpu driver resizes the device BARs (I have a TRX40 platform,
> so I don't know whether this platform supports AMD Smart Access Memory or not,
> but according to lspci the device does support resizable BARs).
>
> If I patch amdgpu's BAR resize out, then the Windows guest _does_ work
> regardless of whether I loaded amdgpu beforehand or not. Linux guests also still work.
> I haven't measured the performance impact of this.
>
> For debugging this, I did try to hide the PCI_EXT_CAP_ID_REBAR capability
> from the VM, but it made no difference.
>
> I suspect that once the GPU is reset, the BARs
> revert to their original sizes, but VFIO uses the sizes that are cached
> by the kernel, so the guest thinks that the BARs are of one size
> while they are actually of another. I have no idea, though, why this
> does work with a Linux guest.
>
> I have attached the PCI config with amdgpu running, once with my patch that
> stops it from resizing the BARs, and once without that patch for reference
> (amdgpu_pciconfig_noresize.txt, amdgpu_pciconfig_resize.txt).
>
>
> 4. I found out that amdgpu runtime PM sometimes breaks the card if the last
> output is disconnected from it. I didn't debug it much, as I just disabled
> it with amdgpu.runpm=0. I will do more debugging on this later.
>   
>
> Please let me know if you have any questions.
> Don't hesitate to ask me for more information.
>
> My setup:
> 3 outputs, all HDMI, converted with DP->HDMI adapters, of which 2 are 1080P
> monitors, and 1 is a 1080P TV. The issues I describe above are reproducible
> on all the outputs.
>
>
> I am running a 5.10.0 kernel with a few patches and the kvm-queue branch
> merged for my day-to-day work on KVM.
>   
> You can find the exact kernel I use and its .config on
> https://gitlab.com/maximlevitsky/linux/-/commits/kernel-starship-5.10
>   
>
> Best regards,
> 	Maxim Levitsky