[PATCH v2] drm/amd: Add pre-zen AMD hardware to PCIe dynamic switching exclusions
Alex Deucher
alexdeucher at gmail.com
Sun Apr 6 19:58:10 UTC 2025
On Thu, Apr 3, 2025 at 3:13 PM Mario Limonciello <superm1 at kernel.org> wrote:
>
> On 4/3/2025 10:48 AM, Alex Deucher wrote:
> > On Wed, Apr 2, 2025 at 11:12 PM Mario Limonciello <superm1 at kernel.org> wrote:
> >>
> >> From: Mario Limonciello <mario.limonciello at amd.com>
> >>
> >> AMD RX580 when added to an AMD Phenom II system has problems with overheating. This is due to
> >
> > I don't think this is entirely accurate. I think the GPU gets hot
> > because the device hangs due to a problem with changing the PCIe
> > clocks.
> >
> >> changes with PCIe dynamic switching introduced by commit 466a7d115326e
> >> ("drm/amd: Use the first non-dGPU PCI device for BW limits").
> >>
> >> To avoid the risk of other issues with old hardware, require at least Zen hardware
> >> on the AMD side to enable PCIe dynamic switching.
> >
> > I'm pretty sure PCIe reclocking worked on pre-Zen hardware. We've
> > supported this on our GPUs going back at least 15 or more years. I
> > suspect the actual problem is that some links may not reliably train
> > at the full bandwidth on some motherboards. Forcing a higher link
> > speed may cause problems.
>
> It seems odd to me that it would advertise a higher link speed than it
> could train at.

That's why we train the link: to determine what speed is reliable. It
could be that there is a marginal trace on the motherboard that has
deteriorated over time or was never reliable to begin with. It would
be interesting to know if the link used to work reliably on this
board.

>
> > Maybe it would be better to limit the max
> > PCIe link rate to whatever the link is currently trained to. IIRC,
> > PCIe links will train at the fastest speed possible by default. The
> > previous behavior was to limit the max clock to the slowest link in
> > the topology to save power, but then we changed it to use the fastest
> > link possible based on the PCIe link caps. Perhaps limiting it to the
> > fastest currently trained link rate would be better.
>
> I mean that's essentially what happens when
> amdgpu_device_pcie_dynamic_switching_supported() returns that it doesn't
> work.

I mean rather than checking the PCIe caps, check the current link
speed instead. pcie_bandwidth_available() returns the speed and lanes
of the slowest link in the topology; what we want is the current speed
that the link upstream of the GPU is trained at. If there is no
USB4/TB or limited-speed bridge upstream of the GPU, then that
function should return the current speed of the link, which would be
fine. The problem is that
amdgpu_device_pcie_dynamic_switching_supported() returning false
disables PCIe DPM, so we don't dynamically change the PCIe speed/lanes
at runtime. I suspect PCIe DPM would work fine as long as we don't go
past the speed the link is currently trained at.
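
To make the distinction concrete, here is a rough userspace C model, not
driver code. The struct and both helper names are made up for
illustration; real code would read the link status registers instead. It
contrasts the slowest trained link in the chain (which is what
pcie_bandwidth_available() effectively reports) with the trained speed
of the link directly upstream of the GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one PCIe link; speeds are PCIe generations. */
struct link_model {
	int cap_gen;     /* highest generation the link advertises */
	int trained_gen; /* generation the link actually trained to */
};

/*
 * Slowest trained link in the chain. The chain is ordered from the
 * root port (index 0) down to the link the GPU sits on (index n - 1).
 */
static int slowest_trained_gen(const struct link_model *chain, size_t n)
{
	size_t i;
	int min_gen;

	assert(n > 0);
	min_gen = chain[0].trained_gen;
	for (i = 1; i < n; i++) {
		if (chain[i].trained_gen < min_gen)
			min_gen = chain[i].trained_gen;
	}
	return min_gen;
}

/* Trained speed of the link directly upstream of the GPU. */
static int gpu_link_trained_gen(const struct link_model *chain, size_t n)
{
	assert(n > 0);
	return chain[n - 1].trained_gen;
}
```

With a Gen1-limited bridge above a GPU link trained at Gen3, the first
helper gives 1 and the second gives 3. With no such bridge in the path
the two agree.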
>
> If your theory is right, maybe what we really need is a pile of DMI
> quirks for M/B that are having this problem.

Depends on whether it's a general problem or something specific to
this particular board, e.g., the slot on this board has deteriorated.
I think what we want is to enable PCIe DPM, but just limit the link
to the currently trained speed rather than the max advertised speed.
If the links are reliable, they should train at the max speed on power up.
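
Roughly what I have in mind, as a standalone C sketch under my own
assumptions. dpm_max_gen() and dpm_clamp_gen() are hypothetical names,
not existing amdgpu functions, and the integer generation numbers stand
in for whatever the power management code really uses.

```c
#include <assert.h>

/*
 * Hypothetical: highest PCIe generation that DPM may request. Start
 * from the lower of the GPU and platform caps, then refuse to exceed
 * the speed the link trained to at power up.
 */
static int dpm_max_gen(int gpu_cap_gen, int platform_cap_gen, int trained_gen)
{
	int max_gen;

	assert(gpu_cap_gen >= 1 && platform_cap_gen >= 1 && trained_gen >= 1);
	max_gen = gpu_cap_gen < platform_cap_gen ? gpu_cap_gen : platform_cap_gen;
	if (trained_gen < max_gen)
		max_gen = trained_gen;
	return max_gen;
}

/* Hypothetical: clamp a requested DPM level into the range [1, max_gen]. */
static int dpm_clamp_gen(int requested_gen, int max_gen)
{
	if (requested_gen < 1)
		return 1;
	if (requested_gen > max_gen)
		return max_gen;
	return requested_gen;
}
```

So a slot that advertises Gen3 but only trained at Gen1 stays capped at
Gen1, while DPM can still switch down and back up below that cap on
boards where the link trained at full speed.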
Alex
>
> >
> > Alex
> >
> >>
> >> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4098
> >> Fixes: 466a7d115326e ("drm/amd: Use the first non-dGPU PCI device for BW limits")
> >> Signed-off-by: Mario Limonciello <mario.limonciello at amd.com>
> >> ---
> >> v2:
> >> * Cover more hardware
> >> ---
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
> >> 1 file changed, 5 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index a30111d2c3ea0..caa44ee788c8f 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -1854,6 +1854,9 @@ bool amdgpu_device_seamless_boot_supported(struct amdgpu_device *adev)
> >>   *
> >>   * https://edc.intel.com/content/www/us/en/design/products/platforms/details/raptor-lake-s/13th-generation-core-processors-datasheet-volume-1-of-2/005/pci-express-support/
> >>   * https://gitlab.freedesktop.org/drm/amd/-/issues/2663
> >> + *
> >> + * AMD Phenom II X6 1090T has a similar issue
> >> + * https://gitlab.freedesktop.org/drm/amd/-/issues/4098
> >> */
> >> static bool amdgpu_device_pcie_dynamic_switching_supported(struct amdgpu_device *adev)
> >> {
> >> @@ -1866,6 +1869,8 @@ static bool amdgpu_device_pcie_dynamic_switching_supported(struct amdgpu_device
> >>
> >>         if (c->x86_vendor == X86_VENDOR_INTEL)
> >>                 return false;
> >> +       if (c->x86_vendor == X86_VENDOR_AMD && !cpu_feature_enabled(X86_FEATURE_ZEN))
> >> +               return false;
> >>  #endif
> >>         return true;
> >>  }
> >> --
> >> 2.43.0
> >>
>