Patch "drm/amdgpu: handle the case of pci_channel_io_frozen only in amdgpu_pci_resume" has been added to the 5.10-stable tree

gregkh at linuxfoundation.org gregkh at linuxfoundation.org
Sun Aug 24 08:06:20 UTC 2025


This is a note to let you know that I've just added the patch titled

    drm/amdgpu: handle the case of pci_channel_io_frozen only in amdgpu_pci_resume

to the 5.10-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     drm-amdgpu-handle-the-case-of-pci_channel_io_frozen-only-in-amdgpu_pci_resume.patch
and it can be found in the queue-5.10 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable at vger.kernel.org> know about it.


>From stable+bounces-167102-greg=kroah.com at vger.kernel.org Tue Aug 12 08:37:31 2025
From: Shivani Agarwal <shivani.agarwal at broadcom.com>
Date: Mon, 11 Aug 2025 23:23:49 -0700
Subject: drm/amdgpu: handle the case of pci_channel_io_frozen only in amdgpu_pci_resume
To: stable at vger.kernel.org, gregkh at linuxfoundation.org
Cc: bcm-kernel-feedback-list at broadcom.com, linux-kernel at vger.kernel.org, ajay.kaher at broadcom.com, alexey.makhalov at broadcom.com, tapas.kundu at broadcom.com, alexander.deucher at amd.com, christian.koenig at amd.com, airlied at gmail.com, simona at ffwll.ch, lijo.lazar at amd.com, mario.limonciello at amd.com, sunil.khatri at amd.com, srinivasan.shanmugam at amd.com, siqueira at igalia.com, cesun102 at amd.com, linux at treblig.org, zhangzekun11 at huawei.com, andrey.grodzovsky at amd.com, amd-gfx at lists.freedesktop.org, dri-devel at lists.freedesktop.org, Guchun Chen <guchun.chen at amd.com>, Sasha Levin <sashal at kernel.org>, Shivani Agarwal <shivani.agarwal at broadcom.com>
Message-ID: <20250812062349.149549-1-shivani.agarwal at broadcom.com>

From: Guchun Chen <guchun.chen at amd.com>

[ Upstream commit 248b061689a40f4fed05252ee2c89f87cf26d7d8 ]

In the current code, when the PCI error state pci_channel_io_normal is
detected, the driver reports PCI_ERS_RESULT_CAN_RECOVER to the PCI driver.
The PCI driver then continues with the PCI resume callback report_resume
via pci_walk_bridge, which finally calls amdgpu_pci_resume. There the
write lock is released unconditionally, even though it was never acquired
on this path. As a result, a deadlock happens when other threads try to
acquire the read lock.

To fix this, add a member to the amdgpu_device structure to cache
pci_channel_state, and only continue the execution in amdgpu_pci_resume
when it's pci_channel_io_frozen.
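
For illustration only, the idea can be modelled in plain userspace C.
This is a minimal sketch, not the driver code: the model_* names, the
enum and the lock counter are hypothetical stand-ins, and it assumes the
write lock is taken only on the frozen path, as the description above
implies.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's pci_channel_state_t values. */
enum model_channel_state {
	MODEL_IO_NORMAL,
	MODEL_IO_FROZEN,
	MODEL_IO_PERM_FAILURE,
};

/* Hypothetical, reduced model of the device structure. */
struct model_device {
	int write_lock_depth;                  /* > 0 while the write lock is held */
	enum model_channel_state cached_state; /* state seen by error_detected */
};

/*
 * Models the error_detected callback: cache the reported state, and
 * (assumption) take the write lock only when the channel is frozen.
 */
static void model_error_detected(struct model_device *dev,
				 enum model_channel_state state)
{
	dev->cached_state = state;

	if (state == MODEL_IO_FROZEN)
		dev->write_lock_depth++;
}

/*
 * Models the resume callback: bail out unless the cached state is
 * frozen, so the lock is never released on a path that did not take it.
 * Returns true if the lock was released.
 */
static bool model_resume(struct model_device *dev)
{
	if (dev->cached_state != MODEL_IO_FROZEN)
		return false;

	dev->write_lock_depth--;
	return true;
}
```

In this model, calling model_error_detected() with MODEL_IO_NORMAL and
then model_resume() leaves write_lock_depth at 0. Without the early
return, the counter would drop below zero, which corresponds to the
unbalanced release described above.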

Fixes: c9a6b82f45e2 ("drm/amdgpu: Implement DPC recovery")
Suggested-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
Signed-off-by: Guchun Chen <guchun.chen at amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
Signed-off-by: Sasha Levin <sashal at kernel.org>
[Shivani: Modified to apply on 5.10.y]
Signed-off-by: Shivani Agarwal <shivani.agarwal at broadcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh at linuxfoundation.org>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |    1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |    6 ++++++
 2 files changed, 7 insertions(+)

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -997,6 +997,7 @@ struct amdgpu_device {
 
 	bool                            in_pci_err_recovery;
 	struct pci_saved_state          *pci_state;
+	pci_channel_state_t		pci_channel_state;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4944,6 +4944,8 @@ pci_ers_result_t amdgpu_pci_error_detect
 		return PCI_ERS_RESULT_DISCONNECT;
 	}
 
+	adev->pci_channel_state = state;
+
 	switch (state) {
 	case pci_channel_io_normal:
 		return PCI_ERS_RESULT_CAN_RECOVER;
@@ -5079,6 +5081,10 @@ void amdgpu_pci_resume(struct pci_dev *p
 
 	DRM_INFO("PCI error: resume callback!!\n");
 
+	/* Only continue execution for the case of pci_channel_io_frozen */
+	if (adev->pci_channel_state != pci_channel_io_frozen)
+		return;
+
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
 


Patches currently in stable-queue which might be from shivani.agarwal at broadcom.com are

queue-5.10/btrfs-fix-deadlock-when-cloning-inline-extents-and-using-qgroups.patch
queue-5.10/ptp-fix-possible-memory-leak-in-ptp_clock_register.patch
queue-5.10/scsi-pm80xx-fix-memory-leak-during-rmmod.patch
queue-5.10/block-don-t-call-rq_qos_ops-done_bio-if-the-bio-isn-t-tracked.patch
queue-5.10/drm-amdgpu-handle-the-case-of-pci_channel_io_frozen-only-in-amdgpu_pci_resume.patch
queue-5.10/scsi-lpfc-fix-link-down-processing-to-address-null-pointer-dereference.patch
queue-5.10/rdma-rxe-return-cqe-error-if-invalid-lkey-was-supplied.patch

