[PATCH 02/23] drm/amdgpu: do RAS init in NPS mode switch

Zhang, Hawking Hawking.Zhang at amd.com
Tue Nov 12 01:23:30 UTC 2024


The "in NPS mode" check is somewhat confusing. We'd like to differentiate recovery (*after* reset) from regular initialization.

Is it possible to replace the "in NPS mode" check with a more general approach? In regular initialization, mark the RAS interface as available in IP late init; in recovery, set the flag once recovery has completed.
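To make the suggestion concrete, here is a minimal sketch of that split, written as a standalone model rather than driver code. The struct and the mock_* function names are hypothetical stand-ins; only the idea (late init sets the flag outside reset, the recovery path sets it on completion) comes from the text above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the device state; NOT the real struct amdgpu_device. */
struct mock_dev {
	bool in_reset;          /* models what amdgpu_in_reset() reports */
	bool error_query_ready; /* models the flag set via amdgpu_ras_set_error_query_ready() */
};

/* Regular initialization: late init marks the RAS interface as available,
 * but only when the device is not in the middle of a reset. */
static void mock_ip_late_init(struct mock_dev *dev)
{
	if (!dev->in_reset)
		dev->error_query_ready = true;
}

/* Recovery: late init runs again while in reset and leaves the flag alone;
 * the flag is set only once recovery has completed. No NPS-specific check
 * is needed in this model, whatever triggered the reset. */
static void mock_recover(struct mock_dev *dev)
{
	dev->in_reset = true;
	dev->error_query_ready = false;

	mock_ip_late_init(dev);         /* flag stays false during reset */

	dev->in_reset = false;
	dev->error_query_ready = true;  /* recovery completed */
}
```

In this model, a fresh device gets the flag from mock_ip_late_init(), and any reset, NPS-triggered or not, gets it back at the end of mock_recover().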

Regards,
Hawking

-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Tao Zhou
Sent: Friday, November 8, 2024 7:14 PM
To: amd-gfx at lists.freedesktop.org
Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
Subject: [PATCH 02/23] drm/amdgpu: do RAS init in NPS mode switch

An NPS mode switch triggers a GPU reset, but this differs from a normal reset.

Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 11 +++++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d69fcbb28b0e..635f020f8d9c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3293,7 +3293,7 @@ static int amdgpu_device_ip_late_init(struct amdgpu_device *adev)
                return r;
        }

-       if (!amdgpu_in_reset(adev))
+       if (!amdgpu_in_reset(adev) || amdgpu_in_nps_switch(adev))
                amdgpu_ras_set_error_query_ready(adev, true);

        amdgpu_device_set_cg_state(adev, AMD_CG_STATE_GATE);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index de1a55ae1d78..cbecf2380b51 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1253,7 +1253,8 @@ int amdgpu_ras_bind_aca(struct amdgpu_device *adev, enum amdgpu_ras_block blk,
        struct ras_manager *obj;

        /* in resume phase, no need to create aca fs node */
-       if (adev->in_suspend || amdgpu_in_reset(adev))
+       if (adev->in_suspend ||
+           (amdgpu_in_reset(adev) && !amdgpu_in_nps_switch(adev)))
                return 0;

        obj = get_ras_manager(adev, blk);
@@ -3780,7 +3781,8 @@ int amdgpu_ras_block_late_init(struct amdgpu_device *adev,

        r = amdgpu_ras_feature_enable_on_boot(adev, ras_block, 1);
        if (r) {
-               if (adev->in_suspend || amdgpu_in_reset(adev)) {
+               if (adev->in_suspend ||
+                   (amdgpu_in_reset(adev) && !amdgpu_in_nps_switch(adev))) {
                        /* in resume phase, if fail to enable ras,
                         * clean up all ras fs nodes, and disable ras */
                        goto cleanup;
@@ -3792,7 +3794,8 @@ int amdgpu_ras_block_late_init(struct amdgpu_device *adev,
        amdgpu_persistent_edc_harvesting(adev, ras_block);

        /* in resume phase, no need to create ras fs node */
-       if (adev->in_suspend || amdgpu_in_reset(adev))
+       if (adev->in_suspend ||
+           (amdgpu_in_reset(adev) && !amdgpu_in_nps_switch(adev)))
                return 0;

        ras_obj = container_of(ras_block, struct amdgpu_ras_block_object, ras_comm);
@@ -3922,7 +3925,7 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev)
        amdgpu_ras_event_mgr_init(adev);

        if (amdgpu_ras_aca_is_supported(adev)) {
-               if (amdgpu_in_reset(adev)) {
+               if (amdgpu_in_reset(adev) && !amdgpu_in_nps_switch(adev)) {
                        if (amdgpu_aca_is_enabled(adev))
                                r = amdgpu_aca_reset(adev);
                        else
--
2.34.1
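
The condition that the patch open-codes three times (once in amdgpu_ras_bind_aca() and twice in amdgpu_ras_block_late_init()) can be read as a single predicate. The sketch below is only an illustration of that logic with plain booleans; the helper name is hypothetical and is not part of the patch or the driver.

```c
#include <stdbool.h>

/* Hypothetical helper: true when the resume-style shortcut applies, i.e.
 * during suspend/resume, or during a reset that is NOT an NPS mode switch.
 * The three parameters model adev->in_suspend, amdgpu_in_reset(adev) and
 * amdgpu_in_nps_switch(adev) respectively. */
static bool mock_ras_is_resume_phase(bool in_suspend, bool in_reset,
				     bool in_nps_switch)
{
	return in_suspend || (in_reset && !in_nps_switch);
}
```

With this reading, a reset caused by an NPS mode switch evaluates to false and therefore goes through the full RAS init path, while an ordinary reset or a resume still takes the early return.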


