[PATCH 02/10] drm/amdgpu: Use init level for pending_reset flag

Lazar, Lijo lijo.lazar at amd.com
Wed Sep 4 03:23:46 UTC 2024



On 9/4/2024 7:40 AM, Xu, Feifei wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> Comment inline.
> 
> Thanks,
> Feifei
> 
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Lijo Lazar
> Sent: Monday, September 2, 2024 3:34 PM
> To: amd-gfx at lists.freedesktop.org
> Cc: Zhang, Hawking <Hawking.Zhang at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
> Subject: [PATCH 02/10] drm/amdgpu: Use init level for pending_reset flag
> 
> Drop pending_reset flag in gmc block. Instead use init level to determine which type of init is preferred - in this case MINIMAL.
> 
> Signed-off-by: Lijo Lazar <lijo.lazar at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 33 ++++---------------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  1 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h       |  1 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c      |  6 ++--
>  .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c    |  3 +-
>  6 files changed, 13 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 4fb09c4fbf22..db5046e8b10d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1691,7 +1691,7 @@ bool amdgpu_device_need_post(struct amdgpu_device *adev)
>         }
> 
>         /* Don't post if we need to reset whole hive on init */
> -       if (adev->gmc.xgmi.pending_reset)
> +       if (adev->init_lvl->level == AMDGPU_INIT_LEVEL_MINIMAL)
>                 return false;
> 
>         if (adev->has_hw_reset) {
> @@ -2985,7 +2985,7 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>                 amdgpu_ttm_set_buffer_funcs_status(adev, true);
> 
>         /* Don't init kfd if whole hive need to be reset during init */
> -       if (!adev->gmc.xgmi.pending_reset) {
> +       if (adev->init_lvl->level != AMDGPU_INIT_LEVEL_MINIMAL) {
>                 kgd2kfd_init_zone_device(adev);
>                 amdgpu_amdkfd_device_init(adev);
>         }
> @@ -3499,14 +3499,9 @@ static int amdgpu_device_ip_suspend_phase2(struct amdgpu_device *adev)
>                 }
> 
>                 /* skip unnecessary suspend if we do not initialize them yet */
> -               if (adev->gmc.xgmi.pending_reset &&
> -                   !(adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC ||
> -                     adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_SMC ||
> -                     adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_COMMON ||
> -                     adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_IH)) {
> -                       adev->ip_blocks[i].status.hw = false;
> +               if (!amdgpu_ip_member_of_hwini(
> +                           adev, adev->ip_blocks[i].version->type))
>                         continue;
> -               }
> 
> [Feifei]:  AMDGPU_INIT_LEVEL_MINIMAL indicate the minimal necessary blocks which need to do hw_init if SMC need to handle the mode1 reset. Though in newer ASICs it is smc doing the reset, in some old one, it is MP0.
>                Is it more readable if we use naming like AMDGPU_INIT_LEVEL_MINIMAL_SMC to avoid confusion ?

Original intention for levels is like -

	Define a single 'minimal' level init required for the SOC. Further
levels like suspend, s0i3, emulation/simulation etc. may be introduced
later which defines the level of initialization required for those
scenarios. Basically, the idea was to make it SOC specific with a callback.

It is kept this way now as the immediate purpose is to support 'minimal'
init required for XGMI-reset-on-init scenario for limited SOCs. In that
sense, this could be renamed that way also.

> 
> 
>                 /* skip suspend of gfx/mes and psp for S0ix
>                  * gfx is in gfxoff state, so on resume it will exit gfxoff just @@ -4320,20 +4315,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>         if (!amdgpu_sriov_vf(adev) && amdgpu_asic_need_reset_on_init(adev)) {
>                 if (adev->gmc.xgmi.num_physical_nodes) {
>                         dev_info(adev->dev, "Pending hive reset.\n");
> -                       adev->gmc.xgmi.pending_reset = true;
> -                       /* Only need to init necessary block for SMU to handle the reset */
> -                       for (i = 0; i < adev->num_ip_blocks; i++) {
> -                               if (!adev->ip_blocks[i].status.valid)
> -                                       continue;
> -                               if (!(adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC ||
> -                                     adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_COMMON ||
> -                                     adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_IH ||
> -                                     adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_SMC)) {
> -                                       DRM_DEBUG("IP %s disabled for hw_init.\n",
> -                                               adev->ip_blocks[i].version->funcs->name);
> -                                       adev->ip_blocks[i].status.hw = true;
> -                               }
> -                       }
> +                       amdgpu_set_init_level(adev, AMDGPU_INIT_LEVEL_MINIMAL);
>                 } else if (amdgpu_ip_version(adev, MP1_HWIP, 0) == IP_VERSION(13, 0, 10) &&
>                                    !amdgpu_device_has_display_hardware(adev)) {
>                                         r = psp_gpu_reset(adev);
> @@ -4441,7 +4423,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>         /* enable clockgating, etc. after ib tests, etc. since some blocks require
>          * explicit gating rather than handling it automatically.
>          */
> -       if (!adev->gmc.xgmi.pending_reset) {
> +       if (adev->init_lvl->level != AMDGPU_INIT_LEVEL_MINIMAL) {
>                 r = amdgpu_device_ip_late_init(adev);
>                 if (r) {
>                         dev_err(adev->dev, "amdgpu_device_ip_late_init failed\n"); @@ -4518,7 +4500,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>         if (px)
>                 vga_switcheroo_init_domain_pm_ops(adev->dev, &adev->vga_pm_domain);
> 
> -       if (adev->gmc.xgmi.pending_reset)
> +       if (adev->init_lvl->level == AMDGPU_INIT_LEVEL_MINIMAL)
>                 queue_delayed_work(system_wq, &mgpu_info.delayed_reset_work,
>                                    msecs_to_jiffies(AMDGPU_RESUME_MS));
> 
> @@ -5490,7 +5472,6 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>                 list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
>                         /* For XGMI run all resets in parallel to speed up the process */
>                         if (tmp_adev->gmc.xgmi.num_physical_nodes > 1) {
> -                               tmp_adev->gmc.xgmi.pending_reset = false;
>                                 if (!queue_work(system_unbound_wq, &tmp_adev->xgmi_reset_work))
>                                         r = -EALREADY;
>                         } else
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 82bde5132dc6..3dece2e69608 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2495,7 +2495,6 @@ static void amdgpu_drv_delayed_reset_work_handler(struct work_struct *work)
>         for (i = 0; i < mgpu_info.num_dgpu; i++) {
>                 adev = mgpu_info.gpu_ins[i].adev;
>                 flush_work(&adev->xgmi_reset_work);
> -               adev->gmc.xgmi.pending_reset = false;
>         }
> 
>         /* reset function will rebuild the xgmi hive info , clear it now */ diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> index 4d951a1baefa..33b2adffd58b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> @@ -182,7 +182,6 @@ struct amdgpu_xgmi {
>         bool supported;
>         struct ras_common_if *ras_if;
>         bool connected_to_cpu;
> -       bool pending_reset;
>         struct amdgpu_xgmi_ras *ras;
>  };
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 61a2f386d9fb..2076f157cb6a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -3185,7 +3185,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
>          * when the GPU is pending on XGMI reset during probe time
>          * (Mostly after second bus reset), skip it now
>          */
> -       if (adev->gmc.xgmi.pending_reset)
> +       if (adev->init_lvl->level == AMDGPU_INIT_LEVEL_MINIMAL)
>                 return 0;
>         ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
>         /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index 7de449fae1e3..a7a892512cb9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -860,7 +860,7 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
>         if (!adev->gmc.xgmi.supported)
>                 return 0;
> 
> -       if (!adev->gmc.xgmi.pending_reset &&
> +       if ((adev->init_lvl->level != AMDGPU_INIT_LEVEL_MINIMAL) &&
>             amdgpu_device_ip_get_ip_block(adev, AMD_IP_BLOCK_TYPE_PSP)) {
>                 ret = psp_xgmi_initialize(&adev->psp, false, true);
>                 if (ret) {
> @@ -907,7 +907,7 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
> 
>         task_barrier_add_task(&hive->tb);
> 
> -       if (!adev->gmc.xgmi.pending_reset &&
> +       if ((adev->init_lvl->level != AMDGPU_INIT_LEVEL_MINIMAL) &&
>             amdgpu_device_ip_get_ip_block(adev, AMD_IP_BLOCK_TYPE_PSP)) {
>                 list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
>                         /* update node list for other device in the hive */ @@ -985,7 +985,7 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
>                 }
>         }
> 
> -       if (!ret && !adev->gmc.xgmi.pending_reset)
> +       if (!ret && (adev->init_lvl->level != AMDGPU_INIT_LEVEL_MINIMAL))
>                 ret = amdgpu_xgmi_sysfs_add_dev_info(adev, hive);
> 
>  exit_unlock:
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> index 16fcd9dcd202..7225f63c26b4 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> @@ -1616,7 +1616,8 @@ int smu_v11_0_baco_set_state(struct smu_context *smu, enum smu_baco_state state)
>                         break;
>                 default:
>                         if (!ras || !adev->ras_enabled ||
> -                           adev->gmc.xgmi.pending_reset) {
> +                           (adev->init_lvl->level !=
> +                            AMDGPU_INIT_LEVEL_MINIMAL)) {
> 
> 
> [Feifei] Is it a typo? If adev->init_lvl->level taking place of adev->gmc.xgmi.pending_reset, here should be(adev->init_lvl->level == AMDGPU_INIT_LEVEL_MINIMAL)

Yes, this is incorrect. Thanks for catching.

Thanks,
Lijo
> 
>                                 if (amdgpu_ip_version(adev, MP1_HWIP, 0) ==
>                                     IP_VERSION(11, 0, 2)) {
>                                         data = RREG32_SOC15(THM, 0, mmTHM_BACO_CNTL_ARCT);
> --
> 2.25.1
> 


More information about the amd-gfx mailing list