[RFC v2 00/15] Enhance device state machine to better support suspend/resume
Mika Laitio
lamikr at gmail.com
Thu Jan 23 00:02:55 UTC 2025
Is the latest version of this patch series (with possible fixes based on to
comments) however maintained/available on some git tree for testing?
On Sun, Jan 19, 2025 at 10:28 PM Zhang, Hawking <Hawking.Zhang at amd.com>
wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Thanks for the patches.
>
> We currently have no plans to include RAS programming as part of the
> device suspend/resume sequence. Instead, we are focusing on a series of
> clean up patches and a new RAS software module that will eliminate all
> legacy code/workarounds and also the changes you proposed here. It is not
> necessary to make this interim change in the upstream.
>
> Regards,
> Hawking
>
> -----Original Message-----
> From: Jiang Liu <gerry at linux.alibaba.com>
> Sent: Monday, January 13, 2025 09:42
> To: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian <
> Christian.Koenig at amd.com>; Pan, Xinhui <Xinhui.Pan at amd.com>;
> airlied at gmail.com; simona at ffwll.ch; Khatri, Sunil <Sunil.Khatri at amd.com>;
> Lazar, Lijo <Lijo.Lazar at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>;
> Limonciello, Mario <Mario.Limonciello at amd.com>; Chen, Xiaogang <
> Xiaogang.Chen at amd.com>; Russell, Kent <Kent.Russell at amd.com>;
> shuox.liu at linux.alibaba.com; amd-gfx at lists.freedesktop.org
> Cc: Jiang Liu <gerry at linux.alibaba.com>
> Subject: [RFC v2 00/15] Enhance device state machine to better support
> suspend/resume
>
> Recently we were testing suspend/resume functionality with AMD GPUs, we
> have encountered several resource tracking related bugs, such as double
> buffer free, use after free and unbalanced irq reference count.
>
> We have tried to solve these issues case by case, but found that may not
> be the right way. Especially about the unbalanced irq reference count,
> there will be new issues appear once we fixed the current known issues.
> After analyzing related source code, we found that there may be some
> fundamental implementation flaws behind these resource tracking issues.
>
> The amdgpu driver has two major state machines to driver the device
> management flow, one is for ip blocks, the other is for ras blocks.
> The hook points defined in struct amd_ip_funcs for device setup/teardown
> are symmetric, but the implementation is asymmetric, sometime even
> ambiguous. The most obvious two issues we noticed are:
> 1) amdgpu_irq_get() are called from .late_init() but amdgpu_irq_put()
> are called from .hw_fini() instead of .early_fini().
> 2) the way to reset ip_bloc.status.valid/sw/hw/late_initialized doesn't
> match the way to set those flags.
>
> When taking device suspend/resume into account, in addition to device
> probe/remove, things get much more complex. Some issues arise because many
> suspend/resume implementations directly reuse .hw_init/.hw_fini/ .late_init
> hook points.
>
> So we try to fix those issues by two enhancements/refinements to current
> device management state machines.
>
> The first change is to make the ip block state machine and associated
> status flags work in stack-like way as below:
> Callbacks State after successfully execute callback
> AMDGPU_IP_STATE_INVALID
> .early_init() AMDGPU_IP_STATE_EARLY
> .sw_init() AMDGPU_IP_STATE_SW
> .hw_init() AMDGPU_IP_STATE_HW
> .late_init() AMDGPU_IP_STATE_LATE
> .early_fini() AMDGPU_IP_STATE_HW
> .hw_fini() AMDGPU_IP_STATE_SW
> .sw_fini() AMDGPU_IP_STATE_EARLY
> .late_fini() AMDGPU_IP_STATE_INVALID
>
> Also do the same thing for ras block state machine, though it's much more
> simpler.
>
> The second change is fine tune the overall device management work flow as
> below:
> 1. amdgpu_driver_load_kms()
> amdgpu_device_init()
> amdgpu_device_ip_early_init()
> ip_blocks[i].early_init()
> ip_blocks[i].status.valid = true
> amdgpu_device_ip_init()
> amdgpu_ras_init()
> ip_blocks[i].sw_init()
> ip_blocks[i].status.sw = true
> ip_blocks[i].hw_init()
> ip_blocks[i].status.hw = true
> amdgpu_device_ip_late_init()
> ip_blocks[i].late_init()
> ip_blocks[i].status.late_initialized = true
> amdgpu_ras_late_init()
> ras_blocks[i].ras_late_init()
> amdgpu_ras_feature_enable_on_boot()
>
> 2. amdgpu_pmops_suspend()/amdgpu_pmops_freeze()/amdgpu_pmops_poweroff()
> amdgpu_device_suspend()
> amdgpu_ras_early_fini()
> ras_blocks[i].ras_early_fini()
> amdgpu_ras_feature_disable()
> amdgpu_ras_suspend()
> amdgpu_ras_disable_all_features()
> +++ ip_blocks[i].early_fini()
> +++ ip_blocks[i].status.late_initialized = false
> ip_blocks[i].suspend()
>
> 3. amdgpu_pmops_resume()/amdgpu_pmops_thaw()/amdgpu_pmops_restore()
> amdgpu_device_resume()
> amdgpu_device_ip_resume()
> ip_blocks[i].resume()
> amdgpu_device_ip_late_init()
> ip_blocks[i].late_init()
> ip_blocks[i].status.late_initialized = true
> amdgpu_ras_late_init()
> ras_blocks[i].ras_late_init()
> amdgpu_ras_feature_enable_on_boot()
> amdgpu_ras_resume()
> amdgpu_ras_enable_all_features()
>
> 4. amdgpu_driver_unload_kms()
> amdgpu_device_fini_hw()
> amdgpu_ras_early_fini()
> ras_blocks[i].ras_early_fini()
> +++ ip_blocks[i].early_fini()
> +++ ip_blocks[i].status.late_initialized = false
> ip_blocks[i].hw_fini()
> ip_blocks[i].status.hw = false
>
> 5. amdgpu_driver_release_kms()
> amdgpu_device_fini_sw()
> amdgpu_device_ip_fini()
> ip_blocks[i].sw_fini()
> ip_blocks[i].status.sw = false
> --- ip_blocks[i].status.valid = false
> +++ amdgpu_ras_fini()
> ip_blocks[i].late_fini()
> +++ ip_blocks[i].status.valid = false
> --- ip_blocks[i].status.late_initialized = false
> --- amdgpu_ras_fini()
>
> The main changes include:
> 1) invoke ip_blocks[i].early_fini in amdgpu_pmops_suspend().
> Currently there's only one ip block which provides `early_fini`
> callback. We have add a check of `in_s3` to keep current behavior in
> function amdgpu_dm_early_fini(). So there should be no functional
> changes.
> 2) set ip_blocks[i].status.late_initialized to false after calling
> callback `early_fini`. We have auditted all usages of the
> late_initialized flag and no functional changes found.
> 3) only set ip_blocks[i].status.valid = false after calling the
> `late_fini` callback.
> 4) call amdgpu_ras_fini() before invoking ip_blocks[i].late_fini.
>
> Then we try to refine each subsystem, such as nbio, asic etc, to follow
> the new design. Currently we have only taken the nbio and asic as examples
> to show the proposed changes. Once we have confirmed that's the right way
> to go, we will handle the lefting subsystems.
>
> This is in early stage and requesting for comments, any comments and
> suggestions are welcomed!
>
>
> v2:
> - remove patch 1 in v1, it already got merged
> - convert status bool flags for ip block into enum
> - introduce iterators to walk ip blocks
> - refine the way to define status markers
> - split amdgpu_dm related change into a dedicated patch
> - add patch 13 to walk ip blocks in reverse order when shutdown
>
> Jiang Liu (15):
> drm/amdgpu: add helper functions to track status for ras manager
> drm/amdgpu: add a flag to track ras debugfs creation status
> drm/amdgpu: free all resources on error recovery path of
> amdgpu_ras_init()
> drm/amdgpu: introduce a flag to track refcount held for features
> drm/amdgpu: enhance amdgpu_ras_block_late_fini()
> drm/amdgpu: enhance amdgpu_ras_pre_fini() to better support SR
> drm/admgpu: rename amdgpu_ras_pre_fini() to amdgpu_ras_early_fini()
> drm/amdgpu: make IP block state machine works in stack like way
> drm/amdgpu_dm: enhance amdgpu_dm_early_fini() for PM ops
> drm/admgpu: make device state machine work in stack like way
> drm/amdgpu: convert ip block bool flags into an enum
> drm/amdgpu: introduce IP block iterators to reduce duplicated code
> drm/amdgpu: walk IP blocks in reverse order when shutdown
> drm/amdgpu/nbio: improve the way to manage irq reference count
> drm/amdgpu/asic: make ip block operations symmetric by .early_fini()
>
> drivers/gpu/drm/amd/amdgpu/aldebaran.c | 46 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 109 +++-
> .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 3 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 504 +++++++++---------
> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 10 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 18 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c | 16 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h | 1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 142 +++--
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 16 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 1 +
> drivers/gpu/drm/amd/amdgpu/nbio_v7_9.c | 1 +
> drivers/gpu/drm/amd/amdgpu/nv.c | 14 +-
> drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c | 50 +-
> drivers/gpu/drm/amd/amdgpu/smu_v13_0_10.c | 51 +-
> drivers/gpu/drm/amd/amdgpu/soc15.c | 38 +-
> drivers/gpu/drm/amd/amdgpu/soc21.c | 35 +-
> drivers/gpu/drm/amd/amdgpu/soc24.c | 22 +-
> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 2 +-
> .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +
> drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 10 +-
> 32 files changed, 668 insertions(+), 460 deletions(-)
>
> --
> 2.43.5
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20250122/fe089481/attachment-0001.htm>
More information about the amd-gfx
mailing list