[RFC v2 00/15] Enhance device state machine to better support suspend/resume

Thu Jan 23 00:02:55 UTC 2025

Is the latest version of this patch series (with possible fixes based on to
comments) however maintained/available on some git tree for testing?

On Sun, Jan 19, 2025 at 10:28 PM Zhang, Hawking <Hawking.Zhang at amd.com>
wrote:

> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Thanks for the patches.
>
> We currently have no plans to include RAS programming as part of the
> device suspend/resume sequence. Instead, we are focusing on a series of
> clean up patches and a new RAS software module that will eliminate all
> legacy code/workarounds and also the changes you proposed here. It is not
> necessary to make this interim change in the upstream.
>
> Regards,
> Hawking
>
> -----Original Message-----
> From: Jiang Liu <gerry at linux.alibaba.com>
> Sent: Monday, January 13, 2025 09:42
> To: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian <
> Christian.Koenig at amd.com>; Pan, Xinhui <Xinhui.Pan at amd.com>;
> airlied at gmail.com; simona at ffwll.ch; Khatri, Sunil <Sunil.Khatri at amd.com>;
> Lazar, Lijo <Lijo.Lazar at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>;
> Limonciello, Mario <Mario.Limonciello at amd.com>; Chen, Xiaogang <
> Xiaogang.Chen at amd.com>; Russell, Kent <Kent.Russell at amd.com>;
> shuox.liu at linux.alibaba.com; amd-gfx at lists.freedesktop.org
> Cc: Jiang Liu <gerry at linux.alibaba.com>
> Subject: [RFC v2 00/15] Enhance device state machine to better support
> suspend/resume
>
> Recently we were testing suspend/resume functionality with AMD GPUs, we
> have encountered several resource tracking related bugs, such as double
> buffer free, use after free and unbalanced irq reference count.
>
> We have tried to solve these issues case by case, but found that may not
> be the right way. Especially about the unbalanced irq reference count,
> there will be new issues appear once we fixed the current known issues.
> After analyzing related source code, we found that there may be some
> fundamental implementation flaws behind these resource tracking issues.
>
> The amdgpu driver has two major state machines to driver the device
> management flow, one is for ip blocks, the other is for ras blocks.
> The hook points defined in struct amd_ip_funcs for device setup/teardown
> are symmetric, but the implementation is asymmetric, sometime even
> ambiguous. The most obvious two issues we noticed are:
> 1) amdgpu_irq_get() are called from .late_init() but amdgpu_irq_put()
>    are called from .hw_fini() instead of .early_fini().
> 2) the way to reset ip_bloc.status.valid/sw/hw/late_initialized doesn't
>    match the way to set those flags.
>
> When taking device suspend/resume into account, in addition to device
> probe/remove, things get much more complex. Some issues arise because many
> suspend/resume implementations directly reuse .hw_init/.hw_fini/ .late_init
> hook points.
>
> So we try to fix those issues by two enhancements/refinements to current
> device management state machines.
>
> The first change is to make the ip block state machine and associated
> status flags work in stack-like way as below:
> Callbacks                    State after successfully execute callback
>                              AMDGPU_IP_STATE_INVALID
> .early_init()                AMDGPU_IP_STATE_EARLY
> .sw_init()                   AMDGPU_IP_STATE_SW
> .hw_init()                   AMDGPU_IP_STATE_HW
> .late_init()                 AMDGPU_IP_STATE_LATE
> .early_fini()                AMDGPU_IP_STATE_HW
> .hw_fini()                   AMDGPU_IP_STATE_SW
> .sw_fini()                   AMDGPU_IP_STATE_EARLY
> .late_fini()                 AMDGPU_IP_STATE_INVALID
>
> Also do the same thing for ras block state machine, though it's much more
> simpler.
>
> The second change is fine tune the overall device management work flow as
> below:
> 1. amdgpu_driver_load_kms()
>         amdgpu_device_init()
>                 amdgpu_device_ip_early_init()
>                         ip_blocks[i].early_init()
>                         ip_blocks[i].status.valid = true
>                 amdgpu_device_ip_init()
>                         amdgpu_ras_init()
>                         ip_blocks[i].sw_init()
>                         ip_blocks[i].status.sw = true
>                         ip_blocks[i].hw_init()
>                         ip_blocks[i].status.hw = true
>                 amdgpu_device_ip_late_init()
>                         ip_blocks[i].late_init()
>                         ip_blocks[i].status.late_initialized = true
>                         amdgpu_ras_late_init()
>                                 ras_blocks[i].ras_late_init()
>                                         amdgpu_ras_feature_enable_on_boot()
>
> 2. amdgpu_pmops_suspend()/amdgpu_pmops_freeze()/amdgpu_pmops_poweroff()
>         amdgpu_device_suspend()
>                 amdgpu_ras_early_fini()
>                         ras_blocks[i].ras_early_fini()
>                                 amdgpu_ras_feature_disable()
>                 amdgpu_ras_suspend()
>                         amdgpu_ras_disable_all_features()
> +++             ip_blocks[i].early_fini()
> +++             ip_blocks[i].status.late_initialized = false
>                 ip_blocks[i].suspend()
>
> 3. amdgpu_pmops_resume()/amdgpu_pmops_thaw()/amdgpu_pmops_restore()
>         amdgpu_device_resume()
>                 amdgpu_device_ip_resume()
>                         ip_blocks[i].resume()
>                 amdgpu_device_ip_late_init()
>                         ip_blocks[i].late_init()
>                         ip_blocks[i].status.late_initialized = true
>                         amdgpu_ras_late_init()
>                                 ras_blocks[i].ras_late_init()
>                                         amdgpu_ras_feature_enable_on_boot()
>                 amdgpu_ras_resume()
>                         amdgpu_ras_enable_all_features()
>
> 4. amdgpu_driver_unload_kms()
>         amdgpu_device_fini_hw()
>                 amdgpu_ras_early_fini()
>                         ras_blocks[i].ras_early_fini()
> +++             ip_blocks[i].early_fini()
> +++             ip_blocks[i].status.late_initialized = false
>                 ip_blocks[i].hw_fini()
>                 ip_blocks[i].status.hw = false
>
> 5. amdgpu_driver_release_kms()
>         amdgpu_device_fini_sw()
>                 amdgpu_device_ip_fini()
>                         ip_blocks[i].sw_fini()
>                         ip_blocks[i].status.sw = false
> ---                     ip_blocks[i].status.valid = false
> +++                     amdgpu_ras_fini()
>                         ip_blocks[i].late_fini()
> +++                     ip_blocks[i].status.valid = false
> ---                     ip_blocks[i].status.late_initialized = false
> ---                     amdgpu_ras_fini()
>
> The main changes include:
> 1) invoke ip_blocks[i].early_fini in amdgpu_pmops_suspend().
>    Currently there's only one ip block which provides `early_fini`
>    callback. We have add a check of `in_s3` to keep current behavior in
>    function amdgpu_dm_early_fini(). So there should be no functional
>    changes.
> 2) set ip_blocks[i].status.late_initialized to false after calling
>    callback `early_fini`. We have auditted all usages of the
>    late_initialized flag and no functional changes found.
> 3) only set ip_blocks[i].status.valid = false after calling the
>    `late_fini` callback.
> 4) call amdgpu_ras_fini() before invoking ip_blocks[i].late_fini.
>
> Then we try to refine each subsystem, such as nbio, asic etc, to follow
> the new design. Currently we have only taken the nbio and asic as examples
> to show the proposed changes. Once we have confirmed that's the right way
> to go, we will handle the lefting subsystems.
>
> This is in early stage and requesting for comments, any comments and
> suggestions are welcomed!
>
>
> v2:
> - remove patch 1 in v1, it already got merged
> - convert status bool flags for ip block into enum
> - introduce iterators to walk ip blocks
> - refine the way to define status markers
> - split amdgpu_dm related change into a dedicated patch
> - add patch 13 to walk ip blocks in reverse order when shutdown
>
> Jiang Liu (15):
>   drm/amdgpu: add helper functions to track status for ras manager
>   drm/amdgpu: add a flag to track ras debugfs creation status
>   drm/amdgpu: free all resources on error recovery path of
>     amdgpu_ras_init()
>   drm/amdgpu: introduce a flag to track refcount held for features
>   drm/amdgpu: enhance amdgpu_ras_block_late_fini()
>   drm/amdgpu: enhance amdgpu_ras_pre_fini() to better support SR
>   drm/admgpu: rename amdgpu_ras_pre_fini() to amdgpu_ras_early_fini()
>   drm/amdgpu: make IP block state machine works in stack like way
>   drm/amdgpu_dm: enhance amdgpu_dm_early_fini() for PM ops
>   drm/admgpu: make device state machine work in stack like way
>   drm/amdgpu: convert ip block bool flags into an enum
>   drm/amdgpu: introduce IP block iterators to reduce duplicated code
>   drm/amdgpu: walk IP blocks in reverse order when shutdown
>   drm/amdgpu/nbio: improve the way to manage irq reference count
>   drm/amdgpu/asic: make ip block operations symmetric by .early_fini()
>
>  drivers/gpu/drm/amd/amdgpu/aldebaran.c        |  46 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 109 +++-
>  .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c  |   3 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 504 +++++++++---------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c       |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  10 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c      |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c       |  18 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c      |  16 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h      |   1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c       |   4 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       | 142 +++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       |  16 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c     |  14 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c      |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c       |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c       |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c      |   2 +-
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c       |   2 +-
>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c      |   2 +-
>  drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c       |   2 +-
>  drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c        |   1 +
>  drivers/gpu/drm/amd/amdgpu/nbio_v7_9.c        |   1 +
>  drivers/gpu/drm/amd/amdgpu/nv.c               |  14 +-
>  drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c   |  50 +-
>  drivers/gpu/drm/amd/amdgpu/smu_v13_0_10.c     |  51 +-
>  drivers/gpu/drm/amd/amdgpu/soc15.c            |  38 +-
>  drivers/gpu/drm/amd/amdgpu/soc21.c            |  35 +-
>  drivers/gpu/drm/amd/amdgpu/soc24.c            |  22 +-
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c       |   2 +-
>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |   3 +
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c     |  10 +-
>  32 files changed, 668 insertions(+), 460 deletions(-)
>
> --
> 2.43.5
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20250122/fe089481/attachment-0001.htm>