[PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter

Huang, Trigger Trigger.Huang at amd.com
Fri Aug 16 06:39:04 UTC 2024


[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
> Huang, Trigger
> Sent: Friday, August 16, 2024 2:36 PM
> To: Alex Deucher <alexdeucher at gmail.com>
> Cc: amd-gfx at lists.freedesktop.org; Khatri, Sunil <Sunil.Khatri at amd.com>;
> Deucher, Alexander <Alexander.Deucher at amd.com>
> Subject: RE: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> > -----Original Message-----
> > From: Alex Deucher <alexdeucher at gmail.com>
> > Sent: Friday, August 16, 2024 12:02 AM
> > To: Huang, Trigger <Trigger.Huang at amd.com>
> > Cc: amd-gfx at lists.freedesktop.org; Khatri, Sunil
> > <Sunil.Khatri at amd.com>; Deucher, Alexander
> <Alexander.Deucher at amd.com>
> > Subject: Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
> >
> > On Thu, Aug 15, 2024 at 7:39 AM <Trigger.Huang at amd.com> wrote:
> > >
> > > From: Trigger Huang <Trigger.Huang at amd.com>
> > >
> > > Add new separate parameter to control GPU coredump procedure. This
> > > can be used to decouple the coredump procedure from gpu recovery
> > > procedure
> > >
> > > Signed-off-by: Trigger Huang <Trigger.Huang at amd.com>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
> > >  2 files changed, 9 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > index 937de21a7142..4dd465ad14af 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > @@ -201,6 +201,7 @@ extern uint amdgpu_force_long_training;  extern
> > > int amdgpu_lbpw;  extern int amdgpu_compute_multipipe;  extern int
> > > amdgpu_gpu_recovery;
> > > +extern int amdgpu_gpu_coredump;
> > >  extern int amdgpu_emu_mode;
> > >  extern uint amdgpu_smu_memory_pool_size;  extern int
> > > amdgpu_smu_pptable_id; diff --git
> > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > index b9529948f2b2..c5d357420236 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > @@ -178,6 +178,7 @@ uint amdgpu_force_long_training;  int
> > amdgpu_lbpw
> > > = -1;  int amdgpu_compute_multipipe = -1;  int amdgpu_gpu_recovery =
> > > -1; /* auto */
> > > +int amdgpu_gpu_coredump;
> > >  int amdgpu_emu_mode;
> > >  uint amdgpu_smu_memory_pool_size;
> > >  int amdgpu_smu_pptable_id = -1;
> > > @@ -556,6 +557,13 @@ module_param_named(compute_multipipe,
> > > amdgpu_compute_multipipe, int, 0444);
> > MODULE_PARM_DESC(gpu_recovery,
> > > "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 =
> > > auto)");  module_param_named(gpu_recovery, amdgpu_gpu_recovery,
> int,
> > > 0444);
> > >
> > > +/**
> > > + * DOC: gpu_coredump (int)
> > > + * Set to enable GPU coredump mechanism (1 = enable, 0 = disable).
> > > +The default is 0  */ MODULE_PARM_DESC(gpu_coredump, "Enable GPU
> > > +coredump mechanism, (1 = enable, 0 = disable (default))");
> > > +module_param_named(gpu_coredump, amdgpu_gpu_coredump, int,
> > 0444);
> >
> > I don't think we need a separate parameter for this, although if we
> > do, this would need to be enabled by default.  If it needs to be
> > decoupled from reset, that's fine, but I don't see the need for a separate
> knob.
> >
> > Alex
>
> Hi Alex,
> It is fine to enable it by default
> There are several application scenarios that I can think of.
>         1, Customer may need to do the core dump with gpu_recovery disabled.
> This can be used for GPU hang debug
>         2, Customer may need to disable the core dump with gpu_recovery
> enabled. This can be used for quick GPU recovery, especially for some
> lightweight hangs that can be processed by soft recovery or per ring reset.
>         3, Customer may need to enable the core dump with gpu_recovery
> enabled. This can be used for GPU recovery but record the core dump for
> further check in stress test or system health check.
> It seems not easy to support all the scenarios by only using
> amdgpu_gpu_coredump, right?

Sorry, this is a typo, here I mean amdgpu_gpu_recovery, while not amdgpu_gpu_coredump.

>
> Regards,
> Trigger
> >
> > > +
> > >  /**
> > >   * DOC: emu_mode (int)
> > >   * Set value 1 to enable emulation mode. This is only needed when
> > > running
> > on an emulator. The default is 0 (disabled).
> > > --
> > > 2.34.1
> > >


More information about the amd-gfx mailing list