[PATCH 2/2] drm/amdgpu: Do core dump immediately when job tmo
Huang, Trigger
Trigger.Huang at amd.com
Tue Aug 20 07:30:44 UTC 2024
[AMD Official Use Only - AMD Internal Distribution Only]
> -----Original Message-----
> From: Khatri, Sunil <Sunil.Khatri at amd.com>
> Sent: Monday, August 19, 2024 6:31 PM
> To: Huang, Trigger <Trigger.Huang at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
> Subject: Re: [PATCH 2/2] drm/amdgpu: Do core dump immediately when job
> tmo
>
>
> On 8/19/2024 3:23 PM, Trigger.Huang at amd.com wrote:
> > From: Trigger Huang <Trigger.Huang at amd.com>
> >
> > Do the coredump immediately after a job timeout to get a closer
> > representation of GPU's error status.
> >
> > V2: This will skip printing vram_lost as the GPU reset is not happened
> > yet (Alex)
> >
> > V3: Unconditionally call the core dump as we care about all the reset
> > functions(soft-recovery and queue reset and full adapter reset, Alex)
> >
> > Signed-off-by: Trigger Huang <Trigger.Huang at amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 62
> +++++++++++++++++++++++++
> > 1 file changed, 62 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index c6a1783fc9ef..ebbb1434073e 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -30,6 +30,61 @@
> > #include "amdgpu.h"
> > #include "amdgpu_trace.h"
> > #include "amdgpu_reset.h"
> > +#include "amdgpu_dev_coredump.h"
> > +#include "amdgpu_xgmi.h"
> > +
> > +static void amdgpu_job_do_core_dump(struct amdgpu_device *adev,
> > + struct amdgpu_job *job)
> > +{
> > + int i;
> > +
> > + dev_info(adev->dev, "Dumping IP State\n");
> > + for (i = 0; i < adev->num_ip_blocks; i++) {
> > + if (adev->ip_blocks[i].version->funcs->dump_ip_state)
> > + adev->ip_blocks[i].version->funcs
> > + ->dump_ip_state((void *)adev);
> > + dev_info(adev->dev, "Dumping IP State Completed\n");
> > + }
> > +
> > + amdgpu_coredump(adev, true, false, job); }
> > +
> > +static void amdgpu_job_core_dump(struct amdgpu_device *adev,
> > + struct amdgpu_job *job)
> > +{
> > + struct list_head device_list, *device_list_handle = NULL;
> > + struct amdgpu_device *tmp_adev = NULL;
> > + struct amdgpu_hive_info *hive = NULL;
> > +
> > + if (!amdgpu_sriov_vf(adev))
> > + hive = amdgpu_get_xgmi_hive(adev);
> > + if (hive)
> > + mutex_lock(&hive->hive_lock);
> > + /*
> > + * Reuse the logic in amdgpu_device_gpu_recover() to build list of
> > + * devices for code dump
> > + */
> > + INIT_LIST_HEAD(&device_list);
> > + if (!amdgpu_sriov_vf(adev) && (adev-
> >gmc.xgmi.num_physical_nodes > 1) && hive) {
> > + list_for_each_entry(tmp_adev, &hive->device_list,
> gmc.xgmi.head)
> > + list_add_tail(&tmp_adev->reset_list, &device_list);
> > + if (!list_is_first(&adev->reset_list, &device_list))
> > + list_rotate_to_front(&adev->reset_list, &device_list);
> > + device_list_handle = &device_list;
> > + } else {
> > + list_add_tail(&adev->reset_list, &device_list);
> > + device_list_handle = &device_list;
> > + }
> > +
> > + /* Do the coredump for each device */
> > + list_for_each_entry(tmp_adev, device_list_handle, reset_list)
> > + amdgpu_job_do_core_dump(tmp_adev, job);
> > +
> > + if (hive) {
> > + mutex_unlock(&hive->hive_lock);
> > + amdgpu_put_xgmi_hive(hive);
> > + }
> > +}
> >
> > static enum drm_gpu_sched_stat amdgpu_job_timedout(struct
> drm_sched_job *s_job)
> > {
> > @@ -48,6 +103,7 @@ static enum drm_gpu_sched_stat
> amdgpu_job_timedout(struct drm_sched_job *s_job)
> > return DRM_GPU_SCHED_STAT_ENODEV;
> > }
> >
> > + amdgpu_job_core_dump(adev, job);
> The philosophy is hang and recovery is to let the HW and software try to
> recover. Here we try to do a soft recovery first and i think we should wait for
> seft recovery and if fails then we do dump and thats exactly we are doing here.
Hi Sunil ,
thanks for the suggestion, and that's reasonable. But my concern is that after soft recovery happened, the GPU's status may change(take gfx 9 for example, it will try to kill the current hang wave)
Actually, in most cases, a real shader hang cannot be resolved through soft recovery, and at that moment, we need to get a very close dump/snapshot/representation of GPU's current error status.
Just like the scandump, when we trying to do a scandump for a shader hang, we will disable gpu_recovery, and no soft recovery/per-queue reset/HW reset will happen before the scandump, right?
On most products, there are no scandump interfaces, so core dump is even more important for debugging GPU hang issue.
Regards,
Trigger
>
> Also we need to make sure that the tasks which are already in queue are put
> on hold and the their sync points are signalled before we dump.
> check once what all steps are taken before we dump in the current
> implementation.
Do you mean sometimes like:
drm_sched_wqueue_stop(&ring->sched);
amdgpu_fence_driver_force_completion(ring); // Since there is no GPU reset happened, is it reasonable to call it here?
amdgpu_job_core_dump(adev, job);
Regards,
Trigger
>
> Regards
>
> Sunil khatri
>
> >
> > adev->job_hang = true;
> >
> > @@ -101,6 +157,12 @@ static enum drm_gpu_sched_stat
> amdgpu_job_timedout(struct drm_sched_job *s_job)
> > reset_context.src = AMDGPU_RESET_SRC_JOB;
> > clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
> >
> > + /*
> > + * To avoid an unnecessary extra coredump, as we have
> already
> > + * got the very close representation of GPU's error status
> > + */
> > + set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
> > +
> > r = amdgpu_device_gpu_recover(ring->adev, job,
> &reset_context);
> > if (r)
> > dev_err(adev->dev, "GPU Recovery Failed: %d\n", r);
More information about the amd-gfx
mailing list