[PATCH 2/2] drm/amdgpu: Do core dump immediately when job tmo
Alex Deucher
alexdeucher at gmail.com
Tue Aug 20 15:29:17 UTC 2024
On Tue, Aug 20, 2024 at 11:07 AM Khatri, Sunil <sunil.khatri at amd.com> wrote:
>
>
> On 8/20/2024 7:36 PM, Alex Deucher wrote:
> > On Tue, Aug 20, 2024 at 3:30 AM Huang, Trigger <Trigger.Huang at amd.com> wrote:
> >> [AMD Official Use Only - AMD Internal Distribution Only]
> >>
> >>> -----Original Message-----
> >>> From: Khatri, Sunil <Sunil.Khatri at amd.com>
> >>> Sent: Monday, August 19, 2024 6:31 PM
> >>> To: Huang, Trigger <Trigger.Huang at amd.com>; amd-gfx at lists.freedesktop.org
> >>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
> >>> Subject: Re: [PATCH 2/2] drm/amdgpu: Do core dump immediately when job
> >>> tmo
> >>>
> >>>
> >>> On 8/19/2024 3:23 PM, Trigger.Huang at amd.com wrote:
> >>>> From: Trigger Huang <Trigger.Huang at amd.com>
> >>>>
> >>>> Do the coredump immediately after a job timeout to get a closer
> >>>> representation of GPU's error status.
> >>>>
> >>>> V2: This will skip printing vram_lost as the GPU reset has not happened
> >>>> yet (Alex)
> >>>>
> >>>> V3: Unconditionally call the core dump as we care about all the reset
> >>>> functions (soft recovery, queue reset, and full adapter reset) (Alex)
> >>>>
> >>>> Signed-off-by: Trigger Huang <Trigger.Huang at amd.com>
> >>>> ---
> >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 62 +++++++++++++++++++++++++
> >>>>   1 file changed, 62 insertions(+)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>>> index c6a1783fc9ef..ebbb1434073e 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>>> @@ -30,6 +30,61 @@
> >>>> #include "amdgpu.h"
> >>>> #include "amdgpu_trace.h"
> >>>> #include "amdgpu_reset.h"
> >>>> +#include "amdgpu_dev_coredump.h"
> >>>> +#include "amdgpu_xgmi.h"
> >>>> +
> >>>> +static void amdgpu_job_do_core_dump(struct amdgpu_device *adev,
> >>>> + struct amdgpu_job *job)
> >>>> +{
> >>>> + int i;
> >>>> +
> >>>> + dev_info(adev->dev, "Dumping IP State\n");
> >>>> + for (i = 0; i < adev->num_ip_blocks; i++) {
> >>>> + if (adev->ip_blocks[i].version->funcs->dump_ip_state)
> >>>> + adev->ip_blocks[i].version->funcs
> >>>> + ->dump_ip_state((void *)adev);
> >>>> + dev_info(adev->dev, "Dumping IP State Completed\n");
> >>>> + }
> >>>> +
> >>>> + amdgpu_coredump(adev, true, false, job);
> >>>> +}
> >>>> +
> >>>> +static void amdgpu_job_core_dump(struct amdgpu_device *adev,
> >>>> + struct amdgpu_job *job)
> >>>> +{
> >>>> + struct list_head device_list, *device_list_handle = NULL;
> >>>> + struct amdgpu_device *tmp_adev = NULL;
> >>>> + struct amdgpu_hive_info *hive = NULL;
> >>>> +
> >>>> + if (!amdgpu_sriov_vf(adev))
> >>>> + hive = amdgpu_get_xgmi_hive(adev);
> >>>> + if (hive)
> >>>> + mutex_lock(&hive->hive_lock);
> >>>> + /*
> >>>> + * Reuse the logic in amdgpu_device_gpu_recover() to build list of
> >>>> + * devices for core dump
> >>>> + */
> >>>> + INIT_LIST_HEAD(&device_list);
> >>>> + if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) && hive) {
> >>>> + list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
> >>>> + list_add_tail(&tmp_adev->reset_list, &device_list);
> >>>> + if (!list_is_first(&adev->reset_list, &device_list))
> >>>> + list_rotate_to_front(&adev->reset_list, &device_list);
> >>>> + device_list_handle = &device_list;
> >>>> + } else {
> >>>> + list_add_tail(&adev->reset_list, &device_list);
> >>>> + device_list_handle = &device_list;
> >>>> + }
> >>>> +
> >>>> + /* Do the coredump for each device */
> >>>> + list_for_each_entry(tmp_adev, device_list_handle, reset_list)
> >>>> + amdgpu_job_do_core_dump(tmp_adev, job);
> >>>> +
> >>>> + if (hive) {
> >>>> + mutex_unlock(&hive->hive_lock);
> >>>> + amdgpu_put_xgmi_hive(hive);
> >>>> + }
> >>>> +}
> >>>>
> >>>> static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>>> {
> >>>> @@ -48,6 +103,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>>> return DRM_GPU_SCHED_STAT_ENODEV;
> >>>> }
> >>>>
> >>>> + amdgpu_job_core_dump(adev, job);
> >>> The philosophy of hang recovery is to let the HW and software try to
> >>> recover. Here we try a soft recovery first, and I think we should wait for
> >>> the soft recovery; only if that fails do we dump, and that is exactly what we are doing here.
> >> Hi Sunil,
> >> Thanks for the suggestion, that's reasonable. But my concern is that after soft recovery happens, the GPU's status may change (take gfx 9 for example: it will try to kill the currently hung wave).
> >> Actually, in most cases a real shader hang cannot be resolved through soft recovery, and at that moment we need to get a very close dump/snapshot/representation of the GPU's current error status.
> >> Just like with scandump: when we try to do a scandump for a shader hang, we disable gpu_recovery, and no soft recovery/per-queue reset/HW reset happens before the scandump, right?
> >> On most products there is no scandump interface, so the core dump is even more important for debugging GPU hang issues.
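For reference, a simplified sketch of how recovery is typically held off today for that kind of debugging. It assumes the existing amdgpu.gpu_recovery module parameter and helpers such as amdgpu_ring_soft_recovery() and amdgpu_device_should_recover_gpu(); the exact upstream flow in amdgpu_job_timedout() may differ:

    /* Simplified view of the timeout handler (not the posted patch):
     * with amdgpu.gpu_recovery=0 both the soft recovery attempt and the
     * full reset are skipped, leaving the hang state intact for
     * scandump-style inspection. */
    if (amdgpu_gpu_recovery &&
        amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
        dev_err(adev->dev, "ring %s timeout, but soft recovered\n",
                ring->sched.name);
        goto exit;
    }
    if (amdgpu_device_should_recover_gpu(ring->adev)) {
        /* build reset_context and call amdgpu_device_gpu_recover() */
    } else {
        drm_sched_suspend_timeout(&ring->sched);
    }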
> > I agree that it makes sense to take the dump as early as possible. It
> > makes sense to move it up now that we are starting to have finer
> > grained resets. We may want to wire devcoredump into the KFD side as
> > well at some point.
> In the current implementation we stop the scheduler and wait for
> existing fences to complete, or signal them. But the new place to dump
> has nothing like that, so even though we have timed out, some of the
> threads or waves might still be progressing and the IP dump might be
> changing during that time.
Sure, but that is all sw state. It shouldn't affect the hw state. Or
if it does, it would be indirect. E.g., signalling a fence might
unblock a dependent job, but the root issue would still be valid.
>
> But I am not sure if we want to stop the scheduling of new tasks and end
> the existing ones. How about we have multiple levels of dump, i.e. just
> capture and do not dump, since the first dump is what gets kept, not the last:
> a. Capture after soft reset and let it be overwritten by the next level.
> b. After soft reset fails, capture again before the hard reset is triggered.
>
> So eventually we would have the IP dump file generated with what we are
> interested in.
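A minimal sketch of that two-stage capture idea, reusing the dump_ip_state() callbacks from the patch above; the helper name and the call sites are hypothetical, not part of the posted series:

    /* Hypothetical helper: refresh the per-IP state buffers without
     * emitting a devcoredump yet.  Each capture overwrites the previous
     * one, so only the last snapshot ends up in the final dump. */
    static void amdgpu_job_capture_ip_state(struct amdgpu_device *adev)
    {
        int i;

        for (i = 0; i < adev->num_ip_blocks; i++)
            if (adev->ip_blocks[i].version->funcs->dump_ip_state)
                adev->ip_blocks[i].version->funcs
                    ->dump_ip_state((void *)adev);
    }

    /* a. after soft recovery: capture, allow a later level to overwrite */
    amdgpu_job_capture_ip_state(adev);

    /* b. soft recovery failed: capture again just before the hard reset,
     * then emit the devcoredump from this latest snapshot */
    amdgpu_job_capture_ip_state(adev);
    amdgpu_coredump(adev, true, false, job);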
It might make more sense to slice and dice the dumps more finely, but
I think we'll need to see how things work out as is before we worry
about those details.
Alex
>
> Regards
> Sunil K
>
>
>
> >
> > Alex
> >
> >> Regards,
> >> Trigger
> >>
> >>> Also we need to make sure that the tasks which are already in the queue are
> >>> put on hold and their sync points are signalled before we dump.
> >>> Please check what steps are taken before we dump in the current
> >>> implementation.
> >> Do you mean something like:
> >> drm_sched_wqueue_stop(&ring->sched);
> >> amdgpu_fence_driver_force_completion(ring); // Since no GPU reset has happened, is it reasonable to call it here?
> >> amdgpu_job_core_dump(adev, job);
> >>
> >>
> >> Regards,
> >> Trigger
> >>
> >>> Regards
> >>>
> >>> Sunil khatri
> >>>
> >>>> adev->job_hang = true;
> >>>>
> >>>> @@ -101,6 +157,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>>> reset_context.src = AMDGPU_RESET_SRC_JOB;
> >>>> clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
> >>>>
> >>>> + /*
> >>>> + * To avoid an unnecessary extra coredump, as we have already
> >>>> + * got the very close representation of GPU's error status
> >>>> + */
> >>>> + set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
> >>>> +
> >>>> r = amdgpu_device_gpu_recover(ring->adev, job, &reset_context);
> >>>> if (r)
> >>>> dev_err(adev->dev, "GPU Recovery Failed: %d\n", r);
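For context, a rough sketch of where the AMDGPU_SKIP_COREDUMP flag takes effect later on the recovery path; this assumes the check sits on the full-reset path in amdgpu_do_asic_reset(), and the exact amdgpu_coredump() arguments there may differ:

    /* on the full-reset path: the coredump that would normally be
     * generated after the ASIC reset is skipped when the flag is set,
     * since the dump taken at timeout already captured the error state */
    if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
        amdgpu_coredump(tmp_adev, false, vram_lost, reset_context->job);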