[PATCH 2/2] drm/amdgpu: Do core dump immediately when job tmo

Khatri, Sunil sunil.khatri at amd.com
Tue Aug 20 15:07:30 UTC 2024


On 8/20/2024 7:36 PM, Alex Deucher wrote:
> On Tue, Aug 20, 2024 at 3:30 AM Huang, Trigger <Trigger.Huang at amd.com> wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>>> -----Original Message-----
>>> From: Khatri, Sunil <Sunil.Khatri at amd.com>
>>> Sent: Monday, August 19, 2024 6:31 PM
>>> To: Huang, Trigger <Trigger.Huang at amd.com>; amd-gfx at lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
>>> Subject: Re: [PATCH 2/2] drm/amdgpu: Do core dump immediately when job tmo
>>>
>>>
>>> On 8/19/2024 3:23 PM, Trigger.Huang at amd.com wrote:
>>>> From: Trigger Huang <Trigger.Huang at amd.com>
>>>>
>>>> Do the coredump immediately after a job timeout to get a closer
>>>> representation of the GPU's error status.
>>>>
>>>> V2: This will skip printing vram_lost as the GPU reset has not happened
>>>> yet (Alex)
>>>>
>>>> V3: Unconditionally call the core dump as we care about all the reset
>>>> functions (soft recovery, queue reset and full adapter reset, Alex)
>>>>
>>>> Signed-off-by: Trigger Huang <Trigger.Huang at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 62 +++++++++++++++++++++++++
>>>>    1 file changed, 62 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index c6a1783fc9ef..ebbb1434073e 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -30,6 +30,61 @@
>>>>    #include "amdgpu.h"
>>>>    #include "amdgpu_trace.h"
>>>>    #include "amdgpu_reset.h"
>>>> +#include "amdgpu_dev_coredump.h"
>>>> +#include "amdgpu_xgmi.h"
>>>> +
>>>> +static void amdgpu_job_do_core_dump(struct amdgpu_device *adev,
>>>> +                               struct amdgpu_job *job)
>>>> +{
>>>> +   int i;
>>>> +
>>>> +   dev_info(adev->dev, "Dumping IP State\n");
>>>> +   for (i = 0; i < adev->num_ip_blocks; i++)
>>>> +           if (adev->ip_blocks[i].version->funcs->dump_ip_state)
>>>> +                   adev->ip_blocks[i].version->funcs
>>>> +                           ->dump_ip_state((void *)adev);
>>>> +   dev_info(adev->dev, "Dumping IP State Completed\n");
>>>> +
>>>> +   amdgpu_coredump(adev, true, false, job);
>>>> +}
>>>> +
>>>> +static void amdgpu_job_core_dump(struct amdgpu_device *adev,
>>>> +                            struct amdgpu_job *job)
>>>> +{
>>>> +   struct list_head device_list, *device_list_handle =  NULL;
>>>> +   struct amdgpu_device *tmp_adev = NULL;
>>>> +   struct amdgpu_hive_info *hive = NULL;
>>>> +
>>>> +   if (!amdgpu_sriov_vf(adev))
>>>> +           hive = amdgpu_get_xgmi_hive(adev);
>>>> +   if (hive)
>>>> +           mutex_lock(&hive->hive_lock);
>>>> +   /*
>>>> +    * Reuse the logic in amdgpu_device_gpu_recover() to build list of
>>>> +    * devices for coredump
>>>> +    */
>>>> +   INIT_LIST_HEAD(&device_list);
>>>> +   if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) && hive) {
>>>> +           list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
>>>> +                   list_add_tail(&tmp_adev->reset_list, &device_list);
>>>> +           if (!list_is_first(&adev->reset_list, &device_list))
>>>> +                   list_rotate_to_front(&adev->reset_list, &device_list);
>>>> +           device_list_handle = &device_list;
>>>> +   } else {
>>>> +           list_add_tail(&adev->reset_list, &device_list);
>>>> +           device_list_handle = &device_list;
>>>> +   }
>>>> +
>>>> +   /* Do the coredump for each device */
>>>> +   list_for_each_entry(tmp_adev, device_list_handle, reset_list)
>>>> +           amdgpu_job_do_core_dump(tmp_adev, job);
>>>> +
>>>> +   if (hive) {
>>>> +           mutex_unlock(&hive->hive_lock);
>>>> +           amdgpu_put_xgmi_hive(hive);
>>>> +   }
>>>> +}
>>>>
>>>>    static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>    {
>>>> @@ -48,6 +103,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>              return DRM_GPU_SCHED_STAT_ENODEV;
>>>>      }
>>>>
>>>> +   amdgpu_job_core_dump(adev, job);
>>> The philosophy of hang and recovery is to let the HW and software try to
>>> recover. Here we try a soft recovery first, and I think we should wait for
>>> the soft recovery; if it fails, then we do the dump, and that's exactly what we are doing here.
>> Hi Sunil,
>> Thanks for the suggestion, and that's reasonable. But my concern is that after a soft recovery has happened, the GPU's status may change (take gfx9 for example: it will try to kill the currently hung waves).
>>   Actually, in most cases a real shader hang cannot be resolved through soft recovery, and at that moment we need to get a very close dump/snapshot/representation of the GPU's current error status.
>> Just like with a scandump: when we try to do a scandump for a shader hang, we will disable gpu_recovery, and no soft recovery/per-queue reset/HW reset will happen before the scandump, right?
>> On most products there are no scandump interfaces, so the core dump is even more important for debugging GPU hang issues.
> I agree that it makes sense to take the dump as early as possible.  It
> makes sense to move it up now that we are starting to have finer
> grained resets.  We may want to wire devcoredump into the KFD side as
> well at some point.
In the current implementation we do stop the scheduler and wait for
existing fences to complete, or signal them. But the new place to dump
will not have anything like that, and even though we have hit the timeout,
some of the threads or waves might still be progressing, so the IP dump
might be changing while we capture it.
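
Just to illustrate the ordering difference being discussed, here is a rough
sketch (not something I have tested; whether force-completing the fences is
even appropriate here is exactly the open question further down) of what a
"quiesce first, then dump" flow per ring could look like:

    int i;

    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
            struct amdgpu_ring *ring = adev->rings[i];

            if (!ring || !drm_sched_wqueue_ready(&ring->sched))
                    continue;

            /* stop feeding new jobs from this scheduler to the HW */
            drm_sched_wqueue_stop(&ring->sched);
            /* signal whatever fences are still outstanding on the ring */
            amdgpu_fence_driver_force_completion(ring);
    }
    /* only now take the snapshot */
    amdgpu_job_core_dump(adev, job);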

But I am not sure if we want to stop the scheduling of new tasks and end
the existing ones. How about we have multiple levels of dumps, i.e. just
capture the state and do not register the dump yet, since the first dump we
register is the one that is kept, not the last:
a. Capture after the soft reset and let it be overwritten by the next level.
b. If the soft reset fails, capture again before the hard reset is triggered.

So eventually we would have the IP dump file generated with exactly the
state we are interested in. A rough sketch of that idea is below.
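
A very rough, illustrative sketch of that capture/dump split (the helper
name amdgpu_job_capture_ip_state() is made up here; it just reuses the
dump_ip_state loop and the amdgpu_coredump() call from the patch above):

    /* Illustrative only: separate the capture from the devcoredump
     * registration so the snapshot can be refreshed at each escalation
     * level and the dump file is generated exactly once.
     */
    static void amdgpu_job_capture_ip_state(struct amdgpu_device *adev)
    {
            int i;

            /* overwrites whatever the IP blocks captured previously */
            for (i = 0; i < adev->num_ip_blocks; i++)
                    if (adev->ip_blocks[i].version->funcs->dump_ip_state)
                            adev->ip_blocks[i].version->funcs
                                    ->dump_ip_state((void *)adev);
    }

    /* a) soft reset attempted: take a first capture, may be overwritten */
    amdgpu_job_capture_ip_state(adev);

    /* b) soft reset failed: refresh the capture right before the hard
     * reset, then register the devcoredump file once
     */
    amdgpu_job_capture_ip_state(adev);
    amdgpu_coredump(adev, true, false, job);

That way only one devcoredump is registered, but it reflects the latest
pre-reset capture.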

Regards
Sunil K



>
> Alex
>
>> Regards,
>> Trigger
>>
>>> Also we need to make sure that the tasks which are already in the queue are put
>>> on hold and their sync points are signalled before we dump.
>>> Please check what steps are taken before we dump in the current
>>> implementation.
>> Do you mean something like:
>>          drm_sched_wqueue_stop(&ring->sched);
>>          amdgpu_fence_driver_force_completion(ring); // Since no GPU reset has happened, is it reasonable to call it here?
>>          amdgpu_job_core_dump(adev, job);
>>
>>
>> Regards,
>> Trigger
>>
>>> Regards
>>>
>>> Sunil khatri
>>>
>>>>      adev->job_hang = true;
>>>>
>>>> @@ -101,6 +157,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>              reset_context.src = AMDGPU_RESET_SRC_JOB;
>>>>              clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>>>>
>>>> +           /*
>>>> +            * To avoid an unnecessary extra coredump, as we have already
>>>> +            * got the very close representation of GPU's error status
>>>> +            */
>>>> +           set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
>>>> +
>>>>              r = amdgpu_device_gpu_recover(ring->adev, job, &reset_context);
>>>>              if (r)
>>>>                      dev_err(adev->dev, "GPU Recovery Failed: %d\n", r);

