[PATCH v4 2/4] drm/amdgpu: Rework coredump to use memory dynamically

Shashank Sharma shashank.sharma at amd.com
Thu Aug 17 15:42:43 UTC 2023


On 17/08/2023 17:38, André Almeida wrote:
>
>
> Em 17/08/2023 12:26, Shashank Sharma escreveu:
>>
>> On 17/08/2023 17:17, André Almeida wrote:
>>>
>>>
>>> Em 17/08/2023 12:04, Shashank Sharma escreveu:
>>>>
>>>> On 17/08/2023 15:45, André Almeida wrote:
>>>>> Hi Shashank,
>>>>>
>>>>> Em 17/08/2023 03:41, Shashank Sharma escreveu:
>>>>>> Hello Andre,
>>>>>>
>>>>>> On 15/08/2023 21:50, André Almeida wrote:
>>>>>>> Instead of storing coredump information inside amdgpu_device 
>>>>>>> struct,
>>>>>>> move if to a proper separated struct and allocate it 
>>>>>>> dynamically. This
>>>>>>> will make it easier to further expand the logged information.
>>>>>>>
>>>>>>> Signed-off-by: André Almeida <andrealmeid at igalia.com>
>>>>>>> ---
>>>>>>> v4: change kmalloc to kzalloc
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 14 +++--
>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 63 
>>>>>>> ++++++++++++++--------
>>>>>>>   2 files changed, 49 insertions(+), 28 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> index 9c6a332261ab..0d560b713948 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> @@ -1088,11 +1088,6 @@ struct amdgpu_device {
>>>>>>>       uint32_t *reset_dump_reg_list;
>>>>>>>       uint32_t            *reset_dump_reg_value;
>>>>>>>       int                             num_regs;
>>>>>>> -#ifdef CONFIG_DEV_COREDUMP
>>>>>>> -    struct amdgpu_task_info         reset_task_info;
>>>>>>> -    bool                            reset_vram_lost;
>>>>>>> -    struct timespec64               reset_time;
>>>>>>> -#endif
>>>>>>>       bool                            scpm_enabled;
>>>>>>>       uint32_t                        scpm_status;
>>>>>>> @@ -1105,6 +1100,15 @@ struct amdgpu_device {
>>>>>>>       uint32_t            aid_mask;
>>>>>>>   };
>>>>>>> +#ifdef CONFIG_DEV_COREDUMP
>>>>>>> +struct amdgpu_coredump_info {
>>>>>>> +    struct amdgpu_device        *adev;
>>>>>>> +    struct amdgpu_task_info         reset_task_info;
>>>>>>> +    struct timespec64               reset_time;
>>>>>>> +    bool                            reset_vram_lost;
>>>>>>> +};
>>>>>>
>>>>>> The patch looks good to me in general, but I would recommend 
>>>>>> slightly different arrangement and segregation of GPU reset 
>>>>>> information.
>>>>>>
>>>>>> Please consider a higher level structure adev->gpu_reset_info, 
>>>>>> and move everything related to reset dump info into that, 
>>>>>> including this new coredump_info structure, something like this:
>>>>>>
>>>>>> struct amdgpu_reset_info {
>>>>>>
>>>>>>      uint32_t *reset_dump_reg_list;
>>>>>>
>>>>>>      uint32_t *reset_dump_reg_value;
>>>>>>
>>>>>>      int num_regs;
>>>>>>
>>>>>
>>>>> Right, I can encapsulate there reset_dump members,
>>>>>
>>>>>> #ifdef CONFIG_DEV_COREDUMP
>>>>>>
>>>>>>     struct amdgpu_coredump_info *coredump_info;/* keep this 
>>>>>> dynamic allocation */
>>>>>
>>>>> but we don't need a pointer for amdgpu_coredump_info inside 
>>>>> amdgpu_device or inside of amdgpu_device->gpu_reset_info, right?
>>>>
>>>> I think it would be better if we keep all of the GPU reset related 
>>>> data in the same structure, so adev->gpu_reset_info->coredump_info 
>>>> sounds about right to me.
>>>>
>>>
>>> But after patch 2/4, we don't need to store a coredump_info pointer 
>>> inside adev, this is what I meant. What would be the purpose of 
>>> having this pointer? It's freed by amdgpu_devcoredump_free(), so we 
>>> don't need to keep track of it.
>>
>> Well, actually we are pulling in some 0parallel efforts on enhancing 
>> the GPU reset information, and we were planning to use the coredump 
>> info for some additional things. So if I have the coredump_info 
>> available (like reset_task_info and vram_lost) across a few functions 
>> in the driver with adev, it would make my job easy there :).
>
> It seems dangerous to use an object with this limited lifetime to rely 
> to read on. If you want to use it you will need to change 
> amdgpu_devcoredump_free() to drop a reference or you will need to use 
> it statically, which defeats the purpose of this patch. Anyway, I'll 
> add it as you requested.
>
I guess if the coredump_free function can make the

adev->reset_info->coredump_info= NULL, after freeing it, that will 
actually help the case.


While consuming it, I can simply check if 
(adev->reset_info->coredump_info) is available to be read.

- Shashank

>>
>> - Shashank
>>
>>>
>>>> - Shashank
>>>>
>>>>>
>>>>>>
>>>>>> #endif
>>>>>>
>>>>>> }
>>>>>>
>>>>>>
>>>>>> This will make sure that all the relevant information is at the 
>>>>>> same place.
>>>>>>
>>>>>> - Shashank
>>>>>>
>>>>>        amdgpu_inc_vram_lost(tmp_adev);


More information about the dri-devel mailing list