[PATCH v4 2/4] drm/amdgpu: Rework coredump to use memory dynamically
Shashank Sharma
shashank.sharma at amd.com
Thu Aug 17 15:42:43 UTC 2023
On 17/08/2023 17:38, André Almeida wrote:
>
>
> Em 17/08/2023 12:26, Shashank Sharma escreveu:
>>
>> On 17/08/2023 17:17, André Almeida wrote:
>>>
>>>
>>> Em 17/08/2023 12:04, Shashank Sharma escreveu:
>>>>
>>>> On 17/08/2023 15:45, André Almeida wrote:
>>>>> Hi Shashank,
>>>>>
>>>>> Em 17/08/2023 03:41, Shashank Sharma escreveu:
>>>>>> Hello Andre,
>>>>>>
>>>>>> On 15/08/2023 21:50, André Almeida wrote:
>>>>>>> Instead of storing coredump information inside amdgpu_device
>>>>>>> struct,
>>>>>>> move if to a proper separated struct and allocate it
>>>>>>> dynamically. This
>>>>>>> will make it easier to further expand the logged information.
>>>>>>>
>>>>>>> Signed-off-by: André Almeida <andrealmeid at igalia.com>
>>>>>>> ---
>>>>>>> v4: change kmalloc to kzalloc
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 14 +++--
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 63
>>>>>>> ++++++++++++++--------
>>>>>>> 2 files changed, 49 insertions(+), 28 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> index 9c6a332261ab..0d560b713948 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> @@ -1088,11 +1088,6 @@ struct amdgpu_device {
>>>>>>> uint32_t *reset_dump_reg_list;
>>>>>>> uint32_t *reset_dump_reg_value;
>>>>>>> int num_regs;
>>>>>>> -#ifdef CONFIG_DEV_COREDUMP
>>>>>>> - struct amdgpu_task_info reset_task_info;
>>>>>>> - bool reset_vram_lost;
>>>>>>> - struct timespec64 reset_time;
>>>>>>> -#endif
>>>>>>> bool scpm_enabled;
>>>>>>> uint32_t scpm_status;
>>>>>>> @@ -1105,6 +1100,15 @@ struct amdgpu_device {
>>>>>>> uint32_t aid_mask;
>>>>>>> };
>>>>>>> +#ifdef CONFIG_DEV_COREDUMP
>>>>>>> +struct amdgpu_coredump_info {
>>>>>>> + struct amdgpu_device *adev;
>>>>>>> + struct amdgpu_task_info reset_task_info;
>>>>>>> + struct timespec64 reset_time;
>>>>>>> + bool reset_vram_lost;
>>>>>>> +};
>>>>>>
>>>>>> The patch looks good to me in general, but I would recommend
>>>>>> slightly different arrangement and segregation of GPU reset
>>>>>> information.
>>>>>>
>>>>>> Please consider a higher level structure adev->gpu_reset_info,
>>>>>> and move everything related to reset dump info into that,
>>>>>> including this new coredump_info structure, something like this:
>>>>>>
>>>>>> struct amdgpu_reset_info {
>>>>>>
>>>>>> uint32_t *reset_dump_reg_list;
>>>>>>
>>>>>> uint32_t *reset_dump_reg_value;
>>>>>>
>>>>>> int num_regs;
>>>>>>
>>>>>
>>>>> Right, I can encapsulate there reset_dump members,
>>>>>
>>>>>> #ifdef CONFIG_DEV_COREDUMP
>>>>>>
>>>>>> struct amdgpu_coredump_info *coredump_info;/* keep this
>>>>>> dynamic allocation */
>>>>>
>>>>> but we don't need a pointer for amdgpu_coredump_info inside
>>>>> amdgpu_device or inside of amdgpu_device->gpu_reset_info, right?
>>>>
>>>> I think it would be better if we keep all of the GPU reset related
>>>> data in the same structure, so adev->gpu_reset_info->coredump_info
>>>> sounds about right to me.
>>>>
>>>
>>> But after patch 2/4, we don't need to store a coredump_info pointer
>>> inside adev, this is what I meant. What would be the purpose of
>>> having this pointer? It's freed by amdgpu_devcoredump_free(), so we
>>> don't need to keep track of it.
>>
>> Well, actually we are pulling in some 0parallel efforts on enhancing
>> the GPU reset information, and we were planning to use the coredump
>> info for some additional things. So if I have the coredump_info
>> available (like reset_task_info and vram_lost) across a few functions
>> in the driver with adev, it would make my job easy there :).
>
> It seems dangerous to use an object with this limited lifetime to rely
> to read on. If you want to use it you will need to change
> amdgpu_devcoredump_free() to drop a reference or you will need to use
> it statically, which defeats the purpose of this patch. Anyway, I'll
> add it as you requested.
>
I guess if the coredump_free function can make the
adev->reset_info->coredump_info= NULL, after freeing it, that will
actually help the case.
While consuming it, I can simply check if
(adev->reset_info->coredump_info) is available to be read.
- Shashank
>>
>> - Shashank
>>
>>>
>>>> - Shashank
>>>>
>>>>>
>>>>>>
>>>>>> #endif
>>>>>>
>>>>>> }
>>>>>>
>>>>>>
>>>>>> This will make sure that all the relevant information is at the
>>>>>> same place.
>>>>>>
>>>>>> - Shashank
>>>>>>
>>>>> amdgpu_inc_vram_lost(tmp_adev);
More information about the amd-gfx
mailing list