[PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
Sharma, Shashank
shashank.sharma at amd.com
Fri Feb 4 16:59:00 UTC 2022
On 2/4/2022 5:50 PM, Lazar, Lijo wrote:
> [AMD Official Use Only]
>
> To explain more -
> It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case which process is going to receive the trace events?
>
> Most likely use case would be related to gpu recovery. Triggering a trace on every reset doesn't look like a good idea.
>
If you observer carefully, we are just providing an infrastructure, the
application's intention is unknown to us. In my opinion it's rather not
a good idea to apply a filter in kernel, with our interpretation of
intention.
For example if an app just wants to count how many resets are happening
due to S3/S4 transition, this infra might become useless. It would
rather be a better idea for the app to learn and ignore these scenarios
which it is not interested in.
This could eventually be just difference in design philosophy maybe :)
- Shashank
> Thanks,
> Lijo
>
> -----Original Message-----
> From: Sharma, Shashank <Shashank.Sharma at amd.com>
> Sent: Friday, February 4, 2022 10:09 PM
> To: Lazar, Lijo <Lijo.Lazar at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>
> Hey Lijo,
> I somehow missed to respond on this comment, pls find inline:
>
> Regards
> Shashank
>
> On 1/22/2022 7:42 AM, Lazar, Lijo wrote:
>>
>>
>> On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
>>> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00
>>> 2001
>>> From: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>>> Date: Fri, 21 Jan 2022 14:19:42 +0530
>>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>>
>>> This patch adds a GPU reset handler for Navi ASIC family, which
>>> typically dumps some of the registersand sends a trace event.
>>>
>>> V2: Accomodated call to work function to send uevent
>>>
>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>>> Signed-off-by: Shashank Sharma <shashank.sharma at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
>>> 1 file changed, 28 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c
>>> b/drivers/gpu/drm/amd/amdgpu/nv.c index 01efda4398e5..ada35d4c5245
>>> 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device
>>> *adev)
>>> }
>>> }
>>>
>>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev) {
>>> + int r = 0, i;
>>> +
>>> + /* original raven doesn't have full asic reset */
>>> + if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>>> + !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>>> + return;
>>> + for (i = 0; i < adev->num_ip_blocks; i++) {
>>> + if (!adev->ip_blocks[i].status.valid)
>>> + continue;
>>> + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
>>> + continue;
>>> + r =
>>> +adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
>>> +
>>> + if (r)
>>> + DRM_ERROR("reset_reg_dumps of IP block <%s> failed
>>> +%d\n",
>>> + adev->ip_blocks[i].version->funcs->name, r);
>>> + }
>>> +
>>> + /* Schedule work to send uevent */
>>> + if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
>>> + DRM_ERROR("failed to add GPU reset work\n");
>>> +
>>> + dump_stack();
>>> +}
>>> +
>>> static int nv_asic_reset(struct amdgpu_device *adev)
>>> {
>>> int ret = 0;
>>>
>>> + amdgpu_reset_dumps(adev);
>>
>> Had a comment on this before. Now there are different reasons (or even
>> no reason like a precautionary reset) to perform reset. A user would
>> be interested in a trace only if the reason is valid.
>>
>> To clarify on why a work shouldn't be scheduled on every reset, check
>> here -
>>
>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amd
>> gpu/amdgpu_drv.c#L2188
> In the example you pointed to, they have a criteria to decide what is a valid reset in their context, in the kernel side itself. So they can take a call if they want to do something about it or not.
>
> But, in our case, we want to send the trace_event to user with some register values on every reset, and it is actually up to the profiling app to interpret (along with what it wants to call a GPU reset). So I don't think this is causing a considerable overhead.
>
> - Shashank
>>
>>
>>
>> Thanks,
>> Lijo
>>
>>> switch (nv_asic_reset_method(adev)) {
>>> case AMD_RESET_METHOD_PCI:
>>> dev_info(adev->dev, "PCI reset\n");
More information about the amd-gfx
mailing list