[PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
Lazar, Lijo
Lijo.Lazar at amd.com
Fri Feb 4 17:11:29 UTC 2022
[AMD Official Use Only]
No, otherwise driver reset only for GPU recovery purpose. S3/S4 is not meant for recovery purpose. It's just a precautionary reset to make sure that everything works fine on resume.
BTW, since this is already providing a set of values it would be useful to provide one more field as the reset reason - RAS error recovery, GPU hung recovery or something else.
Thanks,
Lijo
-----Original Message-----
From: Sharma, Shashank <Shashank.Sharma at amd.com>
Sent: Friday, February 4, 2022 10:37 PM
To: Lazar, Lijo <Lijo.Lazar at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
On 2/4/2022 6:02 PM, Lazar, Lijo wrote:
> [Public]
>
> The problem is app doesn't know why the reset happened. It just receives a bunch of registers to be read. On what basis an app can filter this out?
>
Again, that is contextual analysis capability, which needs to be embedded in the reader app. Even if we filter out the S3/S4 resets in the kernel, the situation remains the same, isn't it ?
- Shashank
> Thanks,
> Lijo
>
> -----Original Message-----
> From: Sharma, Shashank <Shashank.Sharma at amd.com>
> Sent: Friday, February 4, 2022 10:29 PM
> To: Lazar, Lijo <Lijo.Lazar at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Somalapuram,
> Amaranath <Amaranath.Somalapuram at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>
> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>
>
>
> On 2/4/2022 5:50 PM, Lazar, Lijo wrote:
>> [AMD Official Use Only]
>>
>> To explain more -
>> It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case which process is going to receive the trace events?
>>
>> Most likely use case would be related to gpu recovery. Triggering a trace on every reset doesn't look like a good idea.
>>
>
> If you observer carefully, we are just providing an infrastructure, the application's intention is unknown to us. In my opinion it's rather not a good idea to apply a filter in kernel, with our interpretation of intention.
>
> For example if an app just wants to count how many resets are happening due to S3/S4 transition, this infra might become useless. It would rather be a better idea for the app to learn and ignore these scenarios which it is not interested in.
>
> This could eventually be just difference in design philosophy maybe :)
>
> - Shashank
>
>> Thanks,
>> Lijo
>>
>> -----Original Message-----
>> From: Sharma, Shashank <Shashank.Sharma at amd.com>
>> Sent: Friday, February 4, 2022 10:09 PM
>> To: Lazar, Lijo <Lijo.Lazar at amd.com>; amd-gfx at lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Somalapuram,
>> Amaranath <Amaranath.Somalapuram at amd.com>; Koenig, Christian
>> <Christian.Koenig at amd.com>
>> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>
>> Hey Lijo,
>> I somehow missed to respond on this comment, pls find inline:
>>
>> Regards
>> Shashank
>>
>> On 1/22/2022 7:42 AM, Lazar, Lijo wrote:
>>>
>>>
>>> On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
>>>> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17
>>>> 00:00:00
>>>> 2001
>>>> From: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>>>> Date: Fri, 21 Jan 2022 14:19:42 +0530
>>>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>>>
>>>> This patch adds a GPU reset handler for Navi ASIC family, which
>>>> typically dumps some of the registersand sends a trace event.
>>>>
>>>> V2: Accomodated call to work function to send uevent
>>>>
>>>> Signed-off-by: Somalapuram Amaranath
>>>> <Amaranath.Somalapuram at amd.com>
>>>> Signed-off-by: Shashank Sharma <shashank.sharma at amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
>>>> 1 file changed, 28 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> b/drivers/gpu/drm/amd/amdgpu/nv.c index 01efda4398e5..ada35d4c5245
>>>> 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device
>>>> *adev)
>>>> }
>>>> }
>>>>
>>>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev) {
>>>> + int r = 0, i;
>>>> +
>>>> + /* original raven doesn't have full asic reset */
>>>> + if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>>>> + !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>>>> + return;
>>>> + for (i = 0; i < adev->num_ip_blocks; i++) {
>>>> + if (!adev->ip_blocks[i].status.valid)
>>>> + continue;
>>>> + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
>>>> + continue;
>>>> + r =
>>>> +adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
>>>> +
>>>> + if (r)
>>>> + DRM_ERROR("reset_reg_dumps of IP block <%s> failed
>>>> +%d\n",
>>>> + adev->ip_blocks[i].version->funcs->name, r);
>>>> + }
>>>> +
>>>> + /* Schedule work to send uevent */
>>>> + if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
>>>> + DRM_ERROR("failed to add GPU reset work\n");
>>>> +
>>>> + dump_stack();
>>>> +}
>>>> +
>>>> static int nv_asic_reset(struct amdgpu_device *adev)
>>>> {
>>>> int ret = 0;
>>>>
>>>> + amdgpu_reset_dumps(adev);
>>>
>>> Had a comment on this before. Now there are different reasons (or
>>> even no reason like a precautionary reset) to perform reset. A user
>>> would be interested in a trace only if the reason is valid.
>>>
>>> To clarify on why a work shouldn't be scheduled on every reset,
>>> check here -
>>>
>>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/a
>>> m
>>> d
>>> gpu/amdgpu_drv.c#L2188
>> In the example you pointed to, they have a criteria to decide what is a valid reset in their context, in the kernel side itself. So they can take a call if they want to do something about it or not.
>>
>> But, in our case, we want to send the trace_event to user with some register values on every reset, and it is actually up to the profiling app to interpret (along with what it wants to call a GPU reset). So I don't think this is causing a considerable overhead.
>>
>> - Shashank
>>>
>>>
>>>
>>> Thanks,
>>> Lijo
>>>
>>>> switch (nv_asic_reset_method(adev)) {
>>>> case AMD_RESET_METHOD_PCI:
>>>> dev_info(adev->dev, "PCI reset\n");
More information about the amd-gfx
mailing list