[PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
Lazar, Lijo
Lijo.Lazar at amd.com
Fri Feb 4 16:50:39 UTC 2022
[AMD Official Use Only]
To explain more -
It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case, which process is going to receive the trace events?
The most likely use case is GPU recovery; triggering a trace on every reset doesn't look like a good idea.
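To illustrate, a rough sketch of the kind of gating that would avoid this (it reuses the existing adev->in_suspend flag; treating that as the only filter is just an assumption for the example):

/* Hypothetical sketch: skip the dump and uevent for the unconditional
 * reset done on the S3/S4 path, where no process is listening for the
 * trace events anyway.
 */
static void amdgpu_reset_dumps(struct amdgpu_device *adev)
{
	/* adev->in_suspend is set on the suspend path; a precautionary
	 * reset carries no recovery context worth tracing.
	 */
	if (adev->in_suspend)
		return;

	/* ... dump registers and schedule the uevent work as before ... */
}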
Thanks,
Lijo
-----Original Message-----
From: Sharma, Shashank <Shashank.Sharma at amd.com>
Sent: Friday, February 4, 2022 10:09 PM
To: Lazar, Lijo <Lijo.Lazar at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Somalapuram, Amaranath <Amaranath.Somalapuram at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
Hey Lijo,
I somehow missed responding to this comment; please find my reply inline:
Regards
Shashank
On 1/22/2022 7:42 AM, Lazar, Lijo wrote:
>
>
> On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
>> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001
>> From: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>> Date: Fri, 21 Jan 2022 14:19:42 +0530
>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>
>> This patch adds a GPU reset handler for the Navi ASIC family, which
>> typically dumps some of the registers and sends a trace event.
>>
>> V2: Accommodated call to work function to send uevent
>>
>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>> Signed-off-by: Shashank Sharma <shashank.sharma at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
>> 1 file changed, 28 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
>> index 01efda4398e5..ada35d4c5245 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev)
>> 	}
>> }
>>
>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev)
>> +{
>> +	int r = 0, i;
>> +
>> +	/* original raven doesn't have full asic reset */
>> +	if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>> +	    !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>> +		return;
>> +	for (i = 0; i < adev->num_ip_blocks; i++) {
>> +		if (!adev->ip_blocks[i].status.valid)
>> +			continue;
>> +		if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
>> +			continue;
>> +		r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
>> +
>> +		if (r)
>> +			DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
>> +				  adev->ip_blocks[i].version->funcs->name, r);
>> +	}
>> +
>> +	/* Schedule work to send uevent */
>> +	if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
>> +		DRM_ERROR("failed to add GPU reset work\n");
>> +
>> +	dump_stack();
>> +}
>> +
>> static int nv_asic_reset(struct amdgpu_device *adev)
>> {
>> 	int ret = 0;
>>
>> +	amdgpu_reset_dumps(adev);
>
> I had a comment on this before. Now there are different reasons (or
> even no reason, as with a precautionary reset) to perform a reset. A
> user would be interested in a trace only if the reason is valid.
>
> To clarify why a work item shouldn't be scheduled on every reset,
> check here -
>
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188
In the example you pointed to, they have a criterion, in the kernel itself, for deciding what counts as a valid reset in their context. So they can decide whether or not they want to do something about it.
But in our case, we want to send the trace event to userspace with some register values on every reset, and it is up to the profiling app to interpret it (and to decide what it wants to call a GPU reset). So I don't think this causes considerable overhead.
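For illustration, here is a minimal sketch of the app-side filtering I mean (the tracefs path is standard, but the "amdgpu_reset" string used for matching is an assumption, since the actual event name comes from the earlier patches in this series):

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Assumes the event was enabled beforehand, e.g. via
	 *   echo 1 > /sys/kernel/tracing/events/amdgpu/<event>/enable
	 */
	FILE *fp = fopen("/sys/kernel/tracing/trace_pipe", "r");
	char line[1024];

	if (!fp) {
		perror("trace_pipe");
		return 1;
	}

	while (fgets(line, sizeof(line), fp)) {
		/* App-side policy: the profiler decides which events count
		 * as a "real" GPU reset, e.g. by ignoring those that land
		 * around a suspend/resume cycle.
		 */
		if (strstr(line, "amdgpu_reset"))
			printf("reset event: %s", line);
	}

	fclose(fp);
	return 0;
}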
- Shashank
>
>
>
> Thanks,
> Lijo
>
>> 	switch (nv_asic_reset_method(adev)) {
>> 	case AMD_RESET_METHOD_PCI:
>> 		dev_info(adev->dev, "PCI reset\n");