[PATCH] drm/amdgpu: add post reset IP callback

Wed Apr 3 08:44:09 UTC 2024

On 03/04/2024 09:31, Yu, Lang wrote:
> [AMD Official Use Only - General]
>
>> -----Original Message-----
>> From: Sharma, Shashank <Shashank.Sharma at amd.com>
>> Sent: Wednesday, April 3, 2024 3:19 PM
>> To: Yu, Lang <Lang.Yu at amd.com>; Christian König
>> <ckoenig.leichtzumerken at gmail.com>; amd-gfx at lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian
>> <Christian.Koenig at amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: add post reset IP callback
>>
>> Hey Lang,
>>
>> On 03/04/2024 08:51, Yu, Lang wrote:
>>> [AMD Official Use Only - General]
>>>
>>>> -----Original Message-----
>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>> Sent: Tuesday, April 2, 2024 9:38 PM
>>>> To: Yu, Lang <Lang.Yu at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian
>>>> <Christian.Koenig at amd.com>; Sharma, Shashank
>>>> <Shashank.Sharma at amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: add post reset IP callback
>>>>
>>>> Am 28.03.24 um 05:40 schrieb Lang Yu:
>>>>> There are use cases which need full GPU functionality (e.g., VM
>>>>> update, TLB inavildate) when doing a GPU reset.
>>>>>
>>>>> Especially, the mes/umsch self tests which help validate the hw
>>>>> state after hw init like ring/ib tests.
>>>> I noted that before but just to repeat it once more: We can't do any
>>>> MES or UMSCH validation while doing a GPU reset!
>>> Yes, we can just easily disable it if it doesn't work well.
>>> But it doesn't take too much effort to make it work.
>>> It can expose issues as soon as possible and is useful for debugging
>> purpose.
>> IMO, its not that useful for debugging as well. In case of a problem, It will just
>> timeout waiting for MES packet write and we will still have to find out the
>> actual problem which caused MES to go into bad state in the last GPU reset.
>>>> The ring and IB tests use some pre-allocated memory we put aside for
>>>> the task during driver load and so can execute during GPU reset as well.
>>> If user space can create a VM and allocate memory during GPU reset, it
>>> makes no sense to prevent kernel space from doing that.
>> I think the objection here is mostly about why to do it at all, when it is not
>> helpful. It would be just a maintenance overhead.
> If you think it is not helpful,  why doing ring/ib tests?
That's because they add value during the bootup, when we know that the 
HW is being initialized in a controlled/systematic way. But in case of a 
GPU reset, it could go anyway and could be neither of those.
> I don't think such sanity test is not helpful.
>
> I only talk UMSCH test(different with MES test) here,

Your comment above about ring tests talks about both MES/UMSCH self 
tests, so it attracts attention on both.

If its only about UMSCH tests, I will let someone else to comment on that.

- Shashank

> I don't think it has a maintenance overhead.
>
> Regards,
> Lang
>
>> - Shashank
>>
>>>> But for the MES/UMSCH we need a full blown environment with VM and
>>>> submission infrastructure and setting that up isn't possible here.
>>> At least for UMSCH test, it only uses VM mapping functionality (if we
>>> can create a VM with cpu update mode, that's enough), it doesn't use
>>> other submission functionality.
>>> It is actually a compute context.
>>>
>>>
>>> Regards,
>>> Lang
>>>
>>>> Adding Shashank as well, but I think we should probably just
>>>> completely remove those from the kernel.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Add a post reset IP callback to handle such use cases which will be
>>>>> executed after GPU reset succeeds.
>>>>>
>>>>> Signed-off-by: Lang Yu <Lang.Yu at amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24
>>>> ++++++++++++++++++++++
>>>>>     drivers/gpu/drm/amd/include/amd_shared.h   |  3 +++
>>>>>     2 files changed, 27 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 12dc71a6b5db..feeab9397aab 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -5556,6 +5556,27 @@ static int
>> amdgpu_device_health_check(struct
>>>> list_head *device_list_handle)
>>>>>        return ret;
>>>>>     }
>>>>>
>>>>> +static int amdgpu_device_ip_post_reset(struct amdgpu_device *adev) {
>>>>> +    uint32_t i;
>>>>> +    int r;
>>>>> +
>>>>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>>>>> +            if (!adev->ip_blocks[i].status.valid ||
>>>>> +                !adev->ip_blocks[i].version->funcs->post_reset)
>>>>> +                    continue;
>>>>> +
>>>>> +            r = adev->ip_blocks[i].version->funcs->post_reset(adev);
>>>>> +            if (r) {
>>>>> +                    DRM_ERROR("post reset of IP block <%s>
>>>> failed %d\n",
>>>>> +                              adev->ip_blocks[i].version->funcs->name, r);
>>>>> +                    return r;
>>>>> +            }
>>>>> +    }
>>>>> +
>>>>> +    return r;
>>>>> +}
>>>>> +
>>>>>     /**
>>>>>      * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>>>>>      *
>>>>> @@ -5805,6 +5826,9 @@ int amdgpu_device_gpu_recover(struct
>>>> amdgpu_device *adev,
>>>>>                amdgpu_put_xgmi_hive(hive);
>>>>>        }
>>>>>
>>>>> +    if (!r && !job_signaled)
>>>>> +            r = amdgpu_device_ip_post_reset(adev);
>>>>> +
>>>>>        if (r)
>>>>>                dev_info(adev->dev, "GPU reset end with ret = %d\n",
>>>>> r);
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h
>>>>> b/drivers/gpu/drm/amd/include/amd_shared.h
>>>>> index b0a6256e89f4..33ce30a8e3ab 100644
>>>>> --- a/drivers/gpu/drm/amd/include/amd_shared.h
>>>>> +++ b/drivers/gpu/drm/amd/include/amd_shared.h
>>>>> @@ -287,6 +287,7 @@ enum amd_dpm_forced_level;
>>>>>      * @pre_soft_reset: pre soft reset the IP block
>>>>>      * @soft_reset: soft reset the IP block
>>>>>      * @post_soft_reset: post soft reset the IP block
>>>>> + * @post_reset: handles IP specific post reset stuff(e.g., self
>>>>> + test)
>>>>>      * @set_clockgating_state: enable/disable cg for the IP block
>>>>>      * @set_powergating_state: enable/disable pg for the IP block
>>>>>      * @get_clockgating_state: get current clockgating status @@
>>>>> -316,11
>>>>> +317,13 @@ struct amd_ip_funcs {
>>>>>        int (*pre_soft_reset)(void *handle);
>>>>>        int (*soft_reset)(void *handle);
>>>>>        int (*post_soft_reset)(void *handle);
>>>>> +    int (*post_reset)(void *handle);
>>>>>        int (*set_clockgating_state)(void *handle,
>>>>>                                     enum amd_clockgating_state state);
>>>>>        int (*set_powergating_state)(void *handle,
>>>>>                                     enum amd_powergating_state state);
>>>>>        void (*get_clockgating_state)(void *handle, u64 *flags);
>>>>> +
>>>>>     };
>>>>>
>>>>>