[PATCH 3/3] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset
Christian König
ckoenig.leichtzumerken at gmail.com
Tue Jan 18 11:02:00 UTC 2022
On 18.01.22 at 11:40, S, Shirish wrote:
> Hi Shashank,
>
>
> On 1/12/2022 6:30 PM, Sharma, Shashank wrote:
>>
>>
>> On 1/11/2022 12:26 PM, Christian König wrote:
>>> On 11.01.22 at 08:12, Somalapuram Amaranath wrote:
>>>> Add an AMDGPURESET uevent to notify userspace, and collect
>>>> dump_stack() and amdgpu_reset_reg_dumps output.
>>>>
>>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/nv.c | 31 +++++++++++++++++++++++++++++++
>>>> 1 file changed, 31 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> index 2ec1ffb36b1f..41a2c37e825f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> @@ -529,10 +529,41 @@ nv_asic_reset_method(struct amdgpu_device *adev)
>>>> 	}
>>>> }
>>>> +/**
>>>> + * drm_sysfs_reset_event - generate a DRM uevent
>>>> + * @dev: DRM device
>>>> + *
>>>> + * Send a uevent for the DRM device specified by @dev. Currently we only
>>>> + * set AMDGPURESET=1 in the uevent environment, but this could be expanded
>>>> + * to deal with other types of events.
>>>> + *
>>>> + * Any new uapi should be using drm_sysfs_connector_status_event()
>>>> + * for uevents on connector status change.
>>>> + */
>>>> +void drm_sysfs_reset_event(struct drm_device *dev)
>>>> +{
>>>> +	char *event_string = "AMDGPURESET=1";
>>>> +	char *envp[2] = { event_string, NULL };
>>>> +
>>>> +	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
>>>
>>> That won't work like this.
>>>
>>> kobject_uevent_env() needs to allocate memory to send the event to
>>> userspace, and that is not allowed while we do a reset. The Intel
>>> guys fell into the same trap already.
>>>
>>> What we could maybe do is teach kobject_uevent_env() to take gfp
>>> flags and make all allocations from the atomic pool.
>>>
>>> Regards,
>>> Christian.
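
A minimal sketch of that idea, assuming a hypothetical
kobject_uevent_env_gfp() that does not exist in mainline (today's
kobject_uevent_env() hardcodes GFP_KERNEL for its env buffer):

/* Hypothetical gfp-aware variant; not in mainline. */
int kobject_uevent_env_gfp(struct kobject *kobj, enum kobject_action action,
			   char *envp_ext[], gfp_t gfp)
{
	struct kobj_uevent_env *env;

	/* GFP_ATOMIC here draws from the atomic pool instead of
	 * sleeping in reclaim, which makes the reset path safe. */
	env = kzalloc(sizeof(struct kobj_uevent_env), gfp);
	if (!env)
		return -ENOMEM;

	/* ... remainder identical to today's kobject_uevent_env() ... */
}

The reset path would then call it with GFP_ATOMIC:

	kobject_uevent_env_gfp(&dev->primary->kdev->kobj, KOBJ_CHANGE,
			       envp, GFP_ATOMIC);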
>>
>> Hi Amar,
>>
>> I see another problem here,
>>
>> We are sending the event at GPU reset time, but we are collecting the
>> register values only when the corresponding userspace agent calls a
>> read() on the respective sysfs entry.
>
> Is the presumption here that a GPU reset is always triggered within
> the kernel, and user space has to be made aware of it?
Yes, pretty much.
>
> From what I know, OSes/apps use GL extensions like robustness and
> other ways to detect hangs/GPU resets and flush out guilty contexts
> or take appropriate next steps.
>
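
For reference, that detection looks roughly like this with
GL_ARB_robustness (a sketch; it assumes the context was created with a
reset notification strategy and that the entry point has been resolved):

#include <GL/gl.h>
#include <GL/glext.h>

/* Returns nonzero if this GL context observed a GPU reset. */
static int context_was_reset(void)
{
	switch (glGetGraphicsResetStatusARB()) {
	case GL_NO_ERROR:
		return 0;	/* no reset observed */
	case GL_GUILTY_CONTEXT_RESET_ARB:	/* this context hung the GPU */
	case GL_INNOCENT_CONTEXT_RESET_ARB:	/* reset caused elsewhere */
	case GL_UNKNOWN_CONTEXT_RESET_ARB:
	default:
		return 1;	/* tear down and recreate the context */
	}
}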
> BTW, is there any userspace infra already in place that has a
> task/thread listening for reset events, similar to HPD?
No, and it's also completely different from HPD, since we need to
gather and save data recorded at the time of the crash.
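
To illustrate the other direction: a userspace listener for the uevent
proposed in this patch would only be a few lines with libudev. This is
a sketch, not existing infrastructure:

#include <libudev.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	struct udev *udev = udev_new();
	struct udev_monitor *mon;
	int fd;

	/* Listen for raw kernel uevents on the drm subsystem. */
	mon = udev_monitor_new_from_netlink(udev, "kernel");
	udev_monitor_filter_add_match_subsystem_devtype(mon, "drm", NULL);
	udev_monitor_enable_receiving(mon);
	fd = udev_monitor_get_fd(mon);

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };
		struct udev_device *dev;
		const char *reset;

		if (poll(&pfd, 1, -1) <= 0)
			continue;
		dev = udev_monitor_receive_device(mon);
		if (!dev)
			continue;
		reset = udev_device_get_property_value(dev, "AMDGPURESET");
		if (reset && !strcmp(reset, "1"))
			printf("GPU reset on %s, collect the dump now\n",
			       udev_device_get_sysname(dev));
		udev_device_unref(dev);
	}
}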
>
> I believe there are several ways to make user space aware of a reset
> via gpu_reset_counter etc. Also, if the objective is to have a call
> trace upon reset or to dump registers, you can do it in
> amdgpu_device_gpu_recover(), but guard it with a proper CONFIG
>
> that can be enabled in kernel debug builds only, tagging along with
> KASAN etc.
>
> This way there will be less dependency on userspace.
That's a really bad idea. Gathering crash dump data should work on
production kernels as well and is nothing we really need a
compile-time switch for.
Regards,
Christian.
>
>
> Regards,
>
> Shirish S
>
>>
>> There is a fair possibility that the register values have already been
>> reset by the HW by then, and we are reading re-programmed values. At
>> the very least, there will be a race.
>>
>> I think we should change this design as follows:
>> 1. Get into gpu_reset().
>> 2. Collect the register values and save this context into a separate
>> file/node; sending a trace event here would probably be the easiest
>> way (see the sketch after this message).
>> 3. Send the DRM event to the userspace client.
>> 4. The client reads from the trace file and gets the data.
>>
>> - Shashank
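
A sketch of step 2 as a trace event, declared in the driver's trace
header (hypothetical event name and layout; the actual register list is
driver-specific):

/* One record per dumped register, emitted from the reset path
 * before the ASIC is actually reset. */
TRACE_EVENT(amdgpu_reset_reg_dump,
	TP_PROTO(u32 offset, u32 value),
	TP_ARGS(offset, value),
	TP_STRUCT__entry(
		__field(u32, offset)
		__field(u32, value)
	),
	TP_fast_assign(
		__entry->offset = offset;
		__entry->value = value;
	),
	TP_printk("amdgpu reset reg 0x%08x = 0x%08x",
		  __entry->offset, __entry->value)
);

The client in step 4 would then read these records back from
/sys/kernel/tracing/trace_pipe (or via libtraceevent) once it receives
the uevent from step 3.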
>>
>>>
>>>> +}
>>>> +
>>>> +void amdgpu_reset_dumps(struct amdgpu_device *adev)
>>>> +{
>>>> +	struct drm_device *ddev = adev_to_drm(adev);
>>>> +	/* original raven doesn't have full asic reset */
>>>> +	if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>>>> +	    !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>>>> +		return;
>>>> +	drm_sysfs_reset_event(ddev);
>>>> +	dump_stack();
>>>> +}
>>>> +
>>>> static int nv_asic_reset(struct amdgpu_device *adev)
>>>> {
>>>> 	int ret = 0;
>>>> +	amdgpu_reset_dumps(adev);
>>>> 	switch (nv_asic_reset_method(adev)) {
>>>> 	case AMD_RESET_METHOD_PCI:
>>>> 		dev_info(adev->dev, "PCI reset\n");
>>>