[PATCH v2 1/2] drm: Add GPU reset sysfs event
Christian König
christian.koenig at amd.com
Thu Mar 17 09:21:41 UTC 2022
Am 17.03.22 um 09:42 schrieb Sharma, Shashank:
> On 3/16/2022 10:50 PM, Rob Clark wrote:
>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>> <contactshashanksharma at gmail.com> wrote:
>>>
>>> From: Shashank Sharma <shashank.sharma at amd.com>
>>>
>>> This patch adds a new sysfs event, which will indicate
>>> the userland about a GPU reset, and can also provide
>>> some information like:
>>> - process ID of the process involved with the GPU reset
>>> - process name of the involved process
>>> - the GPU status info (using flags)
>>>
>>> This patch also introduces the first flag of the flags
>>> bitmap, which can be appended as and when required.
>>>
>>> V2: Addressed review comments from Christian and Amar
>>> - move the reset information structure to DRM layer
>>> - drop _ctx from struct name
>>> - make pid 32 bit(than 64)
>>> - set flag when VRAM invalid (than valid)
>>> - add process name as well (Amar)
>>>
>>> Cc: Alexandar Deucher <alexander.deucher at amd.com>
>>> Cc: Christian Koenig <christian.koenig at amd.com>
>>> Cc: Amaranath Somalapuram <amaranath.somalapuram at amd.com>
>>> Signed-off-by: Shashank Sharma <shashank.sharma at amd.com>
>>> ---
>>> drivers/gpu/drm/drm_sysfs.c | 31 +++++++++++++++++++++++++++++++
>>> include/drm/drm_sysfs.h | 10 ++++++++++
>>> 2 files changed, 41 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
>>> index 430e00b16eec..840994810910 100644
>>> --- a/drivers/gpu/drm/drm_sysfs.c
>>> +++ b/drivers/gpu/drm/drm_sysfs.c
>>> @@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device
>>> *dev)
>>> }
>>> EXPORT_SYMBOL(drm_sysfs_hotplug_event);
>>>
>>> +/**
>>> + * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
>>> + * @dev: DRM device
>>> + * @reset_info: The contextual information about the reset (like
>>> PID, flags)
>>> + *
>>> + * Send a uevent for the DRM device specified by @dev. This informs
>>> + * user that a GPU reset has occurred, so that an interested client
>>> + * can take any recovery or profiling measure.
>>> + */
>>> +void drm_sysfs_reset_event(struct drm_device *dev, struct
>>> drm_reset_event *reset_info)
>>> +{
>>> + unsigned char pid_str[13];
>>> + unsigned char flags_str[15];
>>> + unsigned char pname_str[TASK_COMM_LEN + 6];
>>> + unsigned char reset_str[] = "RESET=1";
>>> + char *envp[] = { reset_str, pid_str, pname_str, flags_str,
>>> NULL };
>>> +
>>> + if (!reset_info) {
>>> + DRM_WARN("No reset info, not sending the event\n");
>>> + return;
>>> + }
>>> +
>>> + DRM_DEBUG("generating reset event\n");
>>> +
>>> + snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u",
>>> reset_info->pid);
>>> + snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s",
>>> reset_info->pname);
>>> + snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u",
>>> reset_info->flags);
>>> + kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
>>> +}
>>> +EXPORT_SYMBOL(drm_sysfs_reset_event);
>>> +
>>> /**
>>> * drm_sysfs_connector_hotplug_event - generate a DRM uevent for
>>> any connector
>>> * change
>>> diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
>>> index 6273cac44e47..5ba11c760619 100644
>>> --- a/include/drm/drm_sysfs.h
>>> +++ b/include/drm/drm_sysfs.h
>>> @@ -1,16 +1,26 @@
>>> /* SPDX-License-Identifier: GPL-2.0 */
>>> #ifndef _DRM_SYSFS_H_
>>> #define _DRM_SYSFS_H_
>>> +#include <linux/sched.h>
>>> +
>>> +#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)
>>>
>>> struct drm_device;
>>> struct device;
>>> struct drm_connector;
>>> struct drm_property;
>>>
>>> +struct drm_reset_event {
>>> + uint32_t pid;
>>
>> One side note, unrelated to devcoredump vs this..
>>
>> AFAIU you probably want to be passing around a `struct pid *`, and
>> then somehow use pid_vnr() in the context of the process reading the
>> event to get the numeric pid. Otherwise things will not do what you
>> expect if the process triggering the crash is in a different pid
>> namespace from the compositor.
>>
>
> I am not sure if it is a good idea to add the pid extraction
> complexity in here, it is left upto the driver to extract this
> information and pass it to the work queue. In case of AMDGPU, its
> extracted from GPU VM. It would be then more flexible for the drivers
> as well.
Yeah, but that is just used for debugging.
If we want to use the pid for housekeeping, like for a daemon which
kills/restarts processes, we absolutely need that or otherwise won't be
able to work with containers.
Regards,
Christian.
>
> - Shashank
>
>> BR,
>> -R
>>
>>> + uint32_t flags;
>>> + char pname[TASK_COMM_LEN];
>>> +};
>>> +
>>> int drm_class_device_register(struct device *dev);
>>> void drm_class_device_unregister(struct device *dev);
>>>
>>> void drm_sysfs_hotplug_event(struct drm_device *dev);
>>> +void drm_sysfs_reset_event(struct drm_device *dev, struct
>>> drm_reset_event *reset_info);
>>> void drm_sysfs_connector_hotplug_event(struct drm_connector
>>> *connector);
>>> void drm_sysfs_connector_status_event(struct drm_connector
>>> *connector,
>>> struct drm_property *property);
>>> --
>>> 2.32.0
>>>
More information about the amd-gfx
mailing list