[PATCH 1/2] drm: Add GPU reset sysfs event

Tue Mar 8 15:37:37 UTC 2022

On 3/8/2022 5:26 PM, Sharma, Shashank wrote:
>
>
> On 3/8/2022 11:32 AM, Christian König wrote:
>> On 08.03.22 at 10:31, Sharma, Shashank wrote:
>>>
>>>
>>> On 3/8/2022 8:06 AM, Christian König wrote:
>>>> On 07.03.22 at 17:26, Shashank Sharma wrote:
>>>>> From: Shashank Sharma <shashank.sharma at amd.com>
>>>>>
>>>>> This patch adds a new sysfs event that notifies userland
>>>>> of a GPU reset and can also provide some information, such as:
>>>>> - which PID was involved in the GPU reset
>>>>> - what was the GPU status (using flags)
>>>>>
>>>>> This patch also introduces the first flag of the flags
>>>>> bitmap, which can be extended as and when required.
>>>>
>>>> Make sure to CC the dri-devel mailing list when reviewing this.
>>> Got it,
>>>
>>> I was also wondering whether we want to move the reset_ctx structure
>>> itself to the DRM layer, like:
>>> struct drm_reset_event_ctx {
>>>     u32 pid;
>>>     u32 flags;
>>>     char process_name[64];
>>> };
>>
>> I was entertaining that thought as well.
>>
>> But if we do this I would go a step further and also move the
>> reset work item into the DRM layer.
>>
>> You might also look into migrating the existing i915 code, which
>> uses udev to signal GPU resets, to this function as well.
>>
Hi Christian,

Can we access adev in common DRM code? (Even if we can, code that
touches adev would no longer be common code.)

Moving the work function to DRM means it needs to be protected (i.e. by
reset_domain->sem). Should we add something like reset_sem to drm_device?

Regards,

S.Amarnath

>> Regards,
>> Christian.
>
> That seems like a good idea, let me quickly dive into i915 and check 
> this out.
>
> Shashank
>>
>>>
>>> and then:
>>> void drm_sysfs_reset_event(struct drm_device *dev,
>>>                            struct drm_reset_event_ctx *ctx);
>>>
>>>>
>>>>>
>>>>> Cc: Alexander Deucher <alexander.deucher at amd.com>
>>>>> Cc: Christian Koenig <christian.koenig at amd.com>
>>>>> Signed-off-by: Shashank Sharma <shashank.sharma at amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/drm_sysfs.c | 24 ++++++++++++++++++++++++
>>>>>   include/drm/drm_sysfs.h     |  3 +++
>>>>>   2 files changed, 27 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/drm_sysfs.c 
>>>>> b/drivers/gpu/drm/drm_sysfs.c
>>>>> index 430e00b16eec..52a015161431 100644
>>>>> --- a/drivers/gpu/drm/drm_sysfs.c
>>>>> +++ b/drivers/gpu/drm/drm_sysfs.c
>>>>> @@ -409,6 +409,30 @@ void drm_sysfs_hotplug_event(struct 
>>>>> drm_device *dev)
>>>>>   }
>>>>>   EXPORT_SYMBOL(drm_sysfs_hotplug_event);
>>>>> +/**
>>>>> + * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU 
>>>>> reset
>>>>> + * @dev: DRM device
>>>>> + * @pid: The process ID involved in the reset
>>>>> + * @flags: Any other information about the GPU status
>>>>> + *
>>>>> + * Send a uevent for the DRM device specified by @dev. This notifies
>>>>> + * userspace that a GPU reset has occurred, so that interested clients
>>>>> + * can take any recovery or profiling measures when required.
>>>>> + */
>>>>> +void drm_sysfs_reset_event(struct drm_device *dev, uint64_t pid, 
>>>>> uint32_t flags)
>>>>
>>>> The PID is usually only 32bit, but even better would be to use pid_t.
>>>>
>>>>> +{
>>>>> +    unsigned char pid_str[21], flags_str[15];
>>>>> +    unsigned char reset_str[] = "RESET=1";
>>>>> +    char *envp[] = { reset_str, pid_str, flags_str, NULL };
>>>>> +
>>>>> +    DRM_DEBUG("generating reset event\n");
>>>>> +
>>>>> +    snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%llu", pid);
>>>>> +    snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", flags);
>>>>> +    kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
>>>>> +}
>>>>> +EXPORT_SYMBOL(drm_sysfs_reset_event);
>>>>> +
>>>>>   /**
>>>>>    * drm_sysfs_connector_hotplug_event - generate a DRM uevent for 
>>>>> any connector
>>>>>    * change
>>>>> diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
>>>>> index 6273cac44e47..63f00fe8054c 100644
>>>>> --- a/include/drm/drm_sysfs.h
>>>>> +++ b/include/drm/drm_sysfs.h
>>>>> @@ -2,6 +2,8 @@
>>>>>   #ifndef _DRM_SYSFS_H_
>>>>>   #define _DRM_SYSFS_H_
>>>>> +#define DRM_GPU_RESET_FLAG_VRAM_VALID (1 << 0)
>>>>
>>>> Probably better to define that the other way around, e.g. 
>>>> DRM_GPU_RESET_FLAG_VRAM_LOST.
>>>>
>>>> Apart from that looks good to me.
>>>>
>>> Got it, noted.
>>> - Shashank
>>>
>>>> Christian.
>>>>
>>>>> +
>>>>>   struct drm_device;
>>>>>   struct device;
>>>>>   struct drm_connector;
>>>>> @@ -11,6 +13,7 @@ int drm_class_device_register(struct device *dev);
>>>>>   void drm_class_device_unregister(struct device *dev);
>>>>>   void drm_sysfs_hotplug_event(struct drm_device *dev);
>>>>> +void drm_sysfs_reset_event(struct drm_device *dev, uint64_t pid, 
>>>>> uint32_t reset_flags);
>>>>>   void drm_sysfs_connector_hotplug_event(struct drm_connector 
>>>>> *connector);
>>>>>   void drm_sysfs_connector_status_event(struct drm_connector 
>>>>> *connector,
>>>>>                         struct drm_property *property);
>>>>
>>
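
For reference, the uevent payload discussed above (RESET=1, PID=<n>, FLAGS=<n>) could be built and consumed roughly as in the following userspace sketch. This is only an illustration under assumptions, not the kernel code from the patch: all sketch_* names are hypothetical, the flag name follows the VRAM_LOST suggestion from the review rather than the define in the patch, and the parser assumes the listener has already split the event into a NULL-terminated array of KEY=value strings.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag name, following the review suggestion to report
 * VRAM loss rather than VRAM validity. */
#define SKETCH_RESET_FLAG_VRAM_LOST (1u << 0)

/* Hypothetical container for the parsed event fields. */
struct sketch_reset_info {
	int reset;      /* 1 if "RESET=1" was present */
	long pid;       /* value of "PID=" */
	uint32_t flags; /* value of "FLAGS=" */
};

/* Build the PID= and FLAGS= strings with snprintf(), mirroring the
 * approach in the patch. The caller supplies the buffers. */
static void sketch_format_reset_env(long pid, uint32_t flags,
				    char *pid_str, size_t pid_len,
				    char *flags_str, size_t flags_len)
{
	snprintf(pid_str, pid_len, "PID=%ld", pid);
	snprintf(flags_str, flags_len, "FLAGS=%u", (unsigned int)flags);
}

/* Parse a NULL-terminated array of KEY=value strings.
 * Returns 0 if RESET=1 was found, -1 otherwise. */
static int sketch_parse_reset_env(char *const envp[],
				  struct sketch_reset_info *info)
{
	memset(info, 0, sizeof(*info));

	for (size_t i = 0; envp[i]; i++) {
		if (strcmp(envp[i], "RESET=1") == 0)
			info->reset = 1;
		else if (strncmp(envp[i], "PID=", 4) == 0)
			info->pid = strtol(envp[i] + 4, NULL, 10);
		else if (strncmp(envp[i], "FLAGS=", 6) == 0)
			info->flags = (uint32_t)strtoul(envp[i] + 6, NULL, 10);
	}

	return info->reset ? 0 : -1;
}
```

A listener would call sketch_parse_reset_env() on the received strings and then test info.flags & SKETCH_RESET_FLAG_VRAM_LOST. On buffer sizing: "FLAGS=" plus up to ten decimal digits of a 32-bit value plus the terminator needs 17 bytes, so a caller of the formatter should allow at least that much for flags_str.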