[PATCH 1/2] drm: Add GPU reset sysfs event

Wed Mar 9 08:05:19 UTC 2022

On 08.03.22 at 17:40, Sharma, Shashank wrote:
>
>
> On 3/8/2022 12:56 PM, Sharma, Shashank wrote:
>>
>>
>> On 3/8/2022 11:32 AM, Christian König wrote:
>>> On 08.03.22 at 10:31, Sharma, Shashank wrote:
>>>>
>>>>
>>>> On 3/8/2022 8:06 AM, Christian König wrote:
>>>>> On 07.03.22 at 17:26, Shashank Sharma wrote:
>>>>>> From: Shashank Sharma <shashank.sharma at amd.com>
>>>>>>
>>>>>> This patch adds a new sysfs event, which notifies
>>>>>> userland about a GPU reset, and can also provide
>>>>>> some information like:
>>>>>> - which PID was involved in the GPU reset
>>>>>> - what was the GPU status (using flags)
>>>>>>
>>>>>> This patch also introduces the first flag of the flags
>>>>>> bitmap, which can be appended as and when required.
>>>>>
>>>>> Make sure to CC the dri-devel mailing list when reviewing this.
>>>> Got it,
>>>>
>>>> I was also curious if we want to move the reset_ctx structure 
>>>> itself to DRM layer, like
>>>> drm_reset_event_ctx {
>>>>     u32 pid;
>>>>     u32 flags;
>>>>     char process_name[64];
>>>> };
>>>
>>> I was entertaining that thought as well.
>>>
>>> But if we do this I would go even a step further and also move the 
>>> reset work item into the DRM layer as well.
>>>
>>> You might also look into migrating the existing i915 code, which 
>>> uses udev to signal GPU resets, to this function as well.
>>>
>>> Regards,
>>> Christian.
>>
>> That seems like a good idea, let me quickly dive into i915 and check 
>> this out.
>>
>> Shashank
>
> I had a quick look at i915, and it looks like the i915 and AMDGPU 
> drivers have very different methods of passing data to the work 
> function, via different internal structures. This means it would take 
> a lot of additional work in both drivers to move the work function 
> itself into the DRM layer.
>
> To me, now it seems like it would be better if we can just provide 
> this interface to send the uevent and its structure, and the drivers 
> can collect their information and pass it to WQ in their own way.
>
> How do you feel about it?

That does not sound like a good approach to me. If we add common DRM 
functionality then we need to take the existing drivers into account.

What driver specific information does i915 pass to the work function?

Christian.

>
> - Shashank
>
>>>
>>>>
>>>> and then:
>>>> void drm_sysfs_reset_event(struct drm_device *dev, 
>>>> drm_reset_event_ctx *ctx);
>>>>
>>>>>
>>>>>>
>>>>>> Cc: Alexander Deucher <alexander.deucher at amd.com>
>>>>>> Cc: Christian Koenig <christian.koenig at amd.com>
>>>>>> Signed-off-by: Shashank Sharma <shashank.sharma at amd.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/drm_sysfs.c | 24 ++++++++++++++++++++++++
>>>>>>   include/drm/drm_sysfs.h     |  3 +++
>>>>>>   2 files changed, 27 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/drm_sysfs.c 
>>>>>> b/drivers/gpu/drm/drm_sysfs.c
>>>>>> index 430e00b16eec..52a015161431 100644
>>>>>> --- a/drivers/gpu/drm/drm_sysfs.c
>>>>>> +++ b/drivers/gpu/drm/drm_sysfs.c
>>>>>> @@ -409,6 +409,30 @@ void drm_sysfs_hotplug_event(struct 
>>>>>> drm_device *dev)
>>>>>>   }
>>>>>>   EXPORT_SYMBOL(drm_sysfs_hotplug_event);
>>>>>> +/**
>>>>>> + * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU 
>>>>>> reset
>>>>>> + * @dev: DRM device
>>>>>> + * @pid: The process ID involved in the reset
>>>>>> + * @flags: Any other information about the GPU status
>>>>>> + *
>>>>>> + * Send a uevent for the DRM device specified by @dev. This 
>>>>>> indicates
>>>>>> + * user that a GPU reset has occurred, so that the interested 
>>>>>> client
>>>>>> + * can take any recovery or profiling measure, when required.
>>>>>> + */
>>>>>> +void drm_sysfs_reset_event(struct drm_device *dev, uint64_t pid, 
>>>>>> uint32_t flags)
>>>>>
>>>>> The PID is usually only 32bit, but even better would be to use pid_t.
>>>>>
>>>>>> +{
>>>>>> +    unsigned char pid_str[21], flags_str[15];
>>>>>> +    unsigned char reset_str[] = "RESET=1";
>>>>>> +    char *envp[] = { reset_str, pid_str, flags_str, NULL };
>>>>>> +
>>>>>> +    DRM_DEBUG("generating reset event\n");
>>>>>> +
>>>>>> +    snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%lu", pid);
>>>>>> +    snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", flags);
>>>>>> +    kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(drm_sysfs_reset_event);
>>>>>> +
>>>>>>   /**
>>>>>>    * drm_sysfs_connector_hotplug_event - generate a DRM uevent 
>>>>>> for any connector
>>>>>>    * change
>>>>>> diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
>>>>>> index 6273cac44e47..63f00fe8054c 100644
>>>>>> --- a/include/drm/drm_sysfs.h
>>>>>> +++ b/include/drm/drm_sysfs.h
>>>>>> @@ -2,6 +2,8 @@
>>>>>>   #ifndef _DRM_SYSFS_H_
>>>>>>   #define _DRM_SYSFS_H_
>>>>>> +#define DRM_GPU_RESET_FLAG_VRAM_VALID (1 << 0)
>>>>>
>>>>> Probably better to define that the other way around, e.g. 
>>>>> DRM_GPU_RESET_FLAG_VRAM_LOST.
>>>>>
>>>>> Apart from that looks good to me.
>>>>>
>>>> Got it, noted.
>>>> - Shashank
>>>>
>>>>> Christian.
>>>>>
>>>>>> +
>>>>>>   struct drm_device;
>>>>>>   struct device;
>>>>>>   struct drm_connector;
>>>>>> @@ -11,6 +13,7 @@ int drm_class_device_register(struct device *dev);
>>>>>>   void drm_class_device_unregister(struct device *dev);
>>>>>>   void drm_sysfs_hotplug_event(struct drm_device *dev);
>>>>>> +void drm_sysfs_reset_event(struct drm_device *dev, uint64_t pid, 
>>>>>> uint32_t reset_flags);
>>>>>>   void drm_sysfs_connector_hotplug_event(struct drm_connector 
>>>>>> *connector);
>>>>>>   void drm_sysfs_connector_status_event(struct drm_connector 
>>>>>> *connector,
>>>>>>                         struct drm_property *property);
>>>>>
>>>
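
As a rough user-space illustration of the two review comments in this thread (use pid_t for the PID, and define the flag as "VRAM lost" rather than "VRAM valid"), the uevent environment strings could be built as sketched below. This is only a sketch under stated assumptions, not the patch itself: the names reset_event_ctx, reset_event_env and reset_event_build_env are hypothetical, the struct loosely mirrors the drm_reset_event_ctx idea discussed above, and the flag value is an assumption. In the kernel, the resulting envp array would be handed to kobject_uevent_env() as in the quoted patch.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical flag, named after the review suggestion to signal loss
 * instead of validity. The bit position is an assumption.
 */
#define DRM_GPU_RESET_FLAG_VRAM_LOST (1u << 0)

/* Hypothetical user-space mirror of the context struct proposed in the thread. */
struct reset_event_ctx {
    pid_t pid;              /* pid_t instead of uint64_t, per the review */
    uint32_t flags;
    char process_name[64];  /* carried along, but not emitted in this sketch */
};

/* Storage for the uevent environment; envp points into the buffers below. */
struct reset_event_env {
    char reset_str[8];      /* "RESET=1" plus NUL */
    char pid_str[32];       /* "PID=" plus a generously sized number */
    char flags_str[17];     /* "FLAGS=" plus up to 10 digits plus NUL */
    char *envp[4];          /* NULL-terminated, as kobject_uevent_env() expects */
};

/*
 * Fill env from ctx. Returns 0 on success, -1 if a string would not fit.
 * The format specifiers match the argument types after the explicit casts.
 */
int reset_event_build_env(struct reset_event_env *env,
                          const struct reset_event_ctx *ctx)
{
    int n;

    strcpy(env->reset_str, "RESET=1");

    n = snprintf(env->pid_str, sizeof(env->pid_str), "PID=%ld",
                 (long)ctx->pid);
    if (n < 0 || (size_t)n >= sizeof(env->pid_str))
        return -1;

    n = snprintf(env->flags_str, sizeof(env->flags_str), "FLAGS=%u",
                 (unsigned int)ctx->flags);
    if (n < 0 || (size_t)n >= sizeof(env->flags_str))
        return -1;

    env->envp[0] = env->reset_str;
    env->envp[1] = env->pid_str;
    env->envp[2] = env->flags_str;
    env->envp[3] = NULL;
    return 0;
}
```

A caller would fill a reset_event_ctx, call reset_event_build_env(), and then iterate over envp until the NULL terminator. Keeping the buffers and the pointer array in one struct avoids dangling pointers if the environment outlives the function that built it.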