[PATCH 1/2] drm: Add GPU reset sysfs event
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Tue Mar 8 17:17:08 UTC 2022
On 2022-03-08 12:04, Somalapuram, Amaranath wrote:
>
> On 3/8/2022 10:27 PM, Sharma, Shashank wrote:
>>
>>
>> On 3/8/2022 5:55 PM, Andrey Grodzovsky wrote:
>>> You can read on their side here -
>>> https://www.phoronix.com/scan.php?page=news_item&px=AMD-STB-Linux-5.17
>>> and see their patch. They don't have as clean an
>>> interface as we do to retrieve the buffer, and currently it's
>>> hard-coded for a debugfs dump, but it looks pretty
>>> straightforward to expose their buffer to an external
>>> client like amdgpu.
>>
> Customer requirement is to get a reset notification for their daemon
> with other info (like PID, process name, VRAM status).
In general, when a failure happens we want to have all the debug info
possible so that we are better able to root-cause the problem. Since this
is an open forum I am not sure how much I can disclose about the data in
the buffer, but I guarantee you it is very useful for debugging GPU hang
causes.
Andrey
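
For illustration only, here is a rough sketch of how a userspace daemon
might pick the fields out of such a reset uevent. It assumes the RESET=1,
PID= and FLAGS= keys from the patch quoted below, and that the payload
arrives as NUL-separated KEY=value strings, which is how kobject uevents
are delivered over netlink. The struct, macro and function names are made
up for this example and are not part of any existing library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Mirrors DRM_GPU_RESET_FLAG_VRAM_VALID from the quoted patch. */
#define RESET_FLAG_VRAM_VALID (1u << 0)

/* Hypothetical container for the fields carried by the reset uevent. */
struct reset_event {
	bool     is_reset;  /* saw RESET=1 */
	uint64_t pid;       /* value of PID=, 0 if absent */
	uint32_t flags;     /* value of FLAGS=, 0 if absent */
};

/*
 * Walk a uevent payload: NUL-terminated strings packed back to back in
 * buf[0..len). Entries that are not RESET/PID/FLAGS are ignored.
 * Returns true if a RESET=1 entry was found.
 */
static bool parse_reset_event(const char *buf, size_t len,
			      struct reset_event *ev)
{
	const char *p = buf;
	const char *end = buf + len;

	memset(ev, 0, sizeof(*ev));
	while (p < end) {
		const char *nul = memchr(p, '\0', (size_t)(end - p));

		if (!nul)
			break;  /* unterminated trailing entry, stop here */
		if (strcmp(p, "RESET=1") == 0)
			ev->is_reset = true;
		else if (strncmp(p, "PID=", 4) == 0)
			ev->pid = strtoull(p + 4, NULL, 10);
		else if (strncmp(p, "FLAGS=", 6) == 0)
			ev->flags = (uint32_t)strtoul(p + 6, NULL, 10);
		p = nul + 1;
	}
	return ev->is_reset;
}
```

A daemon would call parse_reset_event() on each message read from its
uevent socket and then test ev.flags & RESET_FLAG_VRAM_VALID to decide
what to do about its VRAM contents.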
>
> Regards,
> S.Amarnath
>> Noted, thanks for the pointer.
>> - Shashank
>>>
>>> Andrey
>>>
>>> On 2022-03-08 11:46, Sharma, Shashank wrote:
>>>> I have a very limited understanding of PMC driver and its
>>>> interfaces, so I would just go ahead and rely on Andrey's
>>>> judgement/recommendation on this :)
>>>>
>>>> - Shashank
>>>>
>>>> On 3/8/2022 5:39 PM, Andrey Grodzovsky wrote:
>>>>> As long as the PMC driver provides a clear interface to retrieve the
>>>>> info, there should be no issue calling either the amdgpu interface or
>>>>> the PMC interface using IS_APU (or something similar in the code).
>>>>> We should probably add a wrapper function around this logic in
>>>>> amdgpu.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2022-03-08 11:36, Lazar, Lijo wrote:
>>>>>>
>>>>>> +Mario
>>>>>>
>>>>>> I guess that means the functionality needs to be present in
>>>>>> amdgpu for APUs as well. Presently, this is taken care of by the
>>>>>> PMC driver for APUs.
>>>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>> ------------------------------------------------------------------------
>>>>>>
>>>>>> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf
>>>>>> of Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>> *Sent:* Tuesday, March 8, 2022 9:55:03 PM
>>>>>> *To:* Shashank Sharma <contactshashanksharma at gmail.com>;
>>>>>> amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
>>>>>> *Cc:* Deucher, Alexander <Alexander.Deucher at amd.com>;
>>>>>> Somalapuram, Amaranath <Amaranath.Somalapuram at amd.com>; Koenig,
>>>>>> Christian <Christian.Koenig at amd.com>; Sharma, Shashank
>>>>>> <Shashank.Sharma at amd.com>
>>>>>> *Subject:* Re: [PATCH 1/2] drm: Add GPU reset sysfs event
>>>>>>
>>>>>> On 2022-03-07 11:26, Shashank Sharma wrote:
>>>>>> > From: Shashank Sharma <shashank.sharma at amd.com>
>>>>>> >
>>>>>> > This patch adds a new sysfs event, which will notify
>>>>>> > userland about a GPU reset, and can also provide
>>>>>> > some information like:
>>>>>> > - which PID was involved in the GPU reset
>>>>>> > - what was the GPU status (using flags)
>>>>>> >
>>>>>> > This patch also introduces the first flag of the flags
>>>>>> > bitmap, which can be extended as and when required.
>>>>>>
>>>>>>
>>>>>> I am reminding you again about another important piece of info
>>>>>> which you can add here, and that is the Smart Trace Buffer dump [1].
>>>>>> The buffer size is HW specific, but from what I see there is no
>>>>>> problem with just appending it as part of the envp[] initialization
>>>>>> below.
>>>>>>
>>>>>> The interface to get the buffer is smu_stb_collect_info, and its
>>>>>> usage can be seen in the debugfs interface in smu_stb_debugfs_open.
>>>>>>
>>>>>> [1] - https://www.spinics.net/lists/amd-gfx/msg70751.html
>>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>> >
>>>>>> > Cc: Alexander Deucher <alexander.deucher at amd.com>
>>>>>> > Cc: Christian Koenig <christian.koenig at amd.com>
>>>>>> > Signed-off-by: Shashank Sharma <shashank.sharma at amd.com>
>>>>>> > ---
>>>>>> > drivers/gpu/drm/drm_sysfs.c | 24 ++++++++++++++++++++++++
>>>>>> > include/drm/drm_sysfs.h | 3 +++
>>>>>> > 2 files changed, 27 insertions(+)
>>>>>> >
>>>>>> > diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
>>>>>> > index 430e00b16eec..52a015161431 100644
>>>>>> > --- a/drivers/gpu/drm/drm_sysfs.c
>>>>>> > +++ b/drivers/gpu/drm/drm_sysfs.c
>>>>>> > @@ -409,6 +409,30 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
>>>>>> > }
>>>>>> > EXPORT_SYMBOL(drm_sysfs_hotplug_event);
>>>>>> >
>>>>>> > +/**
>>>>>> > + * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
>>>>>> > + * @dev: DRM device
>>>>>> > + * @pid: The process ID involved in the reset
>>>>>> > + * @flags: Any other information about the GPU status
>>>>>> > + *
>>>>>> > + * Send a uevent for the DRM device specified by @dev. This tells the
>>>>>> > + * user that a GPU reset has occurred, so that the interested client
>>>>>> > + * can take any recovery or profiling measure, when required.
>>>>>> > + */
>>>>>> > +void drm_sysfs_reset_event(struct drm_device *dev, uint64_t pid, uint32_t flags)
>>>>>> > +{
>>>>>> > + unsigned char pid_str[21], flags_str[15];
>>>>>> > + unsigned char reset_str[] = "RESET=1";
>>>>>> > + char *envp[] = { reset_str, pid_str, flags_str, NULL };
>>>>>> > +
>>>>>> > + DRM_DEBUG("generating reset event\n");
>>>>>> > +
>>>>>> > + snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%lu", pid);
>>>>>> > + snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", flags);
>>>>>> > + kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
>>>>>> > +}
>>>>>> > +EXPORT_SYMBOL(drm_sysfs_reset_event);
>>>>>> > +
>>>>>> > /**
>>>>>> > * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
>>>>>> > * change
>>>>>> > diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
>>>>>> > index 6273cac44e47..63f00fe8054c 100644
>>>>>> > --- a/include/drm/drm_sysfs.h
>>>>>> > +++ b/include/drm/drm_sysfs.h
>>>>>> > @@ -2,6 +2,8 @@
>>>>>> > #ifndef _DRM_SYSFS_H_
>>>>>> > #define _DRM_SYSFS_H_
>>>>>> >
>>>>>> > +#define DRM_GPU_RESET_FLAG_VRAM_VALID (1 << 0)
>>>>>> > +
>>>>>> > struct drm_device;
>>>>>> > struct device;
>>>>>> > struct drm_connector;
>>>>>> > @@ -11,6 +13,7 @@ int drm_class_device_register(struct device *dev);
>>>>>> > void drm_class_device_unregister(struct device *dev);
>>>>>> >
>>>>>> > void drm_sysfs_hotplug_event(struct drm_device *dev);
>>>>>> > +void drm_sysfs_reset_event(struct drm_device *dev, uint64_t pid, uint32_t reset_flags);
>>>>>> > void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
>>>>>> > void drm_sysfs_connector_status_event(struct drm_connector *connector,
>>>>>> > struct drm_property *property);
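
A side note on the buffer sizes used in drm_sysfs_reset_event() above,
written as a userspace sketch rather than kernel code. snprintf() never
overflows the destination, but it silently truncates, and the worst-case
strings are longer than the 21 and 15 bytes the patch reserves: "PID="
plus 20 digits plus the terminator is 25 bytes, and "FLAGS=" plus 10
digits plus the terminator is 17 bytes. Real PIDs are far smaller, so
this may never trigger in practice. The macro and helper names below are
hypothetical, and PRIu64 is the userspace counterpart of what would be
%llu in the kernel.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* "PID=" + up to 20 decimal digits of a 64-bit value + NUL = 25 */
#define PID_STR_LEN   (sizeof("PID=") - 1 + 20 + 1)
/* "FLAGS=" + up to 10 decimal digits of a 32-bit value + NUL = 17 */
#define FLAGS_STR_LEN (sizeof("FLAGS=") - 1 + 10 + 1)

/*
 * Both helpers return what snprintf() returns: the length the full
 * string needs, excluding the NUL. A return value >= size means the
 * output in dst was truncated.
 */
static int format_pid(char *dst, size_t size, uint64_t pid)
{
	return snprintf(dst, size, "PID=%" PRIu64, pid);
}

static int format_flags(char *dst, size_t size, uint32_t flags)
{
	return snprintf(dst, size, "FLAGS=%" PRIu32, flags);
}
```

Comparing the return value against the buffer size is the usual way to
detect truncation, so a caller can either size the buffers as above or
check and warn.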