[PATCH v2 1/2] drm: Add GPU reset sysfs event

Thu Mar 10 20:17:12 UTC 2022

On 3/10/2022 8:56 PM, Rob Clark wrote:
> On Thu, Mar 10, 2022 at 11:44 AM Sharma, Shashank
> <shashank.sharma at amd.com> wrote:
>>
>>
>>
>> On 3/10/2022 8:35 PM, Rob Clark wrote:
>>> On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank
>>> <shashank.sharma at amd.com> wrote:
>>>>
>>>>
>>>>
>>>> On 3/10/2022 7:33 PM, Abhinav Kumar wrote:
>>>>>
>>>>>
>>>>> On 3/10/2022 9:40 AM, Rob Clark wrote:
>>>>>> On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
>>>>>> <shashank.sharma at amd.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 3/10/2022 6:10 PM, Rob Clark wrote:
>>>>>>>> On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
>>>>>>>> <shashank.sharma at amd.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/10/2022 4:24 PM, Rob Clark wrote:
>>>>>>>>>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
>>>>>>>>>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Am 09.03.22 um 19:12 schrieb Rob Clark:
>>>>>>>>>>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>>>>>>>>>>>> <contactshashanksharma at gmail.com> wrote:
>>>>>>>>>>>>> From: Shashank Sharma <shashank.sharma at amd.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This patch adds a new sysfs event, which will indicate
>>>>>>>>>>>>> the userland about a GPU reset, and can also provide
>>>>>>>>>>>>> some information like:
>>>>>>>>>>>>> - process ID of the process involved with the GPU reset
>>>>>>>>>>>>> - process name of the involved process
>>>>>>>>>>>>> - the GPU status info (using flags)
>>>>>>>>>>>>>
>>>>>>>>>>>>> This patch also introduces the first flag of the flags
>>>>>>>>>>>>> bitmap, which can be appended as and when required.
>>>>>>>>>>>> Why invent something new, rather than using the already existing
>>>>>>>>>>>> devcoredump?
>>>>>>>>>>>
>>>>>>>>>>> Yeah, that's a really valid question.
>>>>>>>>>>>
>>>>>>>>>>>> I don't think we need (or should encourage/allow) something drm
>>>>>>>>>>>> specific when there is already an existing solution used by both
>>>>>>>>>>>> drm
>>>>>>>>>>>> and non-drm drivers.  Userspace should not have to learn to support
>>>>>>>>>>>> yet another mechanism to do the same thing.
>>>>>>>>>>>
>>>>>>>>>>> Question is how is userspace notified about new available core
>>>>>>>>>>> dumps?
>>>>>>>>>>
>>>>>>>>>> I haven't looked into it too closely, as the CrOS userspace
>>>>>>>>>> crash-reporter already had support for devcoredump, so it "just
>>>>>>>>>> worked" out of the box[1].  I believe a udev event is what triggers
>>>>>>>>>> the crash-reporter to go read the devcore dump out of sysfs.
>>>>>>>>>
>>>>>>>>> I had a quick look at the devcoredump code, and it doesn't look like
>>>>>>>>> that is sending an event to the user, so we still need an event to
>>>>>>>>> indicate a GPU reset.
>>>>>>>>
>>>>>>>> There definitely is an event to userspace, I suspect somewhere down
>>>>>>>> the device_add() path?
>>>>>>>>
>>>>>>>
>>>>>>> Let me check that out as well, hope that is not due to a driver-private
>>>>>>> event for GPU reset, coz I think I have seen some of those in a few DRM
>>>>>>> drivers.
>>>>>>>
>>>>>>
>>>>>> Definitely no driver private event for drm/msm .. I haven't dug
>>>>>> through it all but this is the collector for devcoredump, triggered
>>>>>> somehow via udev.  Most likely from event triggered by device_add()
>>>>>>
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fchromium.googlesource.com%2Fchromiumos%2Fplatform2%2F%2B%2FHEAD%2Fcrash-reporter%2Fudev_collector.cc&data=04%7C01%7Cshashank.sharma%40amd.com%7C3b5c0e8744234962061d08da02d00248%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637825389694363926%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=mWI1Z%2B8eMJApePc5ajRipbGUG9Cw9wXf2FCw6NQxVaM%3D&reserved=0
>>>>>>
>>>>>
>>>>> Yes, that is correct. the uevent for devcoredump is from device_add()
>>>>>
>>>> Yes, I could confirm in the code that device_add() sends a uevent.
>>>>
>>>> kobject_uevent(&dev->kobj, KOBJ_ADD);
>>>>
>>>> I was trying to map the ChromiumOs's udev event rules with the event
>>>> being sent from device_add(), what I could see is there is only one udev
>>>> rule for any DRM subsystem events in ChromiumOs's 99-crash-reporter.rules:
>>>>
>>>> ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \
>>>>      RUN+="/sbin/crash_reporter
>>>> --udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"
>>>>
>>>> Can someone confirm that this is the rule which gets triggered when a
>>>> devcoredump is generated ? I could not find an ERROR=1 string in the
>>>> env[] while sending this event from dev_add();
>>>
>>> I think it is actually this rule:
>>>
>>> ACTION=="add", SUBSYSTEM=="devcoredump", \
>>>     RUN+="/sbin/crash_reporter
>>> --udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"
>>>
>>> It is something non-drm specific because it supports devcore dumps
>>> from non drm drivers.  I know at least some of the wifi and remoteproc
>>> drivers use it.
>>>
>>
>> Ah, this seems like a problem for me. I understand it will work for a
>> reset/recovery app well, but if a DRM client (like a compositor), who
>> wants to listen only to DRM events (like a GPU reset), wouldn't this
>> create a lot of noise for it ? Like every time any subsystem produces
>> this coredump, there will be a change in devcoresump subsystem, and the
>> client will have to parse the core file, and then will have to decide if
>> it wants to react to this, or ignore.
>>
>> Wouldn't a GPU reset event, specific to DRM subsystem server better in
>> such case ?
>>
> 
> So, I suppose there are two different use-cases.. for something like
> distro which has generic crash telemetry (ie. when users opt in to
> automated crash reporting), and in general for debugging gpu crashes,
> you want devcoredump, preferably with plenty of information about gpu
> state, etc, so you actually have a chance of debugging problems you
> can't necessarily reproduce locally.  Note also that mesa CI has some
> limited support for collecting devcore dumps if a CI run triggers a
> GPU fault.
> 
> For something like just notifying a compositor that a gpu crash
> happened, perhaps drm_event is more suitable.  See
> virtio_gpu_fence_event_create() for an example of adding new event
> types.  Although maybe you want it to be an event which is not device
> specific.  This isn't so much of a debugging use-case as simply
> notification.

Exactly, in fact that should have been a part of my cover-letter. May be 
I will send a V2 with better doc.

- Shashank

> 
> BR,
> -R