[PATCH 1/2] drm/xe: Improve devcoredump documentation

Fri Nov 1 18:39:59 UTC 2024

On 11/1/2024 08:07, Raag Jadav wrote:
> On Fri, Nov 01, 2024 at 07:44:37AM -0500, Lucas De Marchi wrote:
>> On Fri, Nov 01, 2024 at 07:47:54AM +0200, Raag Jadav wrote:
>>> On Thu, Oct 31, 2024 at 11:29:15AM -0700, Lucas De Marchi wrote:
>>>
>>> ...
>>>
>>>> - * Snapshot at hang:
>>>> - * The 'data' file is printed with a drm_printer pointer at devcoredump read
>>>> - * time. For this reason, we need to take snapshots from when the hang has
>>>> - * happened, and not only when the user is reading the file. Otherwise the
>>>> - * information is outdated since the resets might have happened in between.
>>>> + * The following characteristics are observed by xe when creating a device
>>>> + * coredump:
>>>>    *
>>>> - * 'First' failure snapshot:
>>>> - * In general, the first hang is the most critical one since the following hangs
>>>> - * can be a consequence of the initial hang. For this reason we only take the
>>>> - * snapshot of the 'first' failure and ignore subsequent calls of this function,
>>>> - * at least while the coredump device is alive. Dev_coredump has a delayed work
>>>> - * queue that will eventually delete the device and free all the dump
>>>> - * information.
>>>> + * **Snapshot at hang**:
>>>> + *   The 'data' file contains a snapshot of the HW state at the time the hang
>>>> + *   happened. Due to the driver recovering from resets/crashes, it may not
>>>> + *   correspond to the state of when the file is read by userspace.
>>> Does that mean the devcoredump will be present even after a successful recovery?
>> yes.... if it's not successful then it's moved to the wedged state. Easy
>> way to test is running this:
>>
>> 	xe_exec_threads --r threads-hang-basic
>>
>> You should see something like this in your dmesg:
>>
>> 	[IGT] xe_exec_threads: starting subtest threads-hang-basic
>> 	xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=rcs, logical_mask: 0x1, guc_id=34
>> 	xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=bcs, logical_mask: 0x1, guc_id=32
>> 	xe 0000:00:02.0: [drm] GT1: Engine reset: engine_class=vcs, logical_mask: 0x1, guc_id=18
>> 	xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=34, flags=0x0 in xe_exec_threads [2636]
>> 	xe 0000:00:02.0: [drm] GT1: Engine reset: engine_class=vecs, logical_mask: 0x1, guc_id=17
>> 	xe 0000:00:02.0: [drm] GT1: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=18, flags=0x0 in xe_exec_threads [2636]
>> 	xe 0000:00:02.0: [drm] Xe device coredump has been created
>> -->	xe 0000:00:02.0: [drm] Check your /sys/class/drm/card0/device/devcoredump/data
>> 	xe 0000:00:02.0: [drm] GT1: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=17, flags=0x0 in xe_exec_threads [2636]
>> 	xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=32, flags=0x0 in xe_exec_threads [2636]
>> 	xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=ccs, logical_mask: 0x1, guc_id=27
>> 	xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=27, flags=0x0 in xe_exec_threads [2636]
>> 	[IGT] xe_exec_threads: finished subtest threads-hang-basic, SUCCESS
>>
>>
>> If you run it again, it won't overwrite the previous dump, until user
>> cleans the previous dump or the timeout on the kernel side fires to
>> release it.
> Yes, which I think we're covering at a later point in "First failure only".
> So maybe establishing the mechanism itself before explaining reset/recovery
> would be a bit neater...
>
>> From a distro-integration pov, I think it should have a udev rule that
>> fires when a devcoredump is created so the dump is copied to persistent
>> storage. Just like it happens with cpu coredump (see systemd-coredump)
>>
>>> Perhaps moving the 'release' part to above paragraph will add required context.
>> not sure I follow. Are you suggesting to swap the order of "First
>> failure only" and "Snapshot at hang" ?
> ... in whichever way you think is best.
Note that 'snapshot at hang' and 'first failure only' are totally 
separate concepts. And neither explains the release mechanism. Reversing 
the order of the descriptions would be incorrect, IMHO.

The point of 'snapshot at hang' is to say that the universe continues 
existing after the snapshot is taken. It is not just that the driver 
recovers but that it keeps processing new work. In an active system, it 
is extremely unlikely the system state (hardware or software) would 
match what is in the snapshot by the time the user is able to read the 
snapshot out. That has nothing to do with when or if the snapshot is 
released, nor with how many snapshots are taken.

The point of 'first failure only' is that only one snapshot is held at
a time. If there are multiple back-to-back hangs then only the first
will generate a snapshot. Further snapshots will only be created for new
hangs after the existing snapshot has been 'released'. I don't see any
mention of how to release the snapshot, though. It would be good to add
a quick comment about that.
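
For reference, the release can be sketched from userspace as below. The
sketch assumes the generic devcoredump behaviour, where any write to the
'data' node frees the dump and the kernel otherwise drops it when its own
timeout expires. save_and_release_dump is a hypothetical helper name, not
something provided by xe or IGT, and the destination path is up to the
caller.

```shell
#!/bin/sh
# Sketch only: save a devcoredump to persistent storage, then release it.
# save_and_release_dump is a hypothetical helper, not part of xe or IGT.
save_and_release_dump() {
	data=$1		# e.g. /sys/class/drm/card0/device/devcoredump/data
	dest=$2		# destination file on persistent storage (assumed writable)

	# No 'data' node means no snapshot is currently held.
	[ -e "$data" ] || return 1

	# Copy the snapshot out first; once released it cannot be read again.
	cat "$data" > "$dest" || return 1

	# Any write to 'data' asks the devcoredump core to free the dump,
	# after which the next hang can produce a new snapshot.
	echo 1 > "$data"
}
```

Something like this could be run by hand after a hang, or hooked up to the
udev rule suggested earlier in the thread, so that the snapshot is copied
out before the next hang needs the slot.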

John.

>
>>>> + * **First failure only**:
>>>> + *   In general, the first hang is the most critical one since the following
>>>> + *   hangs can be a consequence of the initial hang. For this reason a snapshot
>>>> + *   is taken only for the first failure. Until the devcoredump is released by
>>>> + *   userspace or kernel, all subsequent hangs do not override the snapshot nor
>>>> + *   create new ones. Devcoredump has a delayed work queue that will eventually
>>>> + *   delete the file node and free all the dump information.
> Raag