[PATCH 3/3] drm/xe: Drop duplicated information about GT tile in devcoredump

John Harrison john.c.harrison at intel.com
Thu Jan 23 20:59:47 UTC 2025


On 1/23/2025 11:35, Souza, Jose wrote:
> On Thu, 2025-01-23 at 11:27 -0800, John Harrison wrote:
>> On 1/23/2025 10:56, Souza, Jose wrote:
>>> On Thu, 2025-01-23 at 10:30 -0800, John Harrison wrote:
>>>> On 1/23/2025 10:24, Souza, Jose wrote:
>>>>> On Thu, 2025-01-23 at 10:18 -0800, John Harrison wrote:
>>>>>> On 1/23/2025 09:59, José Roberto de Souza wrote:
>>>>>>> The GT tile information is already available but was added again in commit
>>>>>>> c28fd6c358db ("drm/xe/devcoredump: Improve section headings and add tile info"),
>>>>>>> so delete the duplicate here.
>>>>>>>
>>>>>>> Here is a devcoredump example with the duplicated information:
>>>>>>>
>>>>>>> **** Xe Device Coredump ****
>>>>>>> Reason: Timedout job - seqno=4294967170, lrc_seqno=4294967170, guc_id=13, flags=0x0
>>>>>>> kernel: 6.13.0-zeh-xe+
>>>>>>> module: xe
>>>>>>> Snapshot time: 1737573530.243521319
>>>>>>> Uptime: 2588.041930284
>>>>>>> Process: deqp-vk [8850]
>>>>>>> PCI ID: 0x64a0
>>>>>>> PCI revision: 0x04
>>>>>>> GT id: 0
>>>>>>> 	Tile: 0
>>>>>>> 	Type: main
>>>>>>> 	IP ver: 20.4.4
>>>>>>> 	CS reference clock: 19200000
>>>>>>> GT id: 1
>>>>>>> 	Tile: 0
>>>>>>> 	Type: media
>>>>>>> 	IP ver: 20.0.4
>>>>>>> 	CS reference clock: 19200000
>>>>>> This is an overview of all GTs/tiles in the device within the global
>>>>>> section.
>>>>>>
>>>>>>> **** GT #0 ****
>>>>>> This is a section header telling you that everything which follows is
>>>>>> inside GT0.
>>>>>>
>>>>>> It is not duplicated information. And if you remove it then the
>>>>>> information for all GTs runs back to back with no indication of which
>>>>>> GT it actually belongs to.
>>>>> Can't you get this information from Name + class + instance? If not, that should be placed in one of the sections below.
>>>> No. I was trying to do that originally and determined it was impossible
>>>> to do reliably for all current and future platforms. Hence the section
>>>> header was added.
>>>>
>>>> It is also much, much, much better to be explicit about debug
>>>> information than force people to guess based on heuristics or tribal
>>>> knowledge.
>>>>
>>>>> So this should be placed in one of these sections:
>>>> No. It is not information within a context or within a hardware engine.
>>>> It is the reverse. All the following contexts, hardware engines, etc.
>>>> are within the GT.
>>>>
>>>> And when you have multiple GTs in a single dump then you need a definite
>>>> delimiter to say that all the following is now in a different GT from
>>>> what came before.
>>> Can't I do an exec on an exec_queue created over engines in different tiles?
>>> I think we are allowed to, so having tile and GT in '**** HW Engines ****' would be better.
>> Not that I am aware of. The whole point of the 'hw engine' section is
>> that it is related to a command streamer. And those cannot bridge GTs.
> No, I mean this:
>
> - create an exec_queue with instances in CCS0 (tile 0, GT 0) + CCS5 (tile 1, GT 1).
There is a difference between a gem context (are they still called that 
in Xe?) and a hardware context. It might be possible to create a 
top-level context that operates across multiple engines. But underneath, 
that connects to multiple hw contexts, one per hardware engine.
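
From memory (so the names below are approximate, not the exact upstream 
definition), the driver side reflects that split - one queue fanning out 
to multiple per-engine hardware contexts:

	/* Rough sketch from memory - approximate names, not verbatim xe code. */
	struct xe_exec_queue {
		struct xe_gt *gt;	/* GT this queue submits to */
		u16 width;		/* number of engine placements */
		struct xe_lrc *lrc[];	/* one hw context (LRC) per placement */
	};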

For certain, the HW engine dump (xe_engine_snapshot_print) starts with 
'gt = snapshot->hwe->gt'. The whole thing is very definitely contained 
within a GT.
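
Trimmed right down to the relevant bit, it is effectively this (a 
sketch, not the verbatim function):

	/* Heavily trimmed sketch of xe_engine_snapshot_print(), not verbatim. */
	void xe_engine_snapshot_print(struct xe_hw_engine_snapshot *snapshot,
				      struct drm_printer *p)
	{
		struct xe_gt *gt = snapshot->hwe->gt;	/* one engine -> exactly one GT */

		/* ... every register value printed below comes from this
		 * single GT, so the whole section is GT-scoped by construction.
		 */
	}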

So in your example, you would (hopefully) get two separate hw engine 
dumps - one for each CCS. And each would be within a single GT. 
Although I think that, at the moment, the GPU hang infrastructure is all 
GT-focused as well. So actually, I think you would get the dump from the 
hardware engine that caused the hang and nothing from the other engine 
which did not. I have further patches improving the handling of 
multi-GT coredumps that I haven't gotten around to posting yet.
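
If we do end up annotating the per-engine section with its tile/GT as 
suggested below, it should only need a one-liner along these lines 
(hypothetical and untested - it just mirrors the header lines this 
patch removes):

	/* Hypothetical, untested - mirrors the removed "**** GT #%d ****" header. */
	drm_printf(p, "\tTile: %d, GT: %d\n",
		   snapshot->hwe->gt->tile->id, snapshot->hwe->gt->info.id);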

John.


> - do an exec with one batch buffer for CCS0 + another batch buffer for CCS5
> - one of those hangs
> - devcoredump should dump information for both engines ("xe_engine_snapshot_capture_for_queue(struct xe_exec_queue *q)")
>
> So having tile and GT information in **** HW Engines **** would be better.
>
> But we can work on this after the GuC log part.
>
>> John.
>>
>>>> John.
>>>>
>>>>> GuC ID: 13
>>>>> 	Name: ccs13
>>>>> 	Class: 5
>>>>> 	Logical mask: 0x1
>>>>> 	Width: 1
>>>>> 	Ref: 2
>>>>> 	Timeout: 5000 (ms)
>>>>> 	Timeslice: 1000 (us)
>>>>> 	Preempt timeout: 640000 (us)
>>>>> 	HW Context Desc: 0x025e0000
>>>>> 	HW Ring address: 0x025dc000
>>>>> 	HW Indirect Ring State: 0x025e3000
>>>>> 	LRC Head: (memory) 152
>>>>> 	LRC Tail: (internal) 296, (memory) 296
>>>>> 	Ring start: (memory) 0x025dc000
>>>>> 	Start seqno: (memory) -126
>>>>> 	Seqno: (memory) -127
>>>>> 	Timestamp: 0x0000035e
>>>>> 	Job Timestamp: 0x0000035e
>>>>> 	Schedule State: 0x441
>>>>> 	Flags: 0x0
>>>>>
>>>>> **** HW Engines ****
>>>>> ccs0 (physical), logical instance=0
>>>>> 	Capture_source: GuC
>>>>> 	Coverage: full-capture
>>>>> 	Forcewake: domain 0x2, ref 1
>>>>> 	Reserved: no
>>>>> 	FORCEWAKE_GT: 0x00010000
>>>>> 	RCU_MODE: 0x00000001
>>>>> 	HWSTAM: 0xffffffff
>>>>> 	RING_HWS_PGA: 0x018db000
>>>>> 	RING_HEAD: 0x000000ec
>>>>> 	RING_TAIL: 0x00000128
>>>>> 	RING_CTL: 0x00003001
>>>>> 	RING_MI_MODE: 0x00001000
>>>>> 	RING_MODE: 0x00000008
>>>>> 	RING_ESR: 0x00000000
>>>>> 	RING_EMR: 0xffffffff
>>>>> 	RING_EIR: 0x00000000
>>>>> 	RING_IMR: 0x00000000
>>>>> 	IPEHR: 0x7a000a04
>>>>> 	RING_INSTDONE: 0xffdefffe
>>>>>
>>>>>> John.
>>>>>>
>>>>>>
>>>>>>> 	Tile: 0
>>>>>>>
>>>>>>> **** GuC Log ****
>>>>>>> ....
>>>>>>>
>>>>>>> Cc: John Harrison <John.C.Harrison at Intel.com>
>>>>>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>>>>>> Signed-off-by: José Roberto de Souza <jose.souza at intel.com>
>>>>>>> ---
>>>>>>>      drivers/gpu/drm/xe/xe_devcoredump.c | 3 ---
>>>>>>>      1 file changed, 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>>> index 1c86e6456d60f..2996945ffee39 100644
>>>>>>> --- a/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>>> +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>>> @@ -111,9 +111,6 @@ static ssize_t __xe_devcoredump_read(char *buffer, size_t count,
>>>>>>>      	drm_printf(&p, "Process: %s [%d]\n", ss->process_name, ss->pid);
>>>>>>>      	xe_device_snapshot_print(xe, &p);
>>>>>>>      
>>>>>>> -	drm_printf(&p, "\n**** GT #%d ****\n", ss->gt->info.id);
>>>>>>> -	drm_printf(&p, "\tTile: %d\n", ss->gt->tile->id);
>>>>>>> -
>>>>>>>      	drm_puts(&p, "\n**** GuC Log ****\n");
>>>>>>>      	xe_guc_log_snapshot_print(ss->guc.log, &p);
>>>>>>>      	drm_puts(&p, "\n**** GuC CT ****\n");
