[PATCH v6 0/8] drm/xe/guc: Add GuC based register capture for error capture

Dong, Zhanjun zhanjun.dong at intel.com
Wed Mar 27 17:46:42 UTC 2024



On 2024-03-26 11:52 a.m., Souza, Jose wrote:
> On Tue, 2024-03-26 at 11:40 -0400, Dong, Zhanjun wrote:
>> See my comments inline below.
>>
>> Regards,
>> Zhanjun
>> On 2024-03-22 5:36 p.m., Dong, Zhanjun wrote:
>>> See my comments below.
>>>
>>> Regards,
>>> Zhanjun
>>>
>>> On 2024-03-22 5:28 p.m., Souza, Jose wrote:
>>>> On Fri, 2024-03-22 at 11:37 -0700, José Roberto de Souza wrote:
>>>>> On Fri, 2024-03-22 at 09:57 -0700, José Roberto de Souza wrote:
>>>>>> On Tue, 2024-03-19 at 07:36 -0700, Zhanjun Dong wrote:
>>>>>>> Port GuC based register capture for error capture from i915 to Xe.
>>>>>>>
>>>>>>> There are 3 parts inside:
>>>>>>> . Prepare for capture registers
>>>>>>>       There is a bo create at guc ads init time, that is very early
>>>>>>>       and engi  ne map is not ready, make it hard to calculate the
>>>>>>>       capture buffer size, new function created for worst case size
>>>>>>>       caluation. Other than that, this part basically follows the i915
>>>>>>>       design.
>>>>>>> . Process capture notification message
>>>>>>>       Basically follows i915 design
>>>>>>> . Sysfs command process.
>>>>>>>       Xe switched to devcoredump, adopted command line process with
>>>>>>>       captured node list.
>>>>>>
>>>>>> Just some notes after trying this to debug a issue with this series:
>>>>>>
>>>>>> - CONFIG_DRM_XE_CAPTURE_ERROR is not a good name for a config to
>>>>>> enable GuC based capture.
>>>>>>      - in my opinion this should not even exist, always capture with
>>>>>> GuC if xe_device_uc_enabled() is true.
>>
>> 3 reasons:
>> 1. This GuC based register error capture is for debug purpose, does it
>> still important for production stage? I want to leave it an option for
>> users who don't need it.
> 
> definitely needed in production
> 
>> 2. With this opiton disabled, extra memory could be saved. It might not
>> be a big deal for PC, while might be valuable for customers who want to
>> save few 10KBs to 100KB RAM usage.
> 
> not relevant amount, having a accurate register dump is more important as allow us to debug issues reproduced by costumers.
> 
>> 3. The feature with an opiton is what we have in i915, where the feature
>> is ported from. So make it the same style to avoid users migrated from
>> i915 complains about it.

Although I still think for users need it they can keep the config option 
enabled, but after review of i915 code, the kconfig option is not bring 
in by the feature. Thus no need to add the option here.
I will make another rev with this option removed.
>>
>>>>>>
>>>>>> - coredump don't have some registers like SAMPLER_INSTDONE,
>>>>>> ROW_INSTDONE, XEHPG_INSTDONE_GEOM_SVG...
>>>
>>> Yes, those steering register print out is missing.
>>>
>>>>>> Here the output in LNL:
>>>>>
>>>>> ah this could explain...
>>>>>
>>>>> [  152.189386] xe 0000:00:02.0:
>>>>> [drm:xe_hw_engine_snapshot_from_capture [xe]] GT0: GuC error capture
>>>>> is empty, take snapshot from engine.
>>>>> [  152.190215] xe 0000:00:02.0: [drm] Xe device coredump has been
>>>>> created
>>>>>
>>>>> but both code paths needs to dump the same registers
>>>
>>> I found you have another patch:
>>> drm/xe/devcoredump: Print errno if VM snapshot was not captured
>>> I will take a look.
>>>
>>>>
>>>> Hi Zhanjun
>>>>
>>>> Mesa team needs the INSTDONE registers to debug issues and I have
>>>> patches implementing the dump of this registers.
>>>> Do you mind if I send this patches the list?
>>>
>>> Sure, please send it.

Since you already send the INSTONE patch, my next revision will not 
conver this part and will get merged with your patch later.

>>>
>>>>
>>>> I guess this GuC based register capture will need another version and
>>>> someone with experience on GuC to review, so it will take a while...
>>>>
>>>>>
>>>>>>
>>>>>> **** HW Engines ****
>>>>>> rcs0 (physical), logical instance=0
>>>>>>      Forcewake: domain 0x2, ref 1
>>>>>>      HWSTAM: 0xffffffff
>>>>>>      RING_HWS_PGA: 0x01693000
>>>>>>      RING_START: 0x029d5000
>>>>>>      RING_HEAD: 0x00001914
>>>>>>      RING_TAIL: 0x00001968
>>>>>>      RING_CTL: 0x00003001
>>>>>>      RING_MI_MODE: 0x00001000
>>>>>>      RING_MODE: 0x00000008
>>>>>>      RING_IMR: 0x00000000
>>>>>>      RING_ESR: 0x00000000
>>>>>>      RING_EMR: 0xffffffff
>>>>>>      RING_EIR: 0x00000000
>>>>>>      IPEHR: 0x7a000004
>>>>>>      ACTHD: 0x0000effeffff9208
>>>>>>      RING_BBADDR: 0x0000effeffff9209
>>>>>>      RING_DMA_FADD: 0x00000000029d6968
>>>>>>      RING_EXECLIST_STATUS: 0x000000004000279c
>>>>>>      RING_EXECLIST_SQ_CONTENTS: 0x00000000029d9719
>>>>>>
>>>>>> - a lot of style issues, see 'CI.checkpatch'
>>
>> After review the CI.checkpatch output, all style issues reported is by
>> purpose.
>>
>>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Zhanjun Dong <zhanjun.dong at intel.com>
>>>>>>> Cc: Alan Previn <alan.previn.teres.alexis at intel.com>
>>>>>>>
>>>>>>> Changes from prior revs:
>>>>>>>     v6:-  Change hardcoded register snapshot fill to follow mapping
>>>>>>> tables
>>>>>>>           When capture is empty, take snapshot from engine
>>>>>>>     v5:-  Split dss helper code out as an standalone patch
>>>>>>>           Remove old platform registers definition.
>>>>>>>           Split register map table to 32 and 64bit each
>>>>>>>     v4:-  Move register map table to xe_hw_engine.c
>>>>>>>     v3:-  Remove condition compilation in code
>>>>>>>     v2:-  Split into multiple chunks
>>>>>>>
>>>>>>> Zhanjun Dong (8):
>>>>>>>     drm/xe/guc: Add kconfig for GuC based register capture
>>>>>>>     drm/xe/guc: Update GuC ADS size for error capture
>>>>>>>     drm/xe/guc: Add XE_LP steered register lists
>>>>>>>     drm/xe/guc: Add capture size check in GuC log buffer
>>>>>>>     drm/xe/guc: Check sizing of guc_capture output
>>>>>>>     drm/xe/guc: Extract GuC error capture lists on G2H notification
>>>>>>>     drm/xe/guc: Pre-allocate output nodes for extraction
>>>>>>>     drm/xe/guc: Plumb GuC-capture into dev coredump
>>>>>>>
>>>>>>>    drivers/gpu/drm/xe/Kconfig               |   11 +
>>>>>>>    drivers/gpu/drm/xe/Makefile              |    1 +
>>>>>>>    drivers/gpu/drm/xe/abi/guc_actions_abi.h |    7 +
>>>>>>>    drivers/gpu/drm/xe/regs/xe_gt_regs.h     |    5 +
>>>>>>>    drivers/gpu/drm/xe/xe_gt_printk.h        |    3 +
>>>>>>>    drivers/gpu/drm/xe/xe_guc.c              |    6 +
>>>>>>>    drivers/gpu/drm/xe/xe_guc_ads.c          |  229 +++-
>>>>>>>    drivers/gpu/drm/xe/xe_guc_ads_types.h    |    2 +
>>>>>>>    drivers/gpu/drm/xe/xe_guc_capture.c      | 1332
>>>>>>> ++++++++++++++++++++++
>>>>>>>    drivers/gpu/drm/xe/xe_guc_capture.h      |   21 +
>>>>>>>    drivers/gpu/drm/xe/xe_guc_capture_fwif.h |  221 ++++
>>>>>>>    drivers/gpu/drm/xe/xe_guc_ct.c           |    2 +
>>>>>>>    drivers/gpu/drm/xe/xe_guc_fwif.h         |   68 ++
>>>>>>>    drivers/gpu/drm/xe/xe_guc_log.c          |  179 +++
>>>>>>>    drivers/gpu/drm/xe/xe_guc_log.h          |   15 +
>>>>>>>    drivers/gpu/drm/xe/xe_guc_log_types.h    |   26 +
>>>>>>>    drivers/gpu/drm/xe/xe_guc_submit.c       |   21 +-
>>>>>>>    drivers/gpu/drm/xe/xe_guc_submit.h       |    3 +
>>>>>>>    drivers/gpu/drm/xe/xe_guc_types.h        |    2 +
>>>>>>>    drivers/gpu/drm/xe/xe_hw_engine.c        |  261 ++++-
>>>>>>>    drivers/gpu/drm/xe/xe_hw_engine.h        |    4 +
>>>>>>>    drivers/gpu/drm/xe/xe_hw_engine_types.h  |  117 +-
>>>>>>>    22 files changed, 2421 insertions(+), 115 deletions(-)
>>>>>>>    create mode 100644 drivers/gpu/drm/xe/xe_guc_capture.c
>>>>>>>    create mode 100644 drivers/gpu/drm/xe/xe_guc_capture.h
>>>>>>>    create mode 100644 drivers/gpu/drm/xe/xe_guc_capture_fwif.h
>>>>>>>
>>>>>>
>>>>>
>>>>
> 


More information about the Intel-xe mailing list