Re: ✗ Xe.CI.Full: failure for Update "force_reset" code (rev3)

Fri May 30 14:27:08 UTC 2025


On 30.05.2025 11:31, Patchwork wrote:
> == Series Details ==
> 
> Series: Update "force_reset" code (rev3)
> URL   : https://patchwork.freedesktop.org/series/149607/
> State : failure
> 
> == Summary ==
> 
> CI Bug Log - changes from XEIGT_8384_FULL -> XEIGTPW_13211_FULL
> ====================================================
> 
> Summary
> -------
> 
>   **FAILURE**
> 
>   Serious unknown changes coming with XEIGTPW_13211_FULL absolutely need to be
>   verified manually.
>   
>   If you think the reported changes have nothing to do with the changes
>   introduced in XEIGTPW_13211_FULL, please notify your bug team (I915-ci-infra at lists.freedesktop.org) to allow them
>   to document this new failure mode, which will reduce false positives in CI.
> 
>   
> 
> Participating hosts (4 -> 3)
> ------------------------------
> 
>   Missing    (1): shard-adlp 
> 
> Possible new issues
> -------------------
> 
>   Here are the unknown changes that may have been introduced in XEIGTPW_13211_FULL:
> 
> ### IGT changes ###
> 
> #### Possible regressions ####
> 
>   * igt@xe_exec_reset@parallel-gt-reset:
>     - shard-bmg:          [PASS][1] -> [DMESG-WARN][2]
>    [1]: https://intel-gfx-ci.01.org/tree/intel-xe/IGT_8384/shard-bmg-4/igt@xe_exec_reset@parallel-gt-reset.html
>    [2]: https://intel-gfx-ci.01.org/tree/intel-xe/IGTPW_13211/shard-bmg-1/igt@xe_exec_reset@parallel-gt-reset.html

hmm, quite unexpected

<7>[  307.897676] xe 0000:03:00.0: [drm:pf_queue_work_func [xe]]
                  	ASID: 101
                  	VFID: 0
                  	PDATA: 0x0450
                  	Faulted Address: 0x00007ba9d02b7000
                  	FaultType: 0
                  	AccessType: 1
                  	FaultLevel: 3
                  	EngineClass: 1 vcs
                  	EngineInstance: 0
<7>[  307.898010] xe 0000:03:00.0: [drm:pf_queue_work_func [xe]] Fault
response: Unsuccessful -22
<6>[  307.898235] xe 0000:03:00.0: [drm] GT1: reset done
<7>[  307.898301] xe 0000:03:00.0: [drm:pf_queue_work_func [xe]]
                  	ASID: 101
                  	VFID: 0
                  	PDATA: 0x0451
                  	Faulted Address: 0x0000793b1cacc000
                  	FaultType: 0
                  	AccessType: 0
                  	FaultLevel: 3
                  	EngineClass: 1 vcs
                  	EngineInstance: 2
<7>[  307.898523] xe 0000:03:00.0: [drm:xe_hw_engine_snapshot_capture
[xe]] GT1: Proceeding with manual engine snapshot
<7>[  307.898598] xe 0000:03:00.0: [drm:pf_queue_work_func [xe]] Fault
response: Unsuccessful -22
<7>[  307.898620] xe 0000:03:00.0:
[drm:xe_guc_exec_queue_memory_cat_error_handler [xe]] GT1: Engine memory
cat error: engine_class=vcs, logical_mask: 0x1, guc_id=2
<7>[  307.899910] xe 0000:03:00.0:
[drm:xe_guc_exec_queue_memory_cat_error_handler [xe]] GT1: Engine memory
cat error: engine_class=vcs, logical_mask: 0x1, guc_id=3


given that the only difference with this patch is how we trigger the
reset. Before, it was the "show" op:

<6> [207.735904] [IGT] xe_exec_reset: starting subtest parallel-gt-reset
<6> [207.773630] xe 0000:03:00.0: [drm] GT1: trying reset from
force_reset_show [xe]

now it's the "write" op:

<6> [307.847469] [IGT] xe_exec_reset: starting subtest parallel-gt-reset
<6> [307.873879] xe 0000:03:00.0: [drm] GT1: trying reset from
force_reset_write [xe]
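
To compare the two runs without eyeballing dmesg, something like the
sketch below could pull the reset trigger out of each log. It assumes
only the "trying reset from <caller>" message format visible in the
excerpts above; the helper name reset_triggers is made up for
illustration and is not part of IGT or the CI tooling.

```python
import re

# Matches the xe reset-trigger message seen in the excerpts above.
# \s+ between "from" and the caller tolerates the mail client's line wrap.
RESET_RE = re.compile(r"GT(\d+): trying reset from\s+(\w+)")


def reset_triggers(log_text):
    """Return (gt, caller) pairs for every reset request found in the log."""
    return [(int(gt), caller) for gt, caller in RESET_RE.findall(log_text)]


# Log lines copied from the two runs quoted above.
before = """<6> [207.773630] xe 0000:03:00.0: [drm] GT1: trying reset from
force_reset_show [xe]"""
after = """<6> [307.873879] xe 0000:03:00.0: [drm] GT1: trying reset from
force_reset_write [xe]"""

print(reset_triggers(before))  # [(1, 'force_reset_show')]
print(reset_triggers(after))   # [(1, 'force_reset_write')]
```

Running it over the full dmesg of both runs would confirm that the
trigger path is the only thing that differs at the point of the reset.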


> 
>   * igt@xe_pm@s4-d3hot-basic-exec:
>     - shard-bmg:          [PASS][3] -> [ABORT][4]
>    [3]: https://intel-gfx-ci.01.org/tree/intel-xe/IGT_8384/shard-bmg-6/igt@xe_pm@s4-d3hot-basic-exec.html
>    [4]: https://intel-gfx-ci.01.org/tree/intel-xe/IGTPW_13211/shard-bmg-6/igt@xe_pm@s4-d3hot-basic-exec.html

unrelated to the xe driver, this is a lockdep splat from the r8152
network driver's resume path:

<4> [373.000446] ======================================================
<4> [373.000460] WARNING: possible circular locking dependency detected
<4> [373.000475] 6.15.0-xe+ #1 Tainted: G     U           N
<4> [373.000490] ------------------------------------------------------
<4> [373.000503] kworker/u64:66/5057 is trying to acquire lock:
<4> [373.000518] ffffffff838b45a8 (rtnl_mutex){+.+.}-{3:3}, at:
rtnl_lock+0x17/0x30
<4> [373.000559]
but task is already holding lock:
<4> [373.000572] ffff8881149a3438 (&tp->control){+.+.}-{3:3}, at:
rtl8152_resume+0x26/0xd0 [r8152]
<4> [373.000612]
which lock already depends on the new lock.