[RFC PATCH 0/1] Add driver load error injection

Jani Nikula jani.nikula at linux.intel.com
Tue Aug 13 10:47:26 UTC 2024


On Fri, 09 Aug 2024, Matthew Brost <matthew.brost at intel.com> wrote:
> Start porting over driver load error injection from i915. Eventually
> the idea would be to make this error injection a bit more generic (drm
> level, or kernel level), but to ensure a stable driver we are starting
> with the i915 implementation.
>
> Not complete as many more injection points need to be added.

Please also bolt this into __i915_inject_probe_error() in
display/ext/i915_utils.c, exercising all the display error handling with
xe too.

BR,
Jani.


>
> Can be tested with:
> for i in {1..200}; do echo "Run $i"; modprobe xe inject_driver_load_error=$i; rmmod xe; done
>
> Will need a version of this series [1] to avoid lockdep turning off
> after 30ish module loads.
>
> Kernel is currently blowing up on injection point #11 on TGL w/o
> display, will need to start debugging there. Stack trace below.
>
> [  196.326118] Setting dangerous option inject_driver_load_error - tainting kernel
> [  196.328408] xe 0000:00:02.0: vgaarb: deactivate vga console
> [  196.328975] xe 0000:00:02.0: [drm:xe_pci_probe [xe]] TIGERLAKE  9a49:0001 dgfx:0 gfx:Xe_LP (12.00) media:Xe_M (12.00) display:no dma_m_s:39 tc:1 gscfi:0 cscfi:0
> [  196.329016] xe 0000:00:02.0: [drm:xe_pci_probe [xe]] Stepping = (G:B0, M:B0, D:D0, B:**)
> [  196.329039] xe 0000:00:02.0: [drm:xe_pci_probe [xe]] SR-IOV support: no (mode: none)
> [  196.330746] xe 0000:00:02.0: [drm] Using GuC firmware from i915/tgl_guc_70.bin version 70.30.0
> [  196.331047] xe 0000:00:02.0: [drm] Injecting failure -19 at checkpoint 11 [xe_guc_log_init:98]
> [  196.331050] xe 0000:00:02.0: [drm] *ERROR* GT0: GuC init failed with -ENODEV
> [  196.338208] xe 0000:00:02.0: [drm] *ERROR* GT0: Failed to initialize uC (-ENODEV)
> [  196.347009] BUG: unable to handle page fault for address: 000000000000a188
> [  196.353903] #PF: supervisor write access in kernel mode
> [  196.359138] #PF: error_code(0x0002) - not-present page
> [  196.364289] PGD 0 P4D 0
> [  196.366842] Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
> [  196.371735] CPU: 6 UID: 0 PID: 1233 Comm: modprobe Tainted: G     U             6.11.0-rc2-xe+ #3796
> [  196.380875] Tainted: [U]=USER
> [  196.383857] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U DDR4 SODIMM RVP, BIOS TGLSFWI1.R00.3243.A01.2006102133 06/10/2020
> [  196.397237] RIP: 0010:xe_mmio_write32+0x67/0x290 [xe]
> [  196.402332] Code: 48 0f a3 05 c3 c9 5b e2 0f 82 c6 00 00 00 45 89 e6 41 c1 ee 18 41 f7 c4 00 00 00 40 74 7f 45 84 f6 78 74 49 8b 47 28 48 01 c3 <44> 89 2b 48 83 c4 58 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc
> [  196.421085] RSP: 0018:ffffc9000152b820 EFLAGS: 00010006
> [  196.426322] RAX: 0000000000000000 RBX: 000000000000a188 RCX: 0000000000000000
> [  196.433466] RDX: 0000000000010001 RSI: ffffffff82426f19 RDI: ffffffff824343c6
> [  196.440608] RBP: ffff888152678028 R08: 00000000000d6398 R09: 0000000000000001
> [  196.447748] R10: 00000000ffffffff R11: ffff888152628000 R12: 000000000000a188
> [  196.454893] R13: 0000000000010001 R14: 0000000000000000 R15: ffff88815262a308
> [  196.462037] FS:  00007ff3ae103000(0000) GS:ffff88849fb80000(0000) knlGS:0000000000000000
> [  196.470137] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  196.475893] CR2: 000000000000a188 CR3: 0000000156d2a004 CR4: 0000000000f70ef0
> [  196.483036] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  196.490177] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  196.497323] PKRU: 55555554
> [  196.500051] Call Trace:
> [  196.502517]  <TASK>
> [  196.504640]  ? __die+0x1f/0x70
> [  196.507719]  ? page_fault_oops+0x155/0x470
> [  196.511831]  ? stack_trace_save+0x49/0x70
> [  196.515861]  ? do_user_addr_fault+0x63/0x720
> [  196.520151]  ? exc_page_fault+0x63/0x1d0
> [  196.524091]  ? asm_exc_page_fault+0x26/0x30
> [  196.528293]  ? xe_mmio_write32+0x67/0x290 [xe]
> [  196.532777]  xe_force_wake_get+0xc8/0x2b0 [xe]
> [  196.537260]  ? lock_acquire+0xcd/0x300
> [  196.541031]  xe_gt_tlb_invalidation_ggtt+0xa8/0x310 [xe]
> [  196.546380]  ? rcu_is_watching+0x11/0x50
> [  196.550322]  ? __mutex_lock+0x12f/0xd70
> [  196.554179]  ? find_held_lock+0x2b/0x80
> [  196.558031]  ? xe_ggtt_remove_node+0xbf/0xf0 [xe]
> [  196.562772]  xe_ggtt_invalidate+0x19/0x80 [xe]
> [  196.567251]  xe_ggtt_remove_node+0xdf/0xf0 [xe]
> [  196.571818]  xe_ttm_bo_destroy+0x11a/0x220 [xe]
> [  196.576388]  drm_managed_release+0xb0/0x160
> [  196.580593]  devm_drm_dev_init_release+0x54/0x70
> [  196.585232]  release_nodes+0x2e/0xf0
> [  196.588827]  devres_release_all+0x8a/0xc0
> [  196.592858]  device_unbind_cleanup+0x9/0x70
> [  196.597058]  really_probe+0x1a0/0x380
> [  196.600740]  __driver_probe_device+0x73/0x150
> [  196.605108]  driver_probe_device+0x19/0x90
> [  196.609222]  __driver_attach+0xd5/0x1d0
> [  196.613073]  ? __pfx___driver_attach+0x10/0x10
> [  196.617534]  bus_for_each_dev+0x77/0xd0
> [  196.621389]  bus_add_driver+0x110/0x240
> [  196.625238]  driver_register+0x5b/0x110
> [  196.629086]  xe_init+0x3b/0x80 [xe]
> [  196.632615]  ? __pfx_xe_init+0x10/0x10 [xe]
> [  196.636829]  do_one_initcall+0x5e/0x2b0
> [  196.640683]  ? rcu_is_watching+0x11/0x50
> [  196.644622]  ? __kmalloc_cache_noprof+0x24e/0x2f0
> [  196.649343]  do_init_module+0x5f/0x210
> [  196.653113]  init_module_from_file+0x86/0xd0
> [  196.657402]  idempotent_init_module+0x17c/0x230
> [  196.661946]  __x64_sys_finit_module+0x59/0xb0
> [  196.666323]  do_syscall_64+0x68/0x140
> [  196.670006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Matt
>
> [1] https://patchwork.freedesktop.org/series/136701/
>
>
> Matthew Brost (1):
>   drm/xe: Add driver load error injection
>
>  drivers/gpu/drm/xe/xe_device.c       | 31 ++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_device.h       | 15 ++++++++++++++
>  drivers/gpu/drm/xe/xe_device_types.h |  4 ++++
>  drivers/gpu/drm/xe/xe_gt.c           |  5 +++++
>  drivers/gpu/drm/xe/xe_gt_sriov_pf.c  |  4 ++++
>  drivers/gpu/drm/xe/xe_guc.c          |  8 +++++++
>  drivers/gpu/drm/xe/xe_guc_ads.c      |  5 +++++
>  drivers/gpu/drm/xe/xe_guc_ct.c       |  4 ++++
>  drivers/gpu/drm/xe/xe_guc_log.c      |  5 +++++
>  drivers/gpu/drm/xe/xe_mmio.c         |  5 +++++
>  drivers/gpu/drm/xe/xe_module.c       |  5 +++++
>  drivers/gpu/drm/xe/xe_module.h       |  3 +++
>  drivers/gpu/drm/xe/xe_pci.c          |  9 ++++++++
>  drivers/gpu/drm/xe/xe_pm.c           |  8 +++++++
>  drivers/gpu/drm/xe/xe_sriov.c        |  8 ++++++-
>  drivers/gpu/drm/xe/xe_tile.c         |  4 ++++
>  drivers/gpu/drm/xe/xe_uc.c           |  4 ++++
>  drivers/gpu/drm/xe/xe_wa.c           |  5 +++++
>  drivers/gpu/drm/xe/xe_wopcm.c        |  4 ++++
>  19 files changed, 135 insertions(+), 1 deletion(-)
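
For readers not familiar with checkpoint-style probe error injection, here
is a minimal userspace sketch of the general idea: every injection point
bumps a counter, and the one whose number matches the requested value
returns an error, so a loop over 1..N walks each unwind path in turn. This
is only an illustration under stated assumptions, not the code from the
patch or from i915. All names in it (inject_at, INJECT_ERROR, step_init,
step_fini, run_probe) are hypothetical.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for a module parameter such as
 * inject_driver_load_error: the checkpoint number to fail at, 0 = disabled.
 */
static int inject_at;
static int checkpoints_seen;	/* injection points passed so far */
static int steps_live;		/* init steps currently set up */

/* Return err at the requested checkpoint, 0 at every other one. */
static int inject_error(int err, const char *func, int line)
{
	if (inject_at <= 0)
		return 0;

	if (++checkpoints_seen != inject_at)
		return 0;

	fprintf(stderr, "Injecting failure %d at checkpoint %d [%s:%d]\n",
		err, inject_at, func, line);
	inject_at = 0;		/* fire only once per load attempt */
	return err;
}

#define INJECT_ERROR(err) inject_error((err), __func__, __LINE__)

/* One generic init step with an injection point at its entry. */
static int step_init(void)
{
	int err = INJECT_ERROR(-ENODEV);

	if (err)
		return err;

	steps_live++;
	return 0;
}

static void step_fini(void)
{
	steps_live--;
}

/* Three-step probe with the usual goto-based unwind. */
static int probe(void)
{
	int err;

	err = step_init();
	if (err)
		return err;

	err = step_init();
	if (err)
		goto undo_first;

	err = step_init();
	if (err)
		goto undo_second;

	return 0;

undo_second:
	step_fini();
undo_first:
	step_fini();
	return err;
}

/* Simulate one "modprobe" with injection requested at checkpoint n. */
int run_probe(int n)
{
	inject_at = n;
	checkpoints_seen = 0;
	steps_live = 0;
	return probe();
}
```

In this sketch, calling run_probe(n) for n = 1..3 fails at a different step
each time and should leave steps_live at 0 if the unwind is correct; a value
of n larger than the number of checkpoints loads normally. The sketch only
models the counting and the unwind; it says nothing about the cause of the
crash in the trace above.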

-- 
Jani Nikula, Intel

