[RFC PATCH 0/1] Add driver load error injection

Matthew Brost matthew.brost at intel.com
Fri Aug 9 22:44:23 UTC 2024


Start porting driver load error injection over from i915. Eventually
the idea would be to make this error injection a bit more generic (DRM
level or kernel level), but to ensure a stable driver, start with the
i915 implementation.

This is not complete; many more injection points still need to be added.

Can be tested with:
for i in {1..200}; do echo "Run $i"; modprobe xe inject_driver_load_error=$i; rmmod xe; done
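
For anyone not familiar with the i915 scheme, here is a rough userspace
sketch of the checkpoint counting that the loop above relies on. It is
not the code from the patch. Every name in it (inject_load_error,
inject_error_at, INJECT_LOAD_ERROR, the fake_* steps, fake_probe) is
hypothetical, and the assumption, based on the "Injecting failure -19 at
checkpoint 11" log line below, is that the module parameter selects the
Nth checkpoint reached during load.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the inject_driver_load_error module parameter. */
static int inject_load_error;
/* Number of checkpoints reached so far during this simulated load. */
static int checkpoints_seen;

/*
 * Return err if this call is the selected checkpoint, 0 otherwise.
 * The parameter is cleared after firing so only one failure is injected.
 */
static int inject_error_at(int err, const char *func, int line)
{
	if (!inject_load_error)
		return 0;
	if (++checkpoints_seen < inject_load_error)
		return 0;

	printf("Injecting failure %d at checkpoint %d [%s:%d]\n",
	       err, inject_load_error, func, line);
	inject_load_error = 0;
	return err;
}

#define INJECT_LOAD_ERROR(err) inject_error_at((err), __func__, __LINE__)

/* Three made-up init steps, each with a single checkpoint. */
static int fake_mmio_init(void)
{
	return INJECT_LOAD_ERROR(-ENOMEM);
}

static int fake_guc_log_init(void)
{
	return INJECT_LOAD_ERROR(-ENODEV);
}

static int fake_uc_init(void)
{
	return INJECT_LOAD_ERROR(-EIO);
}

/* Simulated probe: fail at checkpoint n, or succeed if n is 0 or too large. */
int fake_probe(int n)
{
	int err;

	inject_load_error = n;
	checkpoints_seen = 0;

	err = fake_mmio_init();
	if (err)
		return err;
	err = fake_guc_log_init();
	if (err)
		return err;
	return fake_uc_init();
}
```

Calling fake_probe() with increasing values plays the same role as the
modprobe loop: each value fails a different step, and a value past the
last checkpoint lets the load succeed. What the sketch leaves out is the
interesting part for this RFC, namely whether the unwind after each
injected failure is safe.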

Will need a version of this series [1] to avoid lockdep turning off
after 30ish module loads.

The kernel is currently blowing up on injection point #11 on TGL w/o
display, so debugging will need to start there. Stack trace below.

[  196.326118] Setting dangerous option inject_driver_load_error - tainting kernel
[  196.328408] xe 0000:00:02.0: vgaarb: deactivate vga console
[  196.328975] xe 0000:00:02.0: [drm:xe_pci_probe [xe]] TIGERLAKE  9a49:0001 dgfx:0 gfx:Xe_LP (12.00) media:Xe_M (12.00) display:no dma_m_s:39 tc:1 gscfi:0 cscfi:0
[  196.329016] xe 0000:00:02.0: [drm:xe_pci_probe [xe]] Stepping = (G:B0, M:B0, D:D0, B:**)
[  196.329039] xe 0000:00:02.0: [drm:xe_pci_probe [xe]] SR-IOV support: no (mode: none)
[  196.330746] xe 0000:00:02.0: [drm] Using GuC firmware from i915/tgl_guc_70.bin version 70.30.0
[  196.331047] xe 0000:00:02.0: [drm] Injecting failure -19 at checkpoint 11 [xe_guc_log_init:98]
[  196.331050] xe 0000:00:02.0: [drm] *ERROR* GT0: GuC init failed with -ENODEV
[  196.338208] xe 0000:00:02.0: [drm] *ERROR* GT0: Failed to initialize uC (-ENODEV)
[  196.347009] BUG: unable to handle page fault for address: 000000000000a188
[  196.353903] #PF: supervisor write access in kernel mode
[  196.359138] #PF: error_code(0x0002) - not-present page
[  196.364289] PGD 0 P4D 0
[  196.366842] Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
[  196.371735] CPU: 6 UID: 0 PID: 1233 Comm: modprobe Tainted: G     U             6.11.0-rc2-xe+ #3796
[  196.380875] Tainted: [U]=USER
[  196.383857] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U DDR4 SODIMM RVP, BIOS TGLSFWI1.R00.3243.A01.2006102133 06/10/2020
[  196.397237] RIP: 0010:xe_mmio_write32+0x67/0x290 [xe]
[  196.402332] Code: 48 0f a3 05 c3 c9 5b e2 0f 82 c6 00 00 00 45 89 e6 41 c1 ee 18 41 f7 c4 00 00 00 40 74 7f 45 84 f6 78 74 49 8b 47 28 48 01 c3 <44> 89 2b 48 83 c4 58 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc
[  196.421085] RSP: 0018:ffffc9000152b820 EFLAGS: 00010006
[  196.426322] RAX: 0000000000000000 RBX: 000000000000a188 RCX: 0000000000000000
[  196.433466] RDX: 0000000000010001 RSI: ffffffff82426f19 RDI: ffffffff824343c6
[  196.440608] RBP: ffff888152678028 R08: 00000000000d6398 R09: 0000000000000001
[  196.447748] R10: 00000000ffffffff R11: ffff888152628000 R12: 000000000000a188
[  196.454893] R13: 0000000000010001 R14: 0000000000000000 R15: ffff88815262a308
[  196.462037] FS:  00007ff3ae103000(0000) GS:ffff88849fb80000(0000) knlGS:0000000000000000
[  196.470137] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  196.475893] CR2: 000000000000a188 CR3: 0000000156d2a004 CR4: 0000000000f70ef0
[  196.483036] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  196.490177] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  196.497323] PKRU: 55555554
[  196.500051] Call Trace:
[  196.502517]  <TASK>
[  196.504640]  ? __die+0x1f/0x70
[  196.507719]  ? page_fault_oops+0x155/0x470
[  196.511831]  ? stack_trace_save+0x49/0x70
[  196.515861]  ? do_user_addr_fault+0x63/0x720
[  196.520151]  ? exc_page_fault+0x63/0x1d0
[  196.524091]  ? asm_exc_page_fault+0x26/0x30
[  196.528293]  ? xe_mmio_write32+0x67/0x290 [xe]
[  196.532777]  xe_force_wake_get+0xc8/0x2b0 [xe]
[  196.537260]  ? lock_acquire+0xcd/0x300
[  196.541031]  xe_gt_tlb_invalidation_ggtt+0xa8/0x310 [xe]
[  196.546380]  ? rcu_is_watching+0x11/0x50
[  196.550322]  ? __mutex_lock+0x12f/0xd70
[  196.554179]  ? find_held_lock+0x2b/0x80
[  196.558031]  ? xe_ggtt_remove_node+0xbf/0xf0 [xe]
[  196.562772]  xe_ggtt_invalidate+0x19/0x80 [xe]
[  196.567251]  xe_ggtt_remove_node+0xdf/0xf0 [xe]
[  196.571818]  xe_ttm_bo_destroy+0x11a/0x220 [xe]
[  196.576388]  drm_managed_release+0xb0/0x160
[  196.580593]  devm_drm_dev_init_release+0x54/0x70
[  196.585232]  release_nodes+0x2e/0xf0
[  196.588827]  devres_release_all+0x8a/0xc0
[  196.592858]  device_unbind_cleanup+0x9/0x70
[  196.597058]  really_probe+0x1a0/0x380
[  196.600740]  __driver_probe_device+0x73/0x150
[  196.605108]  driver_probe_device+0x19/0x90
[  196.609222]  __driver_attach+0xd5/0x1d0
[  196.613073]  ? __pfx___driver_attach+0x10/0x10
[  196.617534]  bus_for_each_dev+0x77/0xd0
[  196.621389]  bus_add_driver+0x110/0x240
[  196.625238]  driver_register+0x5b/0x110
[  196.629086]  xe_init+0x3b/0x80 [xe]
[  196.632615]  ? __pfx_xe_init+0x10/0x10 [xe]
[  196.636829]  do_one_initcall+0x5e/0x2b0
[  196.640683]  ? rcu_is_watching+0x11/0x50
[  196.644622]  ? __kmalloc_cache_noprof+0x24e/0x2f0
[  196.649343]  do_init_module+0x5f/0x210
[  196.653113]  init_module_from_file+0x86/0xd0
[  196.657402]  idempotent_init_module+0x17c/0x230
[  196.661946]  __x64_sys_finit_module+0x59/0xb0
[  196.666323]  do_syscall_64+0x68/0x140
[  196.670006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Matt

[1] https://patchwork.freedesktop.org/series/136701/


Matthew Brost (1):
  drm/xe: Add driver load error injection

 drivers/gpu/drm/xe/xe_device.c       | 31 ++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_device.h       | 15 ++++++++++++++
 drivers/gpu/drm/xe/xe_device_types.h |  4 ++++
 drivers/gpu/drm/xe/xe_gt.c           |  5 +++++
 drivers/gpu/drm/xe/xe_gt_sriov_pf.c  |  4 ++++
 drivers/gpu/drm/xe/xe_guc.c          |  8 +++++++
 drivers/gpu/drm/xe/xe_guc_ads.c      |  5 +++++
 drivers/gpu/drm/xe/xe_guc_ct.c       |  4 ++++
 drivers/gpu/drm/xe/xe_guc_log.c      |  5 +++++
 drivers/gpu/drm/xe/xe_mmio.c         |  5 +++++
 drivers/gpu/drm/xe/xe_module.c       |  5 +++++
 drivers/gpu/drm/xe/xe_module.h       |  3 +++
 drivers/gpu/drm/xe/xe_pci.c          |  9 ++++++++
 drivers/gpu/drm/xe/xe_pm.c           |  8 +++++++
 drivers/gpu/drm/xe/xe_sriov.c        |  8 ++++++-
 drivers/gpu/drm/xe/xe_tile.c         |  4 ++++
 drivers/gpu/drm/xe/xe_uc.c           |  4 ++++
 drivers/gpu/drm/xe/xe_wa.c           |  5 +++++
 drivers/gpu/drm/xe/xe_wopcm.c        |  4 ++++
 19 files changed, 135 insertions(+), 1 deletion(-)

-- 
2.34.1


