[Intel-xe] [PATCH] drm/xe: don't auto fall back to execlist mode if guc failed to init
Mauro Carvalho Chehab
mauro.chehab at linux.intel.com
Fri Mar 24 07:37:04 UTC 2023
On Thu, 23 Mar 2023 23:08:58 +0000
"Chang, Yu bruce" <yu.bruce.chang at intel.com> wrote:
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost at intel.com>
> > Sent: Thursday, March 23, 2023 3:53 PM
> > To: Chang, Yu bruce <yu.bruce.chang at intel.com>
> > Cc: intel-xe at lists.freedesktop.org
> > Subject: Re: [Intel-xe] [PATCH] drm/xe: don't auto fall back to execlist mode
> > if guc failed to init
> >
> > On Thu, Mar 23, 2023 at 08:23:13PM +0000, Chang, Bruce wrote:
> > > In general, this is due to FW load failure, should just report error
> > > and fail the probe so that user can easily retry again.
> > >
> > > Cc: Matt Roper <matthew.d.roper at intel.com>
> > > Signed-off-by: Bruce Chang <yu.bruce.chang at intel.com>
> >
> > I have not tested this but assuming you did:
> > Reviewed-by: Matthew Brost <matthew.brost at intel.com>
> >
> Yes, I tested on PVC and it used to fall back to execlist mode and constantly
> print out EXECLIST_STATUS. Now all those are not showing after this change.
>
> There is still other unrelated issues during __pfx_ggtt_fini_noalloc, and need
> to be fixed as below.
>
> [ 223.839894] BUG: KASAN: null-ptr-deref in ttm_resource_free+0xe4/0x140 [ttm]
> [ 223.847211] Read of size 8 at addr 0000000000000018 by task systemd-udevd/566
>
> [ 223.856141] CPU: 0 PID: 566 Comm: systemd-udevd Not tainted 6.2.0-xe+ #4
> [ 223.864921] Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0020.P84.2103030140 03/03/2021
> [ 223.877365] Call Trace:
> [ 223.881707] <TASK>
> [ 223.885658] dump_stack_lvl+0x5b/0x85
> [ 223.891200] print_report+0x499/0x4aa
> [ 223.896690] ? ttm_resource_free+0xe4/0x140 [ttm]
> [ 223.903268] kasan_report+0x99/0x1a0
> [ 223.908683] ? ttm_resource_free+0xe4/0x140 [ttm]
> [ 223.915210] ttm_resource_free+0xe4/0x140 [ttm]
> [ 223.921621] ttm_bo_release+0x3e5/0x550 [ttm]
> [ 223.927811] ? __pfx_ttm_bo_release+0x10/0x10 [ttm]
> [ 223.934530] ? ttm_bo_kunmap+0x11f/0x160 [ttm]
> [ 223.940775] ? __pfx_ggtt_fini_noalloc+0x10/0x10 [xe]
Xe driver release is currently buggy. There's a recently added IGT test
that loads and unloads the driver 10 times[1].
[1] This is a good way to check that object references are properly
released and that object lifetimes are handled correctly.
This is what happens if you run it (tested on TGL):
$ sudo ./build/tests/xe_module_load --run many-reload --debug
IGT-Version: 1.27.1-g0682c2b07c7e (x86_64) (Linux: 6.2.0-xe-1ae4dd9e8+ x86_64)
Starting subtest: many-reload
(xe_module_load:3070) DEBUG: reload cycle: 0
(xe_module_load:3070) igt_kmod-DEBUG: Module mei_pxp unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module mei_hdcp unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module drm_kms_helper could not be found or does not exist. err: -2
(xe_module_load:3070) igt_kmod-DEBUG: Could not remove module drm_kms_helper (No such file or directory)
(xe_module_load:3070) igt_kmod-DEBUG: Module drm unloaded immediately
(xe_module_load:3070) DEBUG: reload cycle: 1
(xe_module_load:3070) igt_kmod-DEBUG: Module snd_hda_intel unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module xe unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module drm_display_helper unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module drm_kms_helper unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module gpu_sched unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module drm_suballoc_helper unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module drm_buddy unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module drm_ttm_helper unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module ttm unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module drm unloaded immediately
(xe_module_load:3070) DEBUG: reload cycle: 2
(xe_module_load:3070) igt_kmod-DEBUG: Module snd_hda_intel unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module xe unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module drm_display_helper unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module drm_kms_helper unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module gpu_sched unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module drm_suballoc_helper unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module drm_buddy unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module drm_ttm_helper unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module ttm unloaded immediately
(xe_module_load:3070) igt_kmod-DEBUG: Module drm unloaded immediately
...
The dmesg output from the run above is included below.
Regards,
Mauro
Dmesg:
[ 330.190943] **********************************************************
[ 330.190947] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
[ 330.190951] ** **
[ 330.190955] ** trace_printk() being used. Allocating extra memory. **
[ 330.190959] ** **
[ 330.190962] ** This means that this is a DEBUG kernel and it is **
[ 330.190966] ** unsafe for production use. **
[ 330.190970] ** **
[ 330.190974] ** If you see this message and you are not debugging **
[ 330.190977] ** the kernel, report this immediately to your vendor! **
[ 330.190981] ** **
[ 330.190985] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
[ 330.190988] **********************************************************
[ 330.260128] xe 0000:00:02.0: vgaarb: deactivate vga console
[ 330.302169] xe 0000:00:02.0: vgaarb: deactivate vga console
[ 330.306461] xe 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[ 330.312251] GT topology dss mask (geometry): 00000000,0000003f
[ 330.312259] GT topology dss mask (compute): 00000000,00000000
[ 330.312264] GT topology EU mask per DSS: 0000ffff
[ 330.321566] xe 0000:00:02.0: [drm] Finished loading DMC firmware i915/tgl_dmc_ver2_12.bin (v2.12)
[ 330.682290] xe REG[0x2340-0x235f]: allow read access
[ 330.682307] xe REG[0x7010-0x7017]: allow rw access
[ 330.682334] xe REG[0x7018-0x701f]: allow rw access
[ 330.683282] xe REG[0x223a8-0x223af]: allow read access
[ 330.684245] xe REG[0x1c03a8-0x1c03af]: allow read access
[ 330.685168] xe REG[0x1d03a8-0x1d03af]: allow read access
[ 330.686083] xe REG[0x1c83a8-0x1c83af]: allow read access
[ 330.805598] [drm] Initialized xe 1.1.0 20201103 for 0000:00:02.0 on minor 0
[ 331.008489] ACPI: video: Video Device [GFX0] (multi-head: yes rom: no post: no)
[ 331.056568] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input8
[ 331.064576] xe 0000:00:02.0: [drm] Cannot find any crtc or sizes
[ 331.075111] xe 0000:00:02.0: [drm] Cannot find any crtc or sizes
[ 331.077136] xe 0000:00:02.0: [drm] Cannot find any crtc or sizes
[ 331.321351] snd_hda_intel 0000:00:1f.3: enabling device (0000 -> 0002)
[ 331.340407] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [xe])
[ 331.469991] input: HDA Intel PCH HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input9
[ 331.473074] input: HDA Intel PCH HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input10
[ 331.476405] input: HDA Intel PCH HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input11
[ 331.478857] input: HDA Intel PCH HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input12
[ 334.010143] ACPI: bus type drm_connector unregistered
[ 334.130906] ACPI: bus type drm_connector registered
[ 334.656848] xe 0000:00:02.0: vgaarb: deactivate vga console
[ 334.683973] xe 0000:00:02.0: vgaarb: deactivate vga console
[ 334.690364] GT topology dss mask (geometry): 00000000,0000003f
[ 334.690373] GT topology dss mask (compute): 00000000,00000000
[ 334.690377] GT topology EU mask per DSS: 0000ffff
[ 334.692551] xe 0000:00:02.0: [drm] Finished loading DMC firmware i915/tgl_dmc_ver2_12.bin (v2.12)
[ 335.042555] xe REG[0x2340-0x235f]: allow read access
[ 335.042574] xe REG[0x7010-0x7017]: allow rw access
[ 335.042580] xe REG[0x7018-0x701f]: allow rw access
[ 335.043634] xe REG[0x223a8-0x223af]: allow read access
[ 335.044892] xe REG[0x1c03a8-0x1c03af]: allow read access
[ 335.045951] xe REG[0x1d03a8-0x1d03af]: allow read access
[ 335.047052] xe REG[0x1c83a8-0x1c83af]: allow read access
[ 335.120059] [drm] Initialized xe 1.1.0 20201103 for 0000:00:02.0 on minor 0
[ 335.283192] ACPI: video: Video Device [GFX0] (multi-head: yes rom: no post: no)
[ 335.342193] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input13
[ 335.349695] xe 0000:00:02.0: [drm] Cannot find any crtc or sizes
[ 335.363384] xe 0000:00:02.0: [drm] Cannot find any crtc or sizes
[ 335.365528] xe 0000:00:02.0: [drm] Cannot find any crtc or sizes
[ 335.414725] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [xe])
[ 336.447397] snd_hda_intel 0000:00:1f.3: azx_get_response timeout, switching to polling mode: last cmd=0x200f0000
[ 337.448522] snd_hda_intel 0000:00:1f.3: No response from codec, disabling MSI: last cmd=0x200f0000
[ 338.456521] snd_hda_intel 0000:00:1f.3: Codec #2 probe error; disabling it...
[ 339.463518] snd_hda_intel 0000:00:1f.3: azx_get_response timeout, switching to single_cmd mode: last cmd=0x200f0000
[ 339.465715] hdaudio hdaudioC0D2: no AFG or MFG node found
[ 339.466992] snd_hda_intel 0000:00:1f.3: no codecs initialized
[ 339.475013] ==================================================================
[ 339.475109] BUG: KASAN: use-after-free in snd_card_free+0x99/0x130
[ 339.475125] Read of size 1 at addr ffff88814252ccda by task xe_module_load/3070
[ 339.475143] CPU: 1 PID: 3070 Comm: xe_module_load Not tainted 6.2.0-xe-1ae4dd9e8+ #2
[ 339.475157] Hardware name: Intel(R) Client Systems NUC11TNHi7/NUC11TNBi7, BIOS TNTGL357.0062.2021.1203.1108 12/03/2021
[ 339.475171] Call Trace:
[ 339.475179] <TASK>
[ 339.475186] dump_stack_lvl+0x5b/0x85
[ 339.475197] print_report+0x171/0x4aa
[ 339.475210] ? snd_card_free+0x99/0x130
[ 339.475219] kasan_report+0x99/0x1a0
[ 339.475230] ? snd_card_free+0x99/0x130
[ 339.475243] snd_card_free+0x99/0x130
[ 339.475263] ? __pfx_snd_card_free+0x10/0x10
[ 339.475278] ? azx_remove+0xb4/0xe0 [snd_hda_intel]
[ 339.475303] pci_device_remove+0x66/0x100
[ 339.475316] device_release_driver_internal+0xfa/0x1c0
[ 339.475330] unbind_store+0x13c/0x160
[ 339.475340] ? __pfx_sysfs_kf_write+0x10/0x10
[ 339.475351] kernfs_fop_write_iter+0x1bc/0x260
[ 339.475363] vfs_write+0x57d/0x760
[ 339.475374] ? __pfx_vfs_write+0x10/0x10
[ 339.475388] ? __fget_light+0x9e/0x100
[ 339.475399] ksys_write+0xc7/0x170
[ 339.475409] ? __pfx_ksys_write+0x10/0x10
[ 339.475421] ? lockdep_hardirqs_on_prepare+0x128/0x230
[ 339.475433] ? syscall_enter_from_user_mode+0x21/0x50
[ 339.475446] do_syscall_64+0x3c/0x90
[ 339.475457] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 339.475469] RIP: 0033:0x7ff883d14a37
[ 339.475479] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 339.475504] RSP: 002b:00007ffcaa4f7068 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 339.475520] RAX: ffffffffffffffda RBX: 0000561562a83f58 RCX: 00007ff883d14a37
[ 339.475532] RDX: 000000000000000c RSI: 0000561562a83f6b RDI: 0000000000000003
[ 339.475544] RBP: 0000561562a83e80 R08: 0000000000000033 R09: 00007ffcaa4f6ef0
[ 339.475556] R10: 0000000000000100 R11: 0000000000000246 R12: 00007ffcaa4f7100
[ 339.475568] R13: 0000000000000003 R14: 0000561562a83f6b R15: 00007ff88415b040
[ 339.475583] </TASK>
[ 339.475596] Allocated by task 3070:
[ 339.475605] kasan_save_stack+0x22/0x50
[ 339.475608] kasan_set_track+0x25/0x30
[ 339.475612] __kasan_kmalloc+0x82/0x90
[ 339.475615] __kmalloc+0x5f/0x1b0
[ 339.475619] snd_card_new+0x60/0xc0
[ 339.475623] azx_probe+0x14c/0xf90 [snd_hda_intel]
[ 339.475632] pci_device_probe+0x100/0x210
[ 339.475636] really_probe+0x143/0x4d0
[ 339.475639] __driver_probe_device+0xc7/0x220
[ 339.475643] driver_probe_device+0x49/0xf0
[ 339.475646] __driver_attach+0x101/0x200
[ 339.475650] bus_for_each_dev+0xeb/0x150
[ 339.475653] bus_add_driver+0x2a0/0x2f0
[ 339.475656] driver_register+0xdc/0x170
[ 339.475660] do_one_initcall+0xbd/0x400
[ 339.475664] do_init_module+0xe4/0x320
[ 339.475668] load_module+0x3011/0x3320
[ 339.475671] __do_sys_finit_module+0x110/0x1b0
[ 339.475675] do_syscall_64+0x3c/0x90
[ 339.475678] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 339.475687] Freed by task 89:
[ 339.475695] kasan_save_stack+0x22/0x50
[ 339.475698] kasan_set_track+0x25/0x30
[ 339.475701] kasan_save_free_info+0x2e/0x50
[ 339.475705] __kasan_slab_free+0x109/0x1a0
[ 339.475708] __kmem_cache_free+0x221/0x400
[ 339.475712] device_release+0x5a/0xf0
[ 339.475715] kobject_put+0xde/0x270
[ 339.475719] snd_card_free+0x114/0x130
[ 339.475722] process_one_work+0x527/0x9d0
[ 339.475727] worker_thread+0x2d1/0x640
[ 339.475730] kthread+0x183/0x1c0
[ 339.475734] ret_from_fork+0x29/0x50
[ 339.475743] The buggy address belongs to the object at ffff88814252c000
which belongs to the cache kmalloc-4k of size 4096
[ 339.475762] The buggy address is located 3290 bytes inside of
4096-byte region [ffff88814252c000, ffff88814252d000)
[ 339.475786] The buggy address belongs to the physical page:
[ 339.475796] page:ffffea0005094a00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x142528
[ 339.475801] head:ffffea0005094a00 order:3 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0
[ 339.475804] flags: 0x4000000000010200(slab|head|zone=2)
[ 339.475810] raw: 4000000000010200 ffff8881000433c0 ffffea0004c77210 ffffea0004abb410
[ 339.475813] raw: 0000000000000000 0000000000020002 00000001ffffffff 0000000000000000
[ 339.475816] page dumped because: kasan: bad access detected
[ 339.475824] Memory state around the buggy address:
[ 339.475833] ffff88814252cb80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 339.475846] ffff88814252cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 339.475858] >ffff88814252cc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 339.475870] ^
[ 339.475881] ffff88814252cd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 339.475894] ffff88814252cd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 339.475906] ==================================================================
[ 339.475932] Disabling lock debugging due to kernel taint
[ 340.320483] ACPI: bus type drm_connector unregistered
[ 340.438735] ACPI: bus type drm_connector registered
...
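
When going through long dmesg captures like the one above, it can help to pull out just the KASAN headlines first and then read the full report for each one. The following is only a minimal sketch, assuming the log has been saved as plain text; the function name `find_kasan_reports` and the regular expression are illustrative and not part of IGT or the kernel tooling. It only recognizes headlines of the form "BUG: KASAN: <kind> in <func>+0x.../0x... [module]".

```python
import re

# Matches a KASAN headline such as:
#   [ 339.475109] BUG: KASAN: use-after-free in snd_card_free+0x99/0x130
# The trailing "[module]" part is optional (absent for built-in code).
KASAN_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: KASAN: (?P<kind>[\w-]+) in "
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)


def find_kasan_reports(dmesg_text):
    """Return one dict per KASAN headline found in dmesg_text."""
    reports = []
    for line in dmesg_text.splitlines():
        # search() rather than match(), so quoted lines ("> [ ...") also work
        m = KASAN_RE.search(line)
        if m:
            reports.append({
                "timestamp": float(m.group("ts")),
                "kind": m.group("kind"),
                "function": m.group("func"),
                "module": m.group("module"),  # None when no [module] suffix
            })
    return reports


# Headlines copied from the two reports in this thread, plus one unrelated line.
sample = """\
> [ 223.839894] BUG: KASAN: null-ptr-deref in ttm_resource_free+0xe4/0x140 [ttm]
[ 339.475109] BUG: KASAN: use-after-free in snd_card_free+0x99/0x130
[ 339.475932] Disabling lock debugging due to kernel taint
"""

for report in find_kasan_reports(sample):
    print(report)
```

Run against the sample, this lists the null-ptr-deref in ttm_resource_free and the use-after-free in snd_card_free, and skips the taint message.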