[PATCH] Revert "drm/xe: Do not grab forcewakes when issuing GGTT TLB invalidation via GuC"

Lucas De Marchi lucas.demarchi at intel.com
Thu Mar 7 19:06:04 UTC 2024


On Thu, Mar 07, 2024 at 01:02:38PM -0600, Lucas De Marchi wrote:
>On Thu, Mar 07, 2024 at 01:42:37PM -0500, Rodrigo Vivi wrote:
>>On Thu, Mar 07, 2024 at 12:09:42PM -0600, Lucas De Marchi wrote:
>>>On Wed, Mar 06, 2024 at 07:11:05PM -0500, Rodrigo Vivi wrote:
>>>> This reverts commit 27ee413bbc0b04146f4ee1c7444422bf18dafd47.
>>>>
>>>> On DG2 after this patch:
>>>>
>>>> [  439.105953] general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN NOPTI
>>>> [  439.117349] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
>>>> [  439.124924] CPU: 8 PID: 7160 Comm: insmod Tainted: G     U     OE      6.8.0-rc7+ #5
>>>> [  439.132669] Hardware name: iBUYPOWER INTEL/B660 DS3H AC DDR4-Y1, BIOS F5 12/17/2021
>>>> [  439.140332] RIP: 0010:xe_ggtt_invalidate+0x150/0x260 [xe]
>>>> [  439.145860] Code: 48 8d 7d 18 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 b3 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 5d 18 48 89 da 48 c1 ea 03 <80> 3c 02 00 0f 85 87 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b
>>>> [  439.164609] RSP: 0018:ffffc9000201eee8 EFLAGS: 00010246
>>>> [  439.169843] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>> [  439.176980] RDX: 0000000000000000 RSI: ffffffffb6ae4040 RDI: ffff88818231a478
>>>> [  439.184121] RBP: ffff88818231a460 R08: 0000000000000001 R09: fffffbfff7563e5c
>>>> [  439.191259] R10: fffff52000403d92 R11: 0000000000000001 R12: ffff888185330028
>>>> [  439.198401] R13: ffffffffffffffff R14: 0000000000001000 R15: 0000000000000000
>>>> [  439.205543] FS:  00007f6c0fd92740(0000) GS:ffff888f76400000(0000) knlGS:0000000000000000
>>>> [  439.213634] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  439.219393] CR2: 00007ffedd41de70 CR3: 00000001063ec000 CR4: 0000000000750ef0
>>>> [  439.226534] PKRU: 55555554
>>>> [  439.229256] Call Trace:
>>>> [  439.231713]  <TASK>
>>>> [  439.233832]  ? die_addr+0x3c/0xa0
>>>> [  439.237165]  ? exc_general_protection+0x158/0x240
>>>> [  439.241882]  ? asm_exc_general_protection+0x22/0x30
>>>> [  439.246774]  ? xe_ggtt_invalidate+0x150/0x260 [xe]
>>>> [  439.251679]  ? xe_ggtt_invalidate+0xa0/0x260 [xe]
>>>> [  439.256490]  __xe_ggtt_insert_bo_at+0x24b/0x720 [xe]
>>>> [  439.261556]  ? __pfx____xe_bo_create_locked+0x10/0x10 [xe]
>>>> [  439.267144]  ? __pfx___xe_ggtt_insert_bo_at+0x10/0x10 [xe]
>>>> [  439.272737]  __xe_bo_create_locked+0x4af/0x1080 [xe]
>>>> [  439.277806]  xe_bo_create_pin_map_at+0x40/0x400 [xe]
>>>> [  439.282873]  ? xe_uc_fw_check_version_requirements+0x496/0x900 [xe]
>>>> [  439.289255]  xe_managed_bo_create_from_data+0x38/0x130 [xe]
>>>> [  439.294934]  xe_uc_fw_init+0xe71/0x3270 [xe]
>>>> [  439.299317]  ? __pfx_xe_uc_fw_init+0x10/0x10 [xe]
>>>> [  439.304127]  ? __pfx_lock_acquired+0x10/0x10
>>>> [  439.308406]  ? __pfx___drm_printfn_info+0x10/0x10
>>>> [  439.313118]  ? _raw_spin_unlock_irqrestore+0x4b/0x80
>>>> [  439.318093]  ? __pm_runtime_resume+0x7f/0x110
>>>> [  439.322458]  xe_guc_init+0x8d/0x890 [xe]
>>>> [  439.326497]  xe_uc_init+0x65/0x1a0 [xe]
>>>> [  439.330440]  xe_gt_init_hwconfig+0xdc/0x180 [xe]
>>>> [  439.335161]  xe_device_probe+0x747/0x1060 [xe]
>>>> [  439.339712]  ? __pfx___drmm_mutex_release+0x10/0x10
>>>> [  439.344599]  ? __drmm_add_action+0x19d/0x280
>>>> [  439.348883]  ? __pfx___drmm_mutex_release+0x10/0x10
>>>> [  439.353769]  xe_pci_probe+0x168b/0x2f30 [xe]
>>>> [  439.358154]  ? __pfx_lock_acquired+0x10/0x10
>>>> [  439.362433]  ? __pfx_xe_pci_probe+0x10/0x10 [xe]
>>>> [  439.367158]  ? _raw_spin_unlock_irqrestore+0x62/0x80
>>>> [  439.372131]  ? lockdep_hardirqs_on+0xc7/0x140
>>>> [  439.376498]  ? _raw_spin_unlock_irqrestore+0x4b/0x80
>>>> [  439.381473]  ? __pfx_xe_pci_probe+0x10/0x10 [xe]
>>>> [  439.386196]  local_pci_probe+0xd6/0x190
>>>> [  439.390042]  pci_device_probe+0x223/0x740
>>>> [  439.394064]  ? __pfx_pci_device_probe+0x10/0x10
>>>> [  439.398605]  ? kernfs_create_link+0x167/0x230
>>>> [  439.402970]  ? do_raw_spin_unlock+0x54/0x1f0
>>>> [  439.407252]  really_probe+0x3df/0xb80
>>>> [  439.410926]  __driver_probe_device+0x18c/0x450
>>>> [  439.415382]  driver_probe_device+0x4a/0x120
>>>> [  439.419572]  __driver_attach+0x1e1/0x4a0
>>>> [  439.423505]  ? __pfx___driver_attach+0x10/0x10
>>>> [  439.427963]  bus_for_each_dev+0xf2/0x160
>>>> [  439.431897]  ? __pfx_bus_for_each_dev+0x10/0x10
>>>> [  439.436436]  bus_add_driver+0x29d/0x570
>>>> [  439.440283]  driver_register+0x130/0x450
>>>> [  439.444218]  ? __pfx_xe_init+0x10/0x10 [xe]
>>>> [  439.448507]  xe_init+0x81/0x140 [xe]
>>>> [  439.452188]  ? __pfx_xe_init+0x10/0x10 [xe]
>>>> [  439.456478]  do_one_initcall+0xcf/0x420
>>>> [  439.460326]  ? __pfx_do_one_initcall+0x10/0x10
>>>> [  439.464783]  ? kasan_unpoison+0x40/0x70
>>>> [  439.468631]  do_init_module+0x238/0x730
>>>> [  439.472478]  load_module+0x5ff7/0x6c30
>>>> [  439.476242]  ? __pfx_load_module+0x10/0x10
>>>> [  439.480356]  ? ima_post_read_file+0x163/0x190
>>>> [  439.484722]  ? __pfx_ima_post_read_file+0x10/0x10
>>>> [  439.489435]  ? security_kernel_post_read_file+0x6d/0xb0
>>>> [  439.494673]  ? __pfx_kernel_read_file+0x10/0x10
>>>> [  439.499220]  ? init_module_from_file+0xc0/0x100
>>>> [  439.503758]  init_module_from_file+0xc0/0x100
>>>> [  439.508125]  ? __pfx_init_module_from_file+0x10/0x10
>>>> [  439.513103]  ? do_raw_spin_unlock+0x54/0x1f0
>>>> [  439.517382]  idempotent_init_module+0x241/0x660
>>>> [  439.521924]  ? __pfx_idempotent_init_module+0x10/0x10
>>>> [  439.526986]  ? security_capable+0x6d/0xb0
>>>> [  439.531003]  __x64_sys_finit_module+0xba/0x130
>>>> [  439.535459]  do_syscall_64+0x97/0x190
>>>> [  439.539135]  ? lockdep_hardirqs_on_prepare+0x17b/0x420
>>>> [  439.544282]  ? do_syscall_64+0xa7/0x190
>>>> [  439.548125]  ? lockdep_hardirqs_on+0xc7/0x140
>>>> [  439.552493]  ? do_syscall_64+0xa7/0x190
>>>> [  439.556341]  ? do_syscall_64+0xa7/0x190
>>>> [  439.560188]  ? lockdep_hardirqs_on+0xc7/0x140
>>>> [  439.564556]  ? do_syscall_64+0xa7/0x190
>>>> [  439.568401]  ? do_syscall_64+0xa7/0x190
>>>> [  439.572249]  ? lockdep_hardirqs_on_prepare+0x17b/0x420
>>>> [  439.577395]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
>>>> [  439.582460] RIP: 0033:0x7f6c0f73160d
>>>> [  439.586047] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f3 47 0c 00 f7 d8 64 89 01 48
>>>> [  439.604792] RSP: 002b:00007ffd0ecad3b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
>>>> [  439.612368] RAX: ffffffffffffffda RBX: 0000565382a36740 RCX: 00007f6c0f73160d
>>>> [  439.619506] RDX: 0000000000000000 RSI: 0000565382a362a0 RDI: 0000000000000003
>>>> [  439.626644] RBP: 00007ffd0ecad470 R08: 0000000000000000 R09: 0000565382a37760
>>>> [  439.633782] R10: 0000000000000003 R11: 0000000000000246 R12: 0000565382a362a0
>>>> [  439.640925] R13: 0000000000000000 R14: 0000565382a38790 R15: 0000565382a362a0
>>>> [  439.648066]  </TASK>
>>>> [  439.650263] Modules linked in: xe(OE+) snd_hda_codec_hdmi snd_seq_dummy snd_hrtimer rfcomm nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc vfat fat intel_rapl_msr snd_sof_pci_intel_tgl iwlmvm intel_rapl_common snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_pci intel_uncore_frequency snd_sof_xtensa_dsp intel_uncore_frequency_common snd_sof_intel_hda x86_pkg_temp_thermal snd_sof intel_powerclamp mac80211 snd_sof_utils snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core coretemp snd_compress snd_sof_intel_hda_mlink snd_hda_ext_core kvm_intel snd_hda_intel snd_intel_dspcfg libarc4 snd_hda_codec kvm snd_hwdep snd_hda_core btusb iwlwifi irqbypass snd_seq btrtl snd_seq_device iTCO_wdt btintel rapl pmt_telemetry intel_pmc_bxt snd_pcm intel_cstate btbcm ee1004 iTCO_vendor_support btmtk mei_hdcp
>>>> [  439.650319]  mei_pxp pmt_class cfg80211 i2c_i801 snd_timer intel_uncore wmi_bmof gigabyte_wmi bluetooth pcspkr snd i2c_smbus mei_me soundcore mei idma64 rfkill intel_vsec intel_hid joydev acpi_tad acpi_pad sparse_keymap loop zram drm_gpuvm hid_logitech_hidpp i915 crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel nvme r8169 sha512_ssse3 nvme_core pinctrl_alderlake hid_logitech_dj ip6_tables ip_tables fuse
>>>> [  439.739896] Unloaded tainted modules: xe(OE):2 [last unloaded: xe(OE)]
>>>> [  439.786092] ---[ end trace 0000000000000000 ]---
>>>> [  439.790733] RIP: 0010:xe_ggtt_invalidate+0x150/0x260 [xe]
>>>> [  439.796254] Code: 48 8d 7d 18 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 b3 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 5d 18 48 89 da 48 c1 ea 03 <80> 3c 02 00 0f 85 87 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b
>>>> [  439.815026] RSP: 0018:ffffc9000201eee8 EFLAGS: 00010246
>>>> [  439.820262] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>> [  439.827407] RDX: 0000000000000000 RSI: ffffffffb6ae4040 RDI: ffff88818231a478
>>>> [  439.834549] RBP: ffff88818231a460 R08: 0000000000000001 R09: fffffbfff7563e5c
>>>> [  439.841691] R10: fffff52000403d92 R11: 0000000000000001 R12: ffff888185330028
>>>> [  439.848835] R13: ffffffffffffffff R14: 0000000000001000 R15: 0000000000000000
>>>> [  439.855978] FS:  00007f6c0fd92740(0000) GS:ffff888f76400000(0000) knlGS:0000000000000000
>>>> [  439.864072] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  439.869824] CR2: 00007ffedd41de70 CR3: 00000001063ec000 CR4: 0000000000750ef0
>>>> [  439.876966] PKRU: 55555554
>>>>
>>>> Fixes: 27ee413bbc0b ("drm/xe: Do not grab forcewakes when issuing GGTT TLB invalidation via GuC")
>>>> Cc: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
>>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>>> Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>> ---
>>>> drivers/gpu/drm/xe/xe_ggtt.c                | 7 +++++++
>>>> drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 2 --
>>>> 2 files changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
>>>> index 5e739513ab0a..717d0e76277a 100644
>>>> --- a/drivers/gpu/drm/xe/xe_ggtt.c
>>>> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
>>>> @@ -257,9 +257,16 @@ static void ggtt_invalidate_gt_tlb(struct xe_gt *gt)
>>>> 	if (!gt)
>>>> 		return;
>>>>
>>>> +	/*
>>>> +	 * Invalidation can happen when there's no in-flight work keeping the
>>>> +	 * GT awake.  We need to explicitly grab forcewake to ensure the GT
>>>> +	 * and GuC are accessible.
>>>
>>>
>>>aside from the null pointer deref, what exactly
>>>from this comment above that would not be true to make the patch
>>>correct?
>>
>>That's a very good question. maybe -ENOTENOUGHCOFFEE from my side...
>>my bisect took me to this patch that touches the RIP function:
>>[  439.140332] RIP: 0010:xe_ggtt_invalidate+0x150/0x260 [xe]
>>and I had to use this revert to be able to proceed with my work yesterday.
>>
>>But then, Matt pointed me to https://patchwork.freedesktop.org/patch/581522/?series=130786&rev=1
>>and applying that here also make things work...
>
>
>well, that first patch fixes the null pointer deref, but that is while
>printing an *error*. The error is still there. I see commit


scratch that... I misread the patch. Yes, makes sense, let's get that
in.

Lucas De Marchi

>27ee413bbc0b ("drm/xe: Do not grab forcewakes when issuing GGTT TLB
>invalidation via GuC")  removed the forcewake without proper explanation
>(at least to me), and now we are reproducing the error path (with the
>extra annoyance of the null deref).
>
>>although that doesn't touch the xe_ggtt_invalidate but only ggtt_invalidate_gt_tlb
>>
>>So, I'm still puzzled on the backlog above, but anyway,
>>what we need is not this revert, but get this
>>drm/xe: Fix NULL check in xe_ggtt_init()
>>pushed asap
>
>/me confused ... as I said above we'd still have the error, just not the
>null deref. Unless the other patches in that series are fixing something
>more.
>
>Lucas De Marchi


More information about the Intel-xe mailing list