[PATCH] drm/xe/bmg: Fix NULL-ptr deref during probe on SKUs with no display
Jani Nikula
jani.nikula at intel.com
Tue Jun 3 08:19:12 UTC 2025
On Tue, 03 Jun 2025, "Kasireddy, Vivek" <vivek.kasireddy at intel.com> wrote:
> Hi Matt,
>
>> Subject: Re: [PATCH] drm/xe/bmg: Fix NULL-ptr deref during probe on SKUs
>> with no display
>>
>> On Mon, Jun 02, 2025 at 03:11:18PM -0700, Vivek Kasireddy wrote:
>> > Some SKUs (or variants) of BMG do not have display capabilities and
>> > are intended to be used as compute-only devices. The Xe driver is
>> > expected to work with such devices by probing/initializing all
>> > other capabilities except display. However, the following crash is
>> > seen when Xe tries to load:
>>
>> Can you provide the full dmesg log? It's not clear exactly what "no
>> display" means in this case. What's making the driver think there's no
>> display? Display GMD_ID reads back as 0? GMD_ID is normal but all
>> pipes are fused off? Something else?
> At-least with the BMG SKU I am testing with, GMD_ID_DISPLAY reads back 0,
> and as a result, probe_gmdid_display() returns NULL.
Hmm, do we not have BMG in xe CI? Why did this not blow up then? Did I
miss something?
>>
>> We already support other platforms with various flavors of "no display"
>> so I'm trying to understand what makes this one different.
> It looks like this is a regression that is caused by a recent patch. Specifically, reverting
> 5a9f299f956e ("drm/xe/display: use xe->display to decide whether to do anything")
> also seems to avoid the crash.
Right. Xe sets
xe->info.probe_display = false;
for all no display cases, and uses that in subsequent checks.
i915 OTOH retains non-NULL display pointer, and uses HAS_DISPLAY() for
the checks.
I don't think the patch at hand is the right fix for i915, and will need
more work. This was also not sent to intel-gfx and thus lacks i915 CI.
This might be the minimal, safer xe fix for starters:
diff --git a/drivers/gpu/drm/xe/display/xe_display.c b/drivers/gpu/drm/xe/display/xe_display.c
index 3f92bf51813e..87262acc9513 100644
--- a/drivers/gpu/drm/xe/display/xe_display.c
+++ b/drivers/gpu/drm/xe/display/xe_display.c
@@ -538,10 +538,10 @@ int xe_display_probe(struct xe_device *xe)
if (err)
return err;
- xe->display = display;
-
- if (has_display(xe))
+ if (has_display(xe)) {
+ xe->display = display;
return 0;
+ }
no_display:
xe->info.probe_display = false;
---
The alternative is to revert 5a9f299f956e ("drm/xe/display: use
xe->display to decide whether to do anything").
No matter what, xe and i915 need unification for no display cases. Xe
oopses on display being non-NULL for no display, and i915 oopses on
display being NULL for no display. ;D
Further comments below inline.
>> What is the
>> actual NULL pointer you're hitting here?
> I believe this is a bit vague, but inside xe_display_flush_cleanup_work(), the line
> spin_lock(&crtc->base.commit_lock);
> appears to be causing the Null-ptr-deref as the crtc object holds some garbage value.
>
> Thanks,
> Vivek
>
>>
>>
>> Matt
>>
>> >
>> > [ 115.582833] BUG: kernel NULL pointer dereference, address:
>> 00000000000005d0
>> > [ 115.589775] #PF: supervisor write access in kernel mode
>> > [ 115.594976] #PF: error_code(0x0002) - not-present page
>> > [ 115.600088] PGD 0 P4D 0
>> > [ 115.602617] Oops: Oops: 0002 [#1] SMP
>> > [ 115.606267] CPU: 14 UID: 0 PID: 1547 Comm: kworker/14:3 Tainted: G U
>> E 6.15.0-local+ #62 PREEMPT(voluntary)
>> > [ 115.617332] Tainted: [U]=USER, [E]=UNSIGNED_MODULE
>> > [ 115.622100] Hardware name: Intel Corporation Meteor Lake Client
>> Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS
>> MTLPEMI1.R00.3471.D49.2401260852 01/26/2024
>> > [ 115.635314] Workqueue: pm pm_runtime_work
>> > [ 115.639309] RIP: 0010:_raw_spin_lock+0x17/0x30
>> > [ 115.662382] RSP: 0018:ffffd13f82e7bc30 EFLAGS: 00010246
>> > [ 115.667581] RAX: 0000000000000000 RBX: ffff8be919076000 RCX:
>> 0000000000000002
>> > [ 115.674675] RDX: 0000000000000001 RSI: 000000000000004b RDI:
>> 00000000000005d0
>> > [ 115.681775] RBP: ffffd13f82e7bc60 R08: ffffd13f82e7bb00 R09:
>> ffff8beb0c1b06c0
>> > [ 115.688869] R10: ffff8be7c034f4c0 R11: fefefefefefefeff R12: fffffffffffffff0
>> > [ 115.695965] R13: ffff8be9190762e8 R14: ffff8be919077798 R15:
>> 00000000000005d0
>> > [ 115.703062] FS: 0000000000000000(0000) GS:ffff8beb552b6000(0000)
>> knlGS:0000000000000000
>> > [ 115.711106] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [ 115.716826] CR2: 00000000000005d0 CR3: 000000024c68d002 CR4:
>> 0000000000f72ef0
>> > [ 115.723921] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> > [ 115.731015] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7:
>> 0000000000000400
>> > [ 115.738113] PKRU: 55555554
>> > [ 115.740816] Call Trace:
>> > [ 115.743258] <TASK>
>> > [ 115.745363] ? xe_display_flush_cleanup_work+0x92/0x120 [xe]
>> > [ 115.751102] xe_display_pm_runtime_suspend+0x42/0x80 [xe]
>> > [ 115.756542] xe_pm_runtime_suspend+0x11b/0x1b0 [xe]
>> > [ 115.761463] xe_pci_runtime_suspend+0x23/0xd0 [xe]
>> > [ 115.766291] pci_pm_runtime_suspend+0x6b/0x1a0
>> > [ 115.770717] ? pci_pm_thaw_noirq+0xa0/0xa0
>> > [ 115.774797] __rpm_callback+0x48/0x1e0
>> > [ 115.778531] ? pci_pm_thaw_noirq+0xa0/0xa0
>> > [ 115.782614] rpm_callback+0x66/0x70
>> > [ 115.786090] ? pci_pm_thaw_noirq+0xa0/0xa0
>> > [ 115.790173] rpm_suspend+0xe1/0x5e0
>> > [ 115.793647] ? psi_task_switch+0xb8/0x200
>> > [ 115.797643] ? finish_task_switch.isra.0+0x8d/0x270
>> > [ 115.802502] pm_runtime_work+0xa6/0xc0
>> > [ 115.806238] process_one_work+0x186/0x350
>> > [ 115.810234] worker_thread+0x33a/0x480
>> > [ 115.813968] ? process_one_work+0x350/0x350
>> > [ 115.818132] kthread+0x10c/0x220
>> > [ 115.821350] ? kthreads_online_cpu+0x120/0x120
>> > [ 115.825774] ret_from_fork+0x3a/0x60
>> > [ 115.829339] ? kthreads_online_cpu+0x120/0x120
>> > [ 115.833768] ret_from_fork_asm+0x11/0x20
>> > [ 115.829339] ? kthreads_online_cpu+0x120/0x120
>> > [ 115.833768] ret_from_fork_asm+0x11/0x20
>> > [ 115.837680] </TASK>
>> > [ 115.839907] acpi_tad(E) drm(E)
>> > [ 115.931629] CR2: 00000000000005d0
>> > [ 115.934935] ---[ end trace 0000000000000000 ]---
>> > [ 115.939531] RIP: 0010:_raw_spin_lock+0x17/0x30
>> >
>> > Therefore, to fix this issue, we need to make intel_display_device_probe()
>> > return NULL if it does not detect the presence of display and ensure
>> > that its callers check for non-NULL display object before probing other
>> > display related capabilities.
>> >
>> > Cc: Jani Nikula <jani.nikula at intel.com>
>> > Cc: Matt Roper <matthew.d.roper at intel.com>
>> > Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy at intel.com>
>> > ---
>> > drivers/gpu/drm/i915/display/intel_display_device.c | 5 ++---
>> > drivers/gpu/drm/i915/i915_driver.c | 4 ++--
>> > drivers/gpu/drm/i915/selftests/mock_gem_device.c | 4 ++--
>> > drivers/gpu/drm/xe/display/xe_display.c | 2 ++
>> > 4 files changed, 8 insertions(+), 7 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/display/intel_display_device.c
>> b/drivers/gpu/drm/i915/display/intel_display_device.c
>> > index 1d8c2036d967..c9494cbf5ba7 100644
>> > --- a/drivers/gpu/drm/i915/display/intel_display_device.c
>> > +++ b/drivers/gpu/drm/i915/display/intel_display_device.c
>> > @@ -1705,9 +1705,8 @@ struct intel_display
>> *intel_display_device_probe(struct pci_dev *pdev)
>> > return display;
>> >
>> > no_display:
>> > - DISPLAY_INFO(display) = &no_display;
>> > -
>> > - return display;
>> > + kfree(display);
>> > + return NULL;
I quite dislike functions that return distinct error and NULL pointers
for different cases. It should probably return -ENODEV for no display,
and callers can handle that as an okay condition, instead of special
casing NULL for that.
>> > }
>> >
>> > void intel_display_device_remove(struct intel_display *display)
>> > diff --git a/drivers/gpu/drm/i915/i915_driver.c
>> b/drivers/gpu/drm/i915/i915_driver.c
>> > index 3b0bda74697d..0bf8125aee8b 100644
>> > --- a/drivers/gpu/drm/i915/i915_driver.c
>> > +++ b/drivers/gpu/drm/i915/i915_driver.c
>> > @@ -757,8 +757,8 @@ i915_driver_create(struct pci_dev *pdev, const
>> struct pci_device_id *ent)
>> > display = intel_display_device_probe(pdev);
>> > if (IS_ERR(display))
>> > return ERR_CAST(display);
>> > -
>> > - i915->display = display;
>> > + if (display)
>> > + i915->display = display;
Right now, i915 has never been tested with i915->display being
NULL. It *will* oops.
After that's been fixed, and it may be a bunch of work, I think this
should be something along the lines of:
display = intel_display_device_probe(pdev);
if (PTR_ERR(display) == -ENODEV)
display = NULL;
if (IS_ERR(display))
return ERR_CAST(display);
i.e. returning ERR_PTR(-ENODEV) instead of NULL from
intel_display_device_probe().
BR,
Jani.
>> >
>> > return i915;
>> > }
>> > diff --git a/drivers/gpu/drm/i915/selftests/mock_gem_device.c
>> b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
>> > index fb8751bd5df0..cb1928d52869 100644
>> > --- a/drivers/gpu/drm/i915/selftests/mock_gem_device.c
>> > +++ b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
>> > @@ -186,8 +186,8 @@ struct drm_i915_private *mock_gem_device(void)
>> > display = intel_display_device_probe(pdev);
>> > if (IS_ERR(display))
>> > goto err_device;
>> > -
>> > - i915->display = display;
>> > + if (display)
>> > + i915->display = display;
>> >
>> > dev_pm_domain_set(&pdev->dev, &pm_domain);
>> > pm_runtime_enable(&pdev->dev);
>> > diff --git a/drivers/gpu/drm/xe/display/xe_display.c
>> b/drivers/gpu/drm/xe/display/xe_display.c
>> > index 3f92bf51813e..027690e4992b 100644
>> > --- a/drivers/gpu/drm/xe/display/xe_display.c
>> > +++ b/drivers/gpu/drm/xe/display/xe_display.c
>> > @@ -533,6 +533,8 @@ int xe_display_probe(struct xe_device *xe)
>> > display = intel_display_device_probe(pdev);
>> > if (IS_ERR(display))
>> > return PTR_ERR(display);
>> > + if (!display)
>> > + goto no_display;
>> >
>> > err = drmm_add_action_or_reset(&xe->drm, display_device_remove,
>> display);
>> > if (err)
>> > --
>> > 2.49.0
>> >
>>
>> --
>> Matt Roper
>> Graphics Software Engineer
>> Linux GPU Platform Enablement
>> Intel Corporation
--
Jani Nikula, Intel
More information about the Intel-xe
mailing list