[PATCH] drm/xe/bmg: Fix NULL-ptr deref during probe on SKUs with no display

Jani Nikula jani.nikula at intel.com
Tue Jun 3 08:19:12 UTC 2025


On Tue, 03 Jun 2025, "Kasireddy, Vivek" <vivek.kasireddy at intel.com> wrote:
> Hi Matt,
>
>> Subject: Re: [PATCH] drm/xe/bmg: Fix NULL-ptr deref during probe on SKUs
>> with no display
>> 
>> On Mon, Jun 02, 2025 at 03:11:18PM -0700, Vivek Kasireddy wrote:
>> > Some SKUs (or variants) of BMG do not have display capabilities and
>> > are intended to be used as compute-only devices. The Xe driver is
>> > expected to work with such devices by probing/initializing all
>> > other capabilities except display. However, the following crash is
>> > seen when Xe tries to load:
>> 
>> Can you provide the full dmesg log?  It's not clear exactly what "no
>> display" means in this case.  What's making the driver think there's no
>> display?  Display GMD_ID reads back as 0?  GMD_ID is normal but all
>> pipes are fused off?  Something else?
> At-least with the BMG SKU I am testing with, GMD_ID_DISPLAY reads back 0,
> and as a result, probe_gmdid_display() returns NULL.

Hmm, do we not have BMG in xe CI? Why did this not blow up then? Did I
miss something?

>> 
>> We already support other platforms with various flavors of "no display"
>> so I'm trying to understand what makes this one different.
> It looks like this is a regression that is caused by a recent patch. Specifically, reverting
> 5a9f299f956e ("drm/xe/display: use xe->display to decide whether to do anything")
> also seems to avoid the crash.

Right. Xe sets

	xe->info.probe_display = false;

for all no display cases, and uses that in subsequent checks.

i915 OTOH retains non-NULL display pointer, and uses HAS_DISPLAY() for
the checks.

I don't think the patch at hand is the right fix for i915, and will need
more work. This was also not sent to intel-gfx and thus lacks i915 CI.

This might be the minimal, safer xe fix for starters:

diff --git a/drivers/gpu/drm/xe/display/xe_display.c b/drivers/gpu/drm/xe/display/xe_display.c
index 3f92bf51813e..87262acc9513 100644
--- a/drivers/gpu/drm/xe/display/xe_display.c
+++ b/drivers/gpu/drm/xe/display/xe_display.c
@@ -538,10 +538,10 @@ int xe_display_probe(struct xe_device *xe)
 	if (err)
 		return err;
 
-	xe->display = display;
-
-	if (has_display(xe))
+	if (has_display(xe)) {
+		xe->display = display;
 		return 0;
+	}
 
 no_display:
 	xe->info.probe_display = false;

---

The alternative is to revert 5a9f299f956e ("drm/xe/display: use
xe->display to decide whether to do anything").

No matter what, xe and i915 need unification for no display cases. Xe
oopses on display being non-NULL for no display, and i915 oopses on
display being NULL for no display. ;D

Further comments below inline.

>> What is the
>> actual NULL pointer you're hitting here?
> I believe this is a bit vague, but inside xe_display_flush_cleanup_work(), the line
> spin_lock(&crtc->base.commit_lock);
> appears to be causing the Null-ptr-deref as the crtc object holds some garbage value. 
>
> Thanks,
> Vivek 
>
>> 
>> 
>> Matt
>> 
>> >
>> > [  115.582833] BUG: kernel NULL pointer dereference, address:
>> 00000000000005d0
>> > [  115.589775] #PF: supervisor write access in kernel mode
>> > [  115.594976] #PF: error_code(0x0002) - not-present page
>> > [  115.600088] PGD 0 P4D 0
>> > [  115.602617] Oops: Oops: 0002 [#1] SMP
>> > [  115.606267] CPU: 14 UID: 0 PID: 1547 Comm: kworker/14:3 Tainted: G     U
>> E       6.15.0-local+ #62 PREEMPT(voluntary)
>> > [  115.617332] Tainted: [U]=USER, [E]=UNSIGNED_MODULE
>> > [  115.622100] Hardware name: Intel Corporation Meteor Lake Client
>> Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS
>> MTLPEMI1.R00.3471.D49.2401260852 01/26/2024
>> > [  115.635314] Workqueue: pm pm_runtime_work
>> > [  115.639309] RIP: 0010:_raw_spin_lock+0x17/0x30
>> > [  115.662382] RSP: 0018:ffffd13f82e7bc30 EFLAGS: 00010246
>> > [  115.667581] RAX: 0000000000000000 RBX: ffff8be919076000 RCX:
>> 0000000000000002
>> > [  115.674675] RDX: 0000000000000001 RSI: 000000000000004b RDI:
>> 00000000000005d0
>> > [  115.681775] RBP: ffffd13f82e7bc60 R08: ffffd13f82e7bb00 R09:
>> ffff8beb0c1b06c0
>> > [  115.688869] R10: ffff8be7c034f4c0 R11: fefefefefefefeff R12: fffffffffffffff0
>> > [  115.695965] R13: ffff8be9190762e8 R14: ffff8be919077798 R15:
>> 00000000000005d0
>> > [  115.703062] FS:  0000000000000000(0000) GS:ffff8beb552b6000(0000)
>> knlGS:0000000000000000
>> > [  115.711106] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [  115.716826] CR2: 00000000000005d0 CR3: 000000024c68d002 CR4:
>> 0000000000f72ef0
>> > [  115.723921] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> > [  115.731015] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7:
>> 0000000000000400
>> > [  115.738113] PKRU: 55555554
>> > [  115.740816] Call Trace:
>> > [  115.743258]  <TASK>
>> > [  115.745363]  ? xe_display_flush_cleanup_work+0x92/0x120 [xe]
>> > [  115.751102]  xe_display_pm_runtime_suspend+0x42/0x80 [xe]
>> > [  115.756542]  xe_pm_runtime_suspend+0x11b/0x1b0 [xe]
>> > [  115.761463]  xe_pci_runtime_suspend+0x23/0xd0 [xe]
>> > [  115.766291]  pci_pm_runtime_suspend+0x6b/0x1a0
>> > [  115.770717]  ? pci_pm_thaw_noirq+0xa0/0xa0
>> > [  115.774797]  __rpm_callback+0x48/0x1e0
>> > [  115.778531]  ? pci_pm_thaw_noirq+0xa0/0xa0
>> > [  115.782614]  rpm_callback+0x66/0x70
>> > [  115.786090]  ? pci_pm_thaw_noirq+0xa0/0xa0
>> > [  115.790173]  rpm_suspend+0xe1/0x5e0
>> > [  115.793647]  ? psi_task_switch+0xb8/0x200
>> > [  115.797643]  ? finish_task_switch.isra.0+0x8d/0x270
>> > [  115.802502]  pm_runtime_work+0xa6/0xc0
>> > [  115.806238]  process_one_work+0x186/0x350
>> > [  115.810234]  worker_thread+0x33a/0x480
>> > [  115.813968]  ? process_one_work+0x350/0x350
>> > [  115.818132]  kthread+0x10c/0x220
>> > [  115.821350]  ? kthreads_online_cpu+0x120/0x120
>> > [  115.825774]  ret_from_fork+0x3a/0x60
>> > [  115.829339]  ? kthreads_online_cpu+0x120/0x120
>> > [  115.833768]  ret_from_fork_asm+0x11/0x20
>> > [  115.829339]  ? kthreads_online_cpu+0x120/0x120
>> > [  115.833768]  ret_from_fork_asm+0x11/0x20
>> > [  115.837680]  </TASK>
>> > [  115.839907]  acpi_tad(E) drm(E)
>> > [  115.931629] CR2: 00000000000005d0
>> > [  115.934935] ---[ end trace 0000000000000000 ]---
>> > [  115.939531] RIP: 0010:_raw_spin_lock+0x17/0x30
>> >
>> > Therefore, to fix this issue, we need to make intel_display_device_probe()
>> > return NULL if it does not detect the presence of display and ensure
>> > that its callers check for non-NULL display object before probing other
>> > display related capabilities.
>> >
>> > Cc: Jani Nikula <jani.nikula at intel.com>
>> > Cc: Matt Roper <matthew.d.roper at intel.com>
>> > Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy at intel.com>
>> > ---
>> >  drivers/gpu/drm/i915/display/intel_display_device.c | 5 ++---
>> >  drivers/gpu/drm/i915/i915_driver.c                  | 4 ++--
>> >  drivers/gpu/drm/i915/selftests/mock_gem_device.c    | 4 ++--
>> >  drivers/gpu/drm/xe/display/xe_display.c             | 2 ++
>> >  4 files changed, 8 insertions(+), 7 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/display/intel_display_device.c
>> b/drivers/gpu/drm/i915/display/intel_display_device.c
>> > index 1d8c2036d967..c9494cbf5ba7 100644
>> > --- a/drivers/gpu/drm/i915/display/intel_display_device.c
>> > +++ b/drivers/gpu/drm/i915/display/intel_display_device.c
>> > @@ -1705,9 +1705,8 @@ struct intel_display
>> *intel_display_device_probe(struct pci_dev *pdev)
>> >  	return display;
>> >
>> >  no_display:
>> > -	DISPLAY_INFO(display) = &no_display;
>> > -
>> > -	return display;
>> > +	kfree(display);
>> > +	return NULL;

I quite dislike functions that return distinct error and NULL pointers
for different cases. It should probably return -ENODEV for no display,
and callers can handle that as an okay condition, instead of special
casing NULL for that.

>> >  }
>> >
>> >  void intel_display_device_remove(struct intel_display *display)
>> > diff --git a/drivers/gpu/drm/i915/i915_driver.c
>> b/drivers/gpu/drm/i915/i915_driver.c
>> > index 3b0bda74697d..0bf8125aee8b 100644
>> > --- a/drivers/gpu/drm/i915/i915_driver.c
>> > +++ b/drivers/gpu/drm/i915/i915_driver.c
>> > @@ -757,8 +757,8 @@ i915_driver_create(struct pci_dev *pdev, const
>> struct pci_device_id *ent)
>> >  	display = intel_display_device_probe(pdev);
>> >  	if (IS_ERR(display))
>> >  		return ERR_CAST(display);
>> > -
>> > -	i915->display = display;
>> > +	if (display)
>> > +		i915->display = display;

Right now, i915 has never been tested with i915->display being
NULL. It *will* oops.

After that's been fixed, and it may be a bunch of work, I think this
should be something along the lines of:

	display = intel_display_device_probe(pdev);
	if (PTR_ERR(display) == -ENODEV)
		display = NULL;
	if (IS_ERR(display))
		return ERR_CAST(display);

i.e. returning ERR_PTR(-ENODEV) instead of NULL from
intel_display_device_probe().


BR,
Jani.


>> >
>> >  	return i915;
>> >  }
>> > diff --git a/drivers/gpu/drm/i915/selftests/mock_gem_device.c
>> b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
>> > index fb8751bd5df0..cb1928d52869 100644
>> > --- a/drivers/gpu/drm/i915/selftests/mock_gem_device.c
>> > +++ b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
>> > @@ -186,8 +186,8 @@ struct drm_i915_private *mock_gem_device(void)
>> >  	display = intel_display_device_probe(pdev);
>> >  	if (IS_ERR(display))
>> >  		goto err_device;
>> > -
>> > -	i915->display = display;
>> > +	if (display)
>> > +		i915->display = display;
>> >
>> >  	dev_pm_domain_set(&pdev->dev, &pm_domain);
>> >  	pm_runtime_enable(&pdev->dev);
>> > diff --git a/drivers/gpu/drm/xe/display/xe_display.c
>> b/drivers/gpu/drm/xe/display/xe_display.c
>> > index 3f92bf51813e..027690e4992b 100644
>> > --- a/drivers/gpu/drm/xe/display/xe_display.c
>> > +++ b/drivers/gpu/drm/xe/display/xe_display.c
>> > @@ -533,6 +533,8 @@ int xe_display_probe(struct xe_device *xe)
>> >  	display = intel_display_device_probe(pdev);
>> >  	if (IS_ERR(display))
>> >  		return PTR_ERR(display);
>> > +	if (!display)
>> > +		goto no_display;
>> >
>> >  	err = drmm_add_action_or_reset(&xe->drm, display_device_remove,
>> display);
>> >  	if (err)
>> > --
>> > 2.49.0
>> >
>> 
>> --
>> Matt Roper
>> Graphics Software Engineer
>> Linux GPU Platform Enablement
>> Intel Corporation

-- 
Jani Nikula, Intel


More information about the Intel-xe mailing list