Regression on drm-tip

Sun Mar 16 10:01:03 UTC 2025

> -----Original Message-----
> From: Baolu Lu <baolu.lu at linux.intel.com>
> Sent: Sunday, March 16, 2025 1:33 PM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>
> Cc: intel-gfx at lists.freedesktop.org; intel-xe at lists.freedesktop.org;
> iommu at lists.linux.dev; Kurmi, Suresh Kumar
> <suresh.kumar.kurmi at intel.com>; Saarinen, Jani <jani.saarinen at intel.com>;
> De Marchi, Lucas <lucas.demarchi at intel.com>
> Subject: Re: Regression on drm-tip
> 
> On 3/16/25 15:27, Borah, Chaitanya Kumar wrote:
> >
> >> -----Original Message-----
> >> From: Baolu Lu <baolu.lu at linux.intel.com>
> >> Sent: Sunday, March 16, 2025 8:04 AM
> >> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>
> >> Cc: intel-gfx at lists.freedesktop.org; intel-xe at lists.freedesktop.org;
> >> iommu at lists.linux.dev
> >> Subject: Re: Regression on drm-tip
> >>
> >> On 3/14/25 17:04, Borah, Chaitanya Kumar wrote:
> >>>
> >>>> -----Original Message-----
> >>>> From: Baolu Lu <baolu.lu at linux.intel.com>
> >>>> Sent: Thursday, March 13, 2025 7:53 PM
> >>>> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>
> >>>> Cc: baolu.lu at linux.intel.com; intel-gfx at lists.freedesktop.org;
> >>>> intel-xe at lists.freedesktop.org; iommu at lists.linux.dev
> >>>> Subject: Re: Regression on drm-tip
> >>>>
> >>>> On 2025/3/13 16:51, Borah, Chaitanya Kumar wrote:
> >>>>> Hello Lu,
> >>>>>
> >>>>> Hope you are doing well. I am Chaitanya from the Linux graphics
> >>>>> team at Intel.
> >>>>> This mail is regarding a regression we are seeing in our CI
> >>>>> runs [1] on the drm-tip repository.
> >>>>> ``````````````````````````````````````````````````````````````````
> >>>>> <4>[    2.856622] WARNING: possible circular locking dependency detected
> >>>>> <4>[    2.856631] 6.14.0-rc5-CI_DRM_16217-gc55ef90b69d3+ #1 Tainted: G          I
> >>>>> <4>[    2.856642] ------------------------------------------------------
> >>>>> <4>[    2.856650] swapper/0/1 is trying to acquire lock:
> >>>>> <4>[    2.856657] ffffffff8360ecc8 (iommu_probe_device_lock){+.+.}-{3:3}, at: iommu_probe_device+0x1d/0x70
> >>>>> <4>[    2.856679]
> >>>>>                      but task is already holding lock:
> >>>>> <4>[    2.856686] ffff888102ab6fa8 (&device->physical_node_lock){+.+.}-{3:3}, at: intel_iommu_init+0xea1/0x1220
> >>>>> ``````````````````````````````````````````````````````````````````
> >>>>> The detailed log can be found in [2].
> >>>>>
> >>>>> After bisecting the tree, the following patch [3] seems to be the
> >>>>> first "bad" commit:
> >>>>>
> >>>>> ``````````````````````````````````````````````````````````````````
> >>>>> commit b150654f74bf0df8e6a7936d5ec51400d9ec06d8
> >>>>> Author: Lu Baolu <baolu.lu at linux.intel.com>
> >>>>> Date:   Fri Feb 28 18:27:26 2025 +0800
> >>>>>
> >>>>>        iommu/vt-d: Fix suspicious RCU usage
> >>>>>
> >>>>> ``````````````````````````````````````````````````````````````````
> >>>>>
> >>>>> We also verified that the issue is not seen if we revert the patch.
> >>>>>
> >>>>> Could you please check why the patch causes this regression and
> >>>>> provide a fix if necessary?
> >>>>
> >>>> Can you please take a quick test to check if the following fix works?
> >>>>
> >>>> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
> >>>> index e540092d664d..06debeaec643 100644
> >>>> --- a/drivers/iommu/intel/dmar.c
> >>>> +++ b/drivers/iommu/intel/dmar.c
> >>>> @@ -2051,8 +2051,13 @@ int enable_drhd_fault_handling(unsigned int cpu)
> >>>>                    if (iommu->irq || iommu->node != cpu_to_node(cpu))
> >>>>                            continue;
> >>>>
> >>>> +               /*
> >>>> +                * Call dmar_alloc_hwirq() with dmar_global_lock held,
> >>>> +                * could cause possible lock race condition.
> >>>> +                */
> >>>> +               up_read(&dmar_global_lock);
> >>>>                    ret = dmar_set_interrupt(iommu);
> >>>> -
> >>>> +               down_read(&dmar_global_lock);
> >>>>                    if (ret) {
> >>>>                            pr_err("DRHD %Lx: failed to enable fault, interrupt, ret %d\n",
> >>>>                                   (unsigned long long)drhd->reg_base_addr, ret);
> >>>>
> >>>> Thanks,
> >>>> baolu
> >>> We still see the issue with this change.
> >> I am attempting to reproduce this issue with my MTL machine. I pulled
> >> the test branch from:
> >>
> >> https://anongit.freedesktop.org/git/drm-tip.git
> >>
> >> and built the test kernel image using the configuration file from:
> >>
> >> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_16217/kconfig.txt
> >>
> >> But I did not observe the lockdep splat mentioned above after booting.
> >>
> >> Is there anything I might have missed?
> >>
> > +Suresh, Jani, Lucas
> >
> > We are seeing this only on the Skylake and Kaby Lake machines in our CI runs.
> 
> If so, will the change below make any difference?
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 85aa66ef4d61..ec2f385ae25b 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -3049,6 +3049,7 @@ static int __init probe_acpi_namespace_devices(void)
>                          if (dev->bus != &acpi_bus_type)
>                                  continue;
> 
> +                       up_read(&dmar_global_lock);
>                          adev = to_acpi_device(dev);
>                          mutex_lock(&adev->physical_node_lock);
>                          list_for_each_entry(pn,
> @@ -3058,6 +3059,7 @@ static int __init probe_acpi_namespace_devices(void)
>                                          break;
>                          }
>                          mutex_unlock(&adev->physical_node_lock);
> +                       down_read(&dmar_global_lock);
> 
>                          if (ret)
>                                  return ret;
> 

Thank you for the change. This seems to be working. Can we expect a fix patch soon?

Regards

Chaitanya

> Thanks,
> baolu