Regression on drm-tip
Baolu Lu
baolu.lu at linux.intel.com
Sun Mar 16 08:03:21 UTC 2025
On 3/16/25 15:27, Borah, Chaitanya Kumar wrote:
>
>> -----Original Message-----
>> From: Baolu Lu <baolu.lu at linux.intel.com>
>> Sent: Sunday, March 16, 2025 8:04 AM
>> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>
>> Cc: intel-gfx at lists.freedesktop.org; intel-xe at lists.freedesktop.org;
>> iommu at lists.linux.dev
>> Subject: Re: Regression on drm-tip
>>
>> On 3/14/25 17:04, Borah, Chaitanya Kumar wrote:
>>>
>>>> -----Original Message-----
>>>> From: Baolu Lu <baolu.lu at linux.intel.com>
>>>> Sent: Thursday, March 13, 2025 7:53 PM
>>>> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>
>>>> Cc: baolu.lu at linux.intel.com; intel-gfx at lists.freedesktop.org;
>>>> intel-xe at lists.freedesktop.org; iommu at lists.linux.dev
>>>> Subject: Re: Regression on drm-tip
>>>>
>>>> On 2025/3/13 16:51, Borah, Chaitanya Kumar wrote:
>>>>> Hello Lu,
>>>>>
>>>>> Hope you are doing well. I am Chaitanya from the linux graphics team
>>>>> in Intel.
>>>>> This mail is regarding a regression we are seeing in our CI runs[1]
>>>>> on drm-tip repository.
>>>>> ``````````````````````````````````````````````````````````````````````
>>>>> <4>[ 2.856622] WARNING: possible circular locking dependency detected
>>>>> <4>[ 2.856631] 6.14.0-rc5-CI_DRM_16217-gc55ef90b69d3+ #1 Tainted: G I
>>>>> <4>[ 2.856642] ------------------------------------------------------
>>>>> <4>[ 2.856650] swapper/0/1 is trying to acquire lock:
>>>>> <4>[ 2.856657] ffffffff8360ecc8 (iommu_probe_device_lock){+.+.}-{3:3}, at: iommu_probe_device+0x1d/0x70
>>>>> <4>[ 2.856679]
>>>>>                but task is already holding lock:
>>>>> <4>[ 2.856686] ffff888102ab6fa8 (&device->physical_node_lock){+.+.}-{3:3}, at: intel_iommu_init+0xea1/0x1220
>>>>> ``````````````````````````````````````````````````````````````````````
>>>>> The detailed log can be found in [2].
>>>>>
>>>>> After bisecting the tree, the following patch [3] seems to be the
>>>>> first "bad" commit
>>>>>
>>>>> ``````````````````````````````````````````````````````````````````````
>>>>> commit b150654f74bf0df8e6a7936d5ec51400d9ec06d8
>>>>> Author: Lu Baolu <baolu.lu at linux.intel.com>
>>>>> Date: Fri Feb 28 18:27:26 2025 +0800
>>>>>
>>>>>     iommu/vt-d: Fix suspicious RCU usage
>>>>>
>>>>> ``````````````````````````````````````````````````````````````````````
>>>>>
>>>>> We also verified that if we revert the patch the issue is not seen.
>>>>>
>>>>> Could you please check why the patch causes this regression and
>>>>> provide a fix if necessary?
>>>>
>>>> Can you please take a quick test to check if the following fix works?
>>>>
>>>> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
>>>> index e540092d664d..06debeaec643 100644
>>>> --- a/drivers/iommu/intel/dmar.c
>>>> +++ b/drivers/iommu/intel/dmar.c
>>>> @@ -2051,8 +2051,13 @@ int enable_drhd_fault_handling(unsigned int cpu)
>>>>  		if (iommu->irq || iommu->node != cpu_to_node(cpu))
>>>>  			continue;
>>>>
>>>> +		/*
>>>> +		 * Call dmar_alloc_hwirq() with dmar_global_lock held,
>>>> +		 * could cause possible lock race condition.
>>>> +		 */
>>>> +		up_read(&dmar_global_lock);
>>>>  		ret = dmar_set_interrupt(iommu);
>>>> -
>>>> +		down_read(&dmar_global_lock);
>>>>  		if (ret) {
>>>>  			pr_err("DRHD %Lx: failed to enable fault, interrupt, ret %d\n",
>>>>  			       (unsigned long long)drhd->reg_base_addr, ret);
>>>>
>>>> Thanks,
>>>> baolu
>>> We still see the issue with this change.
>> I am attempting to reproduce this issue with my MTL machine. I pulled the
>> test branch from:
>>
>> https://anongit.freedesktop.org/git/drm-tip.git
>>
>> and built the test kernel image using the configuration file from:
>>
>> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_16217/kconfig.txt
>>
>> But I did not observe the lockdep splat mentioned above after booting.
>>
>> Is there anything I might have missed?
>>
> +Suresh, Jani, Lucas
>
> We are seeing this only on Skylake and Kaby Lake in our CI runs.
If so, will the change below make any difference?

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 85aa66ef4d61..ec2f385ae25b 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3049,6 +3049,7 @@ static int __init probe_acpi_namespace_devices(void)
 			if (dev->bus != &acpi_bus_type)
 				continue;
+			up_read(&dmar_global_lock);
 			adev = to_acpi_device(dev);
 			mutex_lock(&adev->physical_node_lock);
 			list_for_each_entry(pn,
@@ -3058,6 +3059,7 @@ static int __init probe_acpi_namespace_devices(void)
 					break;
 			}
 			mutex_unlock(&adev->physical_node_lock);
+			down_read(&dmar_global_lock);
 			if (ret)
 				return ret;

Thanks,
baolu