Regression on drm-tip
Baolu Lu
baolu.lu at linux.intel.com
Mon Mar 17 04:04:40 UTC 2025
On 3/16/25 18:01, Borah, Chaitanya Kumar wrote:
>
>> -----Original Message-----
>> From: Baolu Lu<baolu.lu at linux.intel.com>
>> Sent: Sunday, March 16, 2025 1:33 PM
>> To: Borah, Chaitanya Kumar<chaitanya.kumar.borah at intel.com>
>> Cc:intel-gfx at lists.freedesktop.org;intel-xe at lists.freedesktop.org;
>> iommu at lists.linux.dev; Kurmi, Suresh Kumar
>> <suresh.kumar.kurmi at intel.com>; Saarinen, Jani<jani.saarinen at intel.com>;
>> De Marchi, Lucas<lucas.demarchi at intel.com>
>> Subject: Re: Regression on drm-tip
>>
>> On 3/16/25 15:27, Borah, Chaitanya Kumar wrote:
>>>> -----Original Message-----
>>>> From: Baolu Lu<baolu.lu at linux.intel.com>
>>>> Sent: Sunday, March 16, 2025 8:04 AM
>>>> To: Borah, Chaitanya Kumar<chaitanya.kumar.borah at intel.com>
>>>> Cc:intel-gfx at lists.freedesktop.org;intel-xe at lists.freedesktop.org;
>>>> iommu at lists.linux.dev
>>>> Subject: Re: Regression on drm-tip
>>>>
>>>> On 3/14/25 17:04, Borah, Chaitanya Kumar wrote:
>>>>>> -----Original Message-----
>>>>>> From: Baolu Lu<baolu.lu at linux.intel.com>
>>>>>> Sent: Thursday, March 13, 2025 7:53 PM
>>>>>> To: Borah, Chaitanya Kumar<chaitanya.kumar.borah at intel.com>
>>>>>> Cc:baolu.lu at linux.intel.com;intel-gfx at lists.freedesktop.org; intel-
>>>>>> xe at lists.freedesktop.org;iommu at lists.linux.dev
>>>>>> Subject: Re: Regression on drm-tip
>>>>>>
>>>>>> On 2025/3/13 16:51, Borah, Chaitanya Kumar wrote:
>>>>>>> Hello Lu,
>>>>>>>
>>>>>>> Hope you are doing well. I am Chaitanya from the Linux graphics
>>>>>>> team at Intel.
>>>>>>> This mail is regarding a regression we are seeing in our CI
>>>>>>> runs [1] on the drm-tip repository.
>>>>>>> ``````````````````````````````````````````````````````````````````
>>>>>>> <4>[ 2.856622] WARNING: possible circular locking dependency detected
>>>>>>> <4>[ 2.856631] 6.14.0-rc5-CI_DRM_16217-gc55ef90b69d3+ #1 Tainted: G I
>>>>>>> <4>[ 2.856642] ------------------------------------------------------
>>>>>>> <4>[ 2.856650] swapper/0/1 is trying to acquire lock:
>>>>>>> <4>[ 2.856657] ffffffff8360ecc8 (iommu_probe_device_lock){+.+.}-{3:3}, at: iommu_probe_device+0x1d/0x70
>>>>>>> <4>[ 2.856679]
>>>>>>> but task is already holding lock:
>>>>>>> <4>[ 2.856686] ffff888102ab6fa8 (&device->physical_node_lock){+.+.}-{3:3}, at: intel_iommu_init+0xea1/0x1220
>>>>>>> ``````````````````````````````````````````````````````````````````
>>>>>>> The detailed log can be found in [2].
>>>>>>>
>>>>>>> After bisecting the tree, the following patch [3] seems to be the
>>>>>>> first "bad" commit
>>>>>>>
>>>>>>> ``````````````````````````````````````````````````````````````````
>>>>>>> commit b150654f74bf0df8e6a7936d5ec51400d9ec06d8
>>>>>>> Author: Lu Baolu <baolu.lu at linux.intel.com>
>>>>>>> Date: Fri Feb 28 18:27:26 2025 +0800
>>>>>>>
>>>>>>> iommu/vt-d: Fix suspicious RCU usage
>>>>>>>
>>>>>>> ``````````````````````````````````````````````````````````````````
>>>>>>>
>>>>>>> We also verified that if we revert the patch the issue is not seen.
>>>>>>>
>>>>>>> Could you please check why the patch causes this regression and
>>>>>>> provide a fix if necessary?
>>>>>>
>>>>>> Can you please run a quick test to check if the following fix works?
>>>>>>
>>>>>> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
>>>>>> index e540092d664d..06debeaec643 100644
>>>>>> --- a/drivers/iommu/intel/dmar.c
>>>>>> +++ b/drivers/iommu/intel/dmar.c
>>>>>> @@ -2051,8 +2051,13 @@ int enable_drhd_fault_handling(unsigned int cpu)
>>>>>>                  if (iommu->irq || iommu->node != cpu_to_node(cpu))
>>>>>>                          continue;
>>>>>>
>>>>>> +                /*
>>>>>> +                 * Call dmar_alloc_hwirq() with dmar_global_lock held,
>>>>>> +                 * could cause possible lock race condition.
>>>>>> +                 */
>>>>>> +                up_read(&dmar_global_lock);
>>>>>>                  ret = dmar_set_interrupt(iommu);
>>>>>> -
>>>>>> +                down_read(&dmar_global_lock);
>>>>>>                  if (ret) {
>>>>>>                          pr_err("DRHD %Lx: failed to enable fault, interrupt, ret %d\n",
>>>>>>                                 (unsigned long long)drhd->reg_base_addr, ret);
>>>>>>
>>>>>> Thanks,
>>>>>> baolu
>>>>> We still see the issue with this change.
>>>> I am attempting to reproduce this issue with my MTL machine. I pulled
>>>> the test branch from:
>>>>
>>>> https://anongit.freedesktop.org/git/drm-tip.git
>>>>
>>>> and built the test kernel image using the configuration file from:
>>>>
>>>> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_16217/kconfig.txt
>>>>
>>>> But I did not observe the lockdep splat mentioned above after booting.
>>>>
>>>> Is there anything I might have missed?
>>>>
>>> +Suresh, Jani, Lucas
>>>
>>> We are seeing this only on Skylake and Kaby Lake in our CI runs.
>> If so, will the change below make any difference?
>>
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index 85aa66ef4d61..ec2f385ae25b 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -3049,6 +3049,7 @@ static int __init probe_acpi_namespace_devices(void)
>>                  if (dev->bus != &acpi_bus_type)
>>                          continue;
>>
>> +                up_read(&dmar_global_lock);
>>                  adev = to_acpi_device(dev);
>>                  mutex_lock(&adev->physical_node_lock);
>>                  list_for_each_entry(pn,
>> @@ -3058,6 +3059,7 @@ static int __init probe_acpi_namespace_devices(void)
>>                                  break;
>>                  }
>>                  mutex_unlock(&adev->physical_node_lock);
>> +                down_read(&dmar_global_lock);
>>
>>                  if (ret)
>>                          return ret;
>>
> Thank you for the change. This seems to be working. Can we expect a fix patch soon?
Sure. I have posted a fix patch here,
https://lore.kernel.org/linux-iommu/20250317035714.1041549-1-baolu.lu@linux.intel.com/
Thanks,
baolu