[Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915: stop using swiotlb (rev6)

Thu Jul 28 16:07:08 UTC 2022

On 28/07/2022 16:54, Robert Beckett wrote:
> On 28/07/2022 15:03, Tvrtko Ursulin wrote:
>>
>> On 28/07/2022 09:01, Patchwork wrote:
>>
>> [snip]
>>
>>>         Possible regressions
>>>
>>>   * igt@gem_mmap_offset@clear:
>>>       o shard-iclb: PASS
>>> <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11946/shard-iclb6/igt@gem_mmap_offset@clear.html> 
>>>
>>>         -> INCOMPLETE
>>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_106589v6/shard-iclb1/igt@gem_mmap_offset@clear.html> 
>>>
>>
>> What was supposed to be a simple patch.. a storm of errors like:
> 
> yeah, them's the breaks sometimes ....
> 
>>
>>   DMAR: ERROR: DMA PTE for vPFN 0x3d00000 already set (to 2fd7ff003 not 2fd7ff003)
>>   ------------[ cut here ]------------
>>   WARNING: CPU: 6 PID: 1254 at drivers/iommu/intel/iommu.c:2278 __domain_mapping.cold.93+0x32/0x39<>
>>   Modules linked in: vgem drm_shmem_helper snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_cod>
>>   CPU: 6 PID: 1254 Comm: gem_mmap_offset Not tainted 5.19.0-rc8-Patchwork_106589v6-g0e9c43d76a14+ #>
>>   Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS >
>>   RIP: 0010:__domain_mapping.cold.93+0x32/0x39
>>   Code: fe 48 c7 c7 28 32 37 82 4c 89 5c 24 08 e8 e4 61 fd ff 8b 05 bf 8e c9 00 4c 8b 5c 24 08 85 c>
>>   RSP: 0000:ffffc9000037f9c0 EFLAGS: 00010202
>>   RAX: 0000000000000004 RBX: ffff8881117b4000 RCX: 0000000000000001
>>   RDX: 0000000000000000 RSI: ffffffff82320b25 RDI: 00000000ffffffff
>>   RBP: 0000000000000001 R08: 0000000000000000 R09: c0000000ffff7fff
>>   R10: 0000000000000001 R11: 00000000002fd7ff R12: 00000002fd7ff003
>>   R13: 0000000000076c01 R14: ffff8881039ee800 R15: 0000000003d00000
>>   FS:  00007f2863c1d700(0000) GS:ffff88849fd00000(0000) knlGS:0000000000000000
>>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   CR2: 00007f2692c53000 CR3: 000000011c440006 CR4: 0000000000770ee0
>>   PKRU: 55555554
>>   Call Trace:
>>    <TASK>
>>    intel_iommu_map_pages+0xb7/0xe0
>>    __iommu_map+0xe0/0x310
>>    __iommu_map_sg+0xa2/0x140
>>    iommu_dma_map_sg+0x2ef/0x4e0
>>    __dma_map_sg_attrs+0x64/0x70
>>    dma_map_sg_attrs+0x5/0x20
>>    i915_gem_gtt_prepare_pages+0x56/0x70 [i915]
>>    shmem_get_pages+0xe3/0x360 [i915]
>>    ____i915_gem_object_get_pages+0x32/0x100 [i915]
>>    __i915_gem_object_get_pages+0x8d/0xa0 [i915]
>>    vm_fault_gtt+0x3d0/0x940 [i915]
>>    ? ptlock_alloc+0x15/0x40
>>    ? rt_mutex_debug_task_free+0x91/0xa0
>>    __do_fault+0x30/0x180
>>    do_fault+0x1c4/0x4c0
>>    __handle_mm_fault+0x615/0xbe0
>>    handle_mm_fault+0x75/0x1c0
>>    do_user_addr_fault+0x1e7/0x670
>>    exc_page_fault+0x62/0x230
>>    asm_exc_page_fault+0x22/0x30
>>
>> No idea. Maybe try the CI kernel config on your Tigerlake?
> 
> I have an idea of what could be happening:
> 
> The warning is due to a pte already existing. We can see from the 
> warning that it is the same value, which indicates that the same page 
> has been mapped to the same iova before.
> 
> This map/shrink loop keeps retrying the mapping of the same sg, 
> shrinking after each failure in the hope of freeing up iova space.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/i915/i915_gem_gtt.c?h=v5.19-rc8#n32 
> 
> 
> If we now look at the intel iommu driver's mapping function:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/intel/iommu.c?h=v5.19-rc8#n2248 
> 
> 
> If that loop-breaking -ENOMEM return is hit (presumably from running out of 
> pte space, though I have not delved deeper), the error propagates back up 
> the stack until dma_map_sg_attrs() returns 0 to indicate the failure. 
> This will cause a shrink and retry.
> 
> The problem is that the iommu does not undo its partial mapping on 
> error. So the next time round, it will map the same page to the same 
> address, giving the same pte encoding, which would produce the warning 
> observed.
> 
> I would need to get some time to try to repro and debug to confirm, but 
> this looks like it might be exposing an iommu driver issue due to us 
> changing our mapping patterns because the segment sizes are now different.
> 
> I'll see if I can get some time allotted to debug it further, but for 
> now I don't have the bandwidth, so this may need to go on hold until I 
> or someone else can get time to look into it.
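
To make the suspected sequence above concrete, here is a minimal userspace
sketch in C. It is a toy model, not kernel code: every toy_* name, the
fixed-size slot array, and the "budget" that stands in for running out of
page-table memory are hypothetical and chosen only for illustration. It shows
how a map that fails part-way without unwinding, followed by a retry of the
same pages at the same addresses, trips an "already set" check with an
identical value, and how unwinding on error avoids it. The hypothesis itself
is still unconfirmed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_SLOTS 8 /* hypothetical size of the toy IOVA range */

/* Toy page table: slot i holds the "PTE" for IOVA page i, 0 means unset. */
struct toy_domain {
	unsigned long pte[TOY_NUM_SLOTS];
	size_t budget;     /* new PTE writes allowed before a simulated -ENOMEM */
	unsigned warnings; /* number of "PTE already set" events seen */
};

/*
 * Map npages pages starting at slot 0. Returns npages on success and 0 on
 * failure. When unwind is false, entries written before the failure are
 * left behind, which is the behaviour suspected in the thread.
 */
static size_t toy_map(struct toy_domain *d, const unsigned long *pages,
		      size_t npages, bool unwind)
{
	size_t i;

	assert(npages <= TOY_NUM_SLOTS);

	for (i = 0; i < npages; i++) {
		if (d->pte[i] != 0)
			d->warnings++; /* stale entry from an earlier attempt */
		if (d->budget == 0)
			goto fail;     /* simulated allocation failure */
		d->budget--;
		d->pte[i] = pages[i];
	}
	return npages;

fail:
	if (unwind)
		while (i--)
			d->pte[i] = 0; /* undo the partial mapping */
	return 0;
}

/*
 * Mimic a map/shrink/retry loop: the first attempt fails part-way, the
 * "shrink" refills the budget, and the same pages are mapped again.
 * Returns how many "already set" warnings were triggered.
 */
static unsigned toy_run(bool unwind)
{
	const unsigned long pages[4] = { 0x1003, 0x2003, 0x3003, 0x4003 };
	struct toy_domain d = { .budget = 2 };
	int attempt;

	for (attempt = 0; attempt < 2; attempt++) {
		if (toy_map(&d, pages, 4, unwind) == 4)
			break;
		d.budget = TOY_NUM_SLOTS; /* stand-in for a successful shrink */
	}
	return d.warnings;
}
```

In this model toy_run(false) reports one warning for each entry left over
from the failed first attempt, while toy_run(true) reports none because the
failed attempt cleans up after itself before the retry.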

Yeah that's understandable. I also currently don't have any free 
bandwidth unfortunately.

+ Christoph FYI, as per above, swiotlb API usage removal is currently a 
bit stuck until we find someone with some spare time to debug this further.

Regards,

Tvrtko