[Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915: stop using swiotlb (rev6)

Robert Beckett bob.beckett@collabora.com
Thu Jul 28 15:54:21 UTC 2022



On 28/07/2022 15:03, Tvrtko Ursulin wrote:
> 
> On 28/07/2022 09:01, Patchwork wrote:
> 
> [snip]
> 
>>         Possible regressions
>>
>>   * igt@gem_mmap_offset@clear:
>>       o shard-iclb: PASS
>>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11946/shard-iclb6/igt@gem_mmap_offset@clear.html>
>>         -> INCOMPLETE
>>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_106589v6/shard-iclb1/igt@gem_mmap_offset@clear.html>
>>
> 
> What was supposed to be a simple patch.. a storm of errors like:

yeah, them's the breaks sometimes ....

> 
>   DMAR: ERROR: DMA PTE for vPFN 0x3d00000 already set (to 2fd7ff003 not 2fd7ff003)
>   ------------[ cut here ]------------
>   WARNING: CPU: 6 PID: 1254 at drivers/iommu/intel/iommu.c:2278 __domain_mapping.cold.93+0x32/0x39<>
>   Modules linked in: vgem drm_shmem_helper snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_cod>
>   CPU: 6 PID: 1254 Comm: gem_mmap_offset Not tainted 5.19.0-rc8-Patchwork_106589v6-g0e9c43d76a14+ #>
>   Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS >
>   RIP: 0010:__domain_mapping.cold.93+0x32/0x39
>   Code: fe 48 c7 c7 28 32 37 82 4c 89 5c 24 08 e8 e4 61 fd ff 8b 05 bf 8e c9 00 4c 8b 5c 24 08 85 c>
>   RSP: 0000:ffffc9000037f9c0 EFLAGS: 00010202
>   RAX: 0000000000000004 RBX: ffff8881117b4000 RCX: 0000000000000001
>   RDX: 0000000000000000 RSI: ffffffff82320b25 RDI: 00000000ffffffff
>   RBP: 0000000000000001 R08: 0000000000000000 R09: c0000000ffff7fff
>   R10: 0000000000000001 R11: 00000000002fd7ff R12: 00000002fd7ff003
>   R13: 0000000000076c01 R14: ffff8881039ee800 R15: 0000000003d00000
>   FS:  00007f2863c1d700(0000) GS:ffff88849fd00000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 00007f2692c53000 CR3: 000000011c440006 CR4: 0000000000770ee0
>   PKRU: 55555554
>   Call Trace:
>    <TASK>
>    intel_iommu_map_pages+0xb7/0xe0
>    __iommu_map+0xe0/0x310
>    __iommu_map_sg+0xa2/0x140
>    iommu_dma_map_sg+0x2ef/0x4e0
>    __dma_map_sg_attrs+0x64/0x70
>    dma_map_sg_attrs+0x5/0x20
>    i915_gem_gtt_prepare_pages+0x56/0x70 [i915]
>    shmem_get_pages+0xe3/0x360 [i915]
>    ____i915_gem_object_get_pages+0x32/0x100 [i915]
>    __i915_gem_object_get_pages+0x8d/0xa0 [i915]
>    vm_fault_gtt+0x3d0/0x940 [i915]
>    ? ptlock_alloc+0x15/0x40
>    ? rt_mutex_debug_task_free+0x91/0xa0
>    __do_fault+0x30/0x180
>    do_fault+0x1c4/0x4c0
>    __handle_mm_fault+0x615/0xbe0
>    handle_mm_fault+0x75/0x1c0
>    do_user_addr_fault+0x1e7/0x670
>    exc_page_fault+0x62/0x230
>    asm_exc_page_fault+0x22/0x30
> 
> No idea. Maybe try CI kernel config on your Tigerlake?

I have an idea of what could be happening:

The warning fires because a PTE is already set. We can also see from 
the warning that the old and new values are identical, which indicates 
that the same page has been mapped to the same IOVA before.

This map/shrink loop will keep retrying the mapping of the same sg 
table, shrinking on failure in the hope of freeing up IOVA space:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/i915/i915_gem_gtt.c?h=v5.19-rc8#n32
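
Roughly, that loop looks like this (paraphrased from the link above and 
trimmed, so treat it as a sketch rather than the verbatim source):

int i915_gem_gtt_prepare_pages(struct drm_i915_gem_object *obj,
			       struct sg_table *pages)
{
	do {
		/* try to DMA-map the whole sg table in one go */
		if (dma_map_sg_attrs(obj->base.dev->dev,
				     pages->sgl, pages->nents,
				     DMA_BIDIRECTIONAL,
				     DMA_ATTR_SKIP_CPU_SYNC |
				     DMA_ATTR_NO_KERNEL_MAPPING |
				     DMA_ATTR_NO_WARN))
			return 0;

		/*
		 * On failure, shrink other objects to free up DMA
		 * remapping space, then retry mapping the *same* sg
		 * table. i915_gem_shrink() returning 0 means there is
		 * nothing left to purge.
		 */
	} while (i915_gem_shrink(NULL, to_i915(obj->base.dev),
				 obj->base.size >> PAGE_SHIFT, NULL,
				 I915_SHRINK_BOUND | I915_SHRINK_UNBOUND));

	return -ENOSPC;
}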

If we now look at the Intel IOMMU driver's mapping function, 
__domain_mapping():

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/intel/iommu.c?h=v5.19-rc8#n2248
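
The relevant fragment, again paraphrased and heavily trimmed ("..." 
marks elided code):

	while (nr_pages > 0) {
		uint64_t tmp;

		if (!pte) {
			pte = pfn_to_dma_pte(domain, iov_pfn, &largepage_lvl);
			if (!pte)
				return -ENOMEM;	/* bails out mid-mapping;
						 * PTEs written so far
						 * stay in place */
			...
		}

		tmp = cmpxchg64_local(&pte->val, 0ULL, pteval);
		if (tmp) {
			/* PTE already set: this is the splat above */
			pr_crit("ERROR: DMA PTE for vPFN 0x%lx already set (to %llx not %llx)\n",
				iov_pfn, tmp, (unsigned long long)pteval);
			...
			WARN_ON(1);
		}
		...
	}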

If that loop-breaking -ENOMEM return is hit (presumably because PTE 
space ran out, though I have not dug deeper), the error propagates back 
up the stack until dma_map_sg_attrs() returns 0 to indicate failure, 
which triggers a shrink and retry.
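
For reference, this is because dma_map_sg_attrs() collapses the 
negative errno from the IOMMU path into 0 (kernel/dma/mapping.c at 
v5.19, lightly trimmed):

unsigned int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
			      int nents, enum dma_data_direction dir,
			      unsigned long attrs)
{
	int ret;

	ret = __dma_map_sg_attrs(dev, sg, nents, dir, attrs);
	if (ret < 0)
		ret = 0;	/* callers only learn "failed", not why */

	return ret;
}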

The problem is that the IOMMU driver does not undo its partial mapping 
on error. So the next time round, it will map the same page to the same 
address, producing the same PTE encoding, which gives the warning observed.
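
If that analysis holds, a fix along these lines in __domain_mapping() 
might be enough. This is purely a hypothetical, untested sketch, where 
start_pfn stands for the iov_pfn the call started with and 
dma_pte_clear_range() is the driver's existing PTE teardown helper:

	pte = pfn_to_dma_pte(domain, iov_pfn, &largepage_lvl);
	if (!pte) {
		/* unwind the part of the range we already wrote, so a
		 * retry of the same sg starts from a clean slate */
		if (iov_pfn > start_pfn)
			dma_pte_clear_range(domain, start_pfn, iov_pfn - 1);
		return -ENOMEM;
	}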

I would need some time to reproduce and debug this to confirm, but it 
looks like we might be exposing an IOMMU driver issue by changing our 
mapping patterns, now that the segment sizes are different.

I'll see if I can get some time allotted to debug it further, but for 
now I don't have the bandwidth, so this may need to go on hold until I 
or someone else can find time to look into it.

> 
> Regards,
> 
> Tvrtko

