commit 7ffb791423c7 breaks steam game
Balbir Singh
balbirs at nvidia.com
Mon Mar 24 21:43:54 UTC 2025
On 3/24/25 22:23, Bert Karwatzki wrote:
> On Sunday, 23.03.2025 at 17:51 +1100, Balbir Singh wrote:
>> On 3/22/25 23:23, Bert Karwatzki wrote:
>>> The problem occurs in this part of ttm_tt_populate(): in the nokaslr case
>>> the loop is entered and runs repeatedly because ttm_dma32_pages_allocated
>>> exceeds ttm_dma32_pages_limit, which leads to lots of calls to
>>> ttm_global_swapout().
>>>
>>>         if (!strcmp(get_current()->comm, "stellaris"))
>>>                 printk(KERN_INFO "%s: ttm_pages_allocated=0x%lx ttm_pages_limit=0x%lx ttm_dma32_pages_allocated=0x%lx ttm_dma32_pages_limit=0x%lx\n",
>>>                        __func__, atomic_long_read(&ttm_pages_allocated), ttm_pages_limit,
>>>                        atomic_long_read(&ttm_dma32_pages_allocated), ttm_dma32_pages_limit);
>>>         while (atomic_long_read(&ttm_pages_allocated) > ttm_pages_limit ||
>>>                atomic_long_read(&ttm_dma32_pages_allocated) >
>>>                ttm_dma32_pages_limit) {
>>>
>>>                 if (!strcmp(get_current()->comm, "stellaris"))
>>>                         printk(KERN_INFO "%s: count=%d ttm_pages_allocated=0x%lx ttm_pages_limit=0x%lx ttm_dma32_pages_allocated=0x%lx ttm_dma32_pages_limit=0x%lx\n",
>>>                                __func__, count++, atomic_long_read(&ttm_pages_allocated), ttm_pages_limit,
>>>                                atomic_long_read(&ttm_dma32_pages_allocated), ttm_dma32_pages_limit);
>>>                 ret = ttm_global_swapout(ctx, GFP_KERNEL);
>>>                 if (ret == 0)
>>>                         break;
>>>                 if (ret < 0)
>>>                         goto error;
>>>         }
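>>>
>>> (If I read this right, ttm_global_swapout() returns the number of pages it
>>> managed to swap out, so the loop only exits once it returns 0, fails, or
>>> the counters finally drop below their limits.)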
>>>
>>> In the case without nokaslr, ttm_dma32_pages_allocated stays 0 because
>>> use_dma32 == false there.
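>>>
>>> If I read ttm_tt_populate() right, the dma32 counter is only ever bumped
>>> when the device's pool was created with use_dma32 (paraphrased from
>>> drivers/gpu/drm/ttm/ttm_tt.c, details may differ by kernel version):
>>>
>>>         atomic_long_add(ttm->num_pages, &ttm_pages_allocated);
>>>         if (bdev->pool.use_dma32)
>>>                 atomic_long_add(ttm->num_pages,
>>>                                 &ttm_dma32_pages_allocated);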
>>>
>>> So why is use_dma32 enabled with nokaslr? Some more printk()s give this result:
>>>
>>> The GPUs:
>>> built-in:
>>> 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c5)
>>> discrete:
>>> 03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] (rev c3)
>>>
>>> With nokaslr:
>>> [ 1.266517] [ T328] dma_addressing_limited: mask = 0xfffffffffff bus_dma_limit = 0x0 required_mask = 0xfffffffff
>>> [ 1.266519] [ T328] dma_addressing_limited: ops = 0000000000000000 use_dma_iommu(dev) = 0
>>> [ 1.266520] [ T328] dma_direct_all_ram_mapped: returning true
>>> [ 1.266521] [ T328] dma_addressing_limited: returning ret = 0
>>> [ 1.266521] [ T328] amdgpu 0000:03:00.0: amdgpu: amdgpu_ttm_init: calling ttm_device_init() with use_dma32 = 0
>>> [ 1.266525] [ T328] entering ttm_device_init, use_dma32 = 0
>>> [ 1.267115] [ T328] entering ttm_pool_init, use_dma32 = 0
>>>
>>> [ 3.965669] [ T328] dma_addressing_limited: mask = 0xfffffffffff bus_dma_limit = 0x0 required_mask = 0x3fffffffffff
>>> [ 3.965671] [ T328] dma_addressing_limited: returning true
>>> [ 3.965672] [ T328] amdgpu 0000:08:00.0: amdgpu: amdgpu_ttm_init: calling ttm_device_init() with use_dma32 = 1
>>> [ 3.965674] [ T328] entering ttm_device_init, use_dma32 = 1
>>> [ 3.965747] [ T328] entering ttm_pool_init, use_dma32 = 1
>>>
>>> Without nokaslr:
>>> [ 1.300907] [ T351] dma_addressing_limited: mask = 0xfffffffffff bus_dma_limit = 0x0 required_mask = 0xfffffffff
>>> [ 1.300909] [ T351] dma_addressing_limited: ops = 0000000000000000 use_dma_iommu(dev) = 0
>>> [ 1.300910] [ T351] dma_direct_all_ram_mapped: returning true
>>> [ 1.300910] [ T351] dma_addressing_limited: returning ret = 0
>>> [ 1.300911] [ T351] amdgpu 0000:03:00.0: amdgpu: amdgpu_ttm_init: calling ttm_device_init() with use_dma32 = 0
>>> [ 1.300915] [ T351] entering ttm_device_init, use_dma32 = 0
>>> [ 1.301210] [ T351] entering ttm_pool_init, use_dma32 = 0
>>>
>>> [ 4.000602] [ T351] dma_addressing_limited: mask = 0xfffffffffff bus_dma_limit = 0x0 required_mask = 0xfffffffffff
>>> [ 4.000603] [ T351] dma_addressing_limited: ops = 0000000000000000 use_dma_iommu(dev) = 0
>>> [ 4.000604] [ T351] dma_direct_all_ram_mapped: returning true
>>> [ 4.000605] [ T351] dma_addressing_limited: returning ret = 0
>>> [ 4.000606] [ T351] amdgpu 0000:08:00.0: amdgpu: amdgpu_ttm_init: calling ttm_device_init() with use_dma32 = 0
>>> [ 4.000610] [ T351] entering ttm_device_init, use_dma32 = 0
>>> [ 4.000687] [ T351] entering ttm_pool_init, use_dma32 = 0
>>>
>>> So with nokaslr the required mask for the built-in GPU changes from
>>> 0xfffffffffff to 0x3fffffffffff, which causes dma_addressing_limited()
>>> to return true, which in turn causes ttm_device_init() to be called with
>>> use_dma32 = true.
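>>>
>>> (0xfffffffffff is a 44-bit mask while 0x3fffffffffff needs 46 bits; the
>>> device's dma_mask stays at 0xfffffffffff, so mask < required_mask and the
>>> device is reported as addressing-limited.)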
>>
>> Thanks, this is really the root cause, from what I understand.
>>
>>> It also shows that nothing changes for the discrete GPU, so the bug does
>>> not occur there.
>>>
>>> I was also able to work around the bug by calling ttm_device_init() with
>>> use_dma32 = false from amdgpu_ttm_init() (drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c),
>>> but I'm not sure whether this has unwanted side effects.
>>>
>>> int amdgpu_ttm_init(struct amdgpu_device *adev)
>>> {
>>>         uint64_t gtt_size;
>>>         int r;
>>>
>>>         mutex_init(&adev->mman.gtt_window_lock);
>>>
>>>         dma_set_max_seg_size(adev->dev, UINT_MAX);
>>>         /* No other users of the address space, so set it to 0 */
>>>         dev_info(adev->dev, "%s: calling ttm_device_init() with use_dma32 = 0 ignoring %d\n",
>>>                  __func__, dma_addressing_limited(adev->dev));
>>>         r = ttm_device_init(&adev->mman.bdev, &amdgpu_bo_driver, adev->dev,
>>>                             adev_to_drm(adev)->anon_inode->i_mapping,
>>>                             adev_to_drm(adev)->vma_offset_manager,
>>>                             adev->need_swiotlb,
>>>                             false /* use_dma32 */);
>>>         if (r) {
>>>                 DRM_ERROR("failed initializing buffer object driver(%d).\n", r);
>>>                 return r;
>>>         }
>>>
>>
>> I think this brings us really close. Instead of forcing use_dma32 to
>> false, I wonder if we need something like
>>
>> uint64_t dma_bits = fls64(dma_get_mask(adev->dev));
>>
>> and then pass the last argument of ttm_device_init() (use_dma32) as
>> dma_bits <= 32?
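>>
>> Roughly, as an untested sketch (same names as in amdgpu_ttm_init() above):
>>
>>         /* use the width of the device's DMA mask to decide on use_dma32 */
>>         uint64_t dma_bits = fls64(dma_get_mask(adev->dev));
>>
>>         r = ttm_device_init(&adev->mman.bdev, &amdgpu_bo_driver, adev->dev,
>>                             adev_to_drm(adev)->anon_inode->i_mapping,
>>                             adev_to_drm(adev)->vma_offset_manager,
>>                             adev->need_swiotlb,
>>                             dma_bits <= 32 /* use_dma32 */);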
>>
>>
>> Thanks,
>> Balbir Singh
>>
>
> Do these address bits have to shift when using nokaslr or PCI_P2PDMA? I
> think this shift causes the increase of the required dma mask to
> 0x3fffffffffff.
>
That depends on the dma ops; as per dma-api.rst:

"dma_get_required_mask(struct device *dev)

	This API returns the mask that the platform requires to
	operate efficiently. Usually this means the returned mask
	is the minimum required to cover all of memory."

I think the assumption that dma_addressing_limited() returning true
(because the device's dma_mask is smaller than the required_mask)
implies use_dma32 = true is incorrect.
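
For reference, dma_addressing_limited() looks roughly like this
(paraphrased from kernel/dma/mapping.c, details may differ per tree):

bool dma_addressing_limited(struct device *dev)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	/* mask (capped by bus_dma_limit) cannot cover the required range */
	if (min_not_zero(dma_get_mask(dev), dev->bus_dma_limit) <
	    dma_get_required_mask(dev))
		return true;

	if (unlikely(ops) || use_dma_iommu(dev))
		return false;

	/* direct mapping: limited only if some RAM is unreachable */
	return !dma_direct_all_ram_mapped(dev);
}

The first branch is the one that fires in your nokaslr trace; it only
says the device cannot reach the whole required range, not that it is
limited to 32 bits.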
> @@ -104,4 +104,4 @@
> fe30300000-fe303fffff : 0000:04:00.0
> fe30400000-fe30403fff : 0000:04:00.0
> fe30404000-fe30404fff : 0000:04:00.0
> -afe00000000-affffffffff : 0000:03:00.0
> +3ffe00000000-3fffffffffff : 0000:03:00.0
>
> And what memory is this? It's 8G in size so it could be the RAM of the discrete
> GPU (which is at PCI 0000:03:00.0), but that is already here (part of
> /proc/iomem):
>
>
I think the mask is independent of what is mapped there; all it says is
that the platform needs to address up to 46 bits.
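
(The region now ends at 0x3fffffffffff: the highest set bit is bit 45,
so 46 address bits are needed, and fls64(0x3fffffffffff) == 46 matches
the required_mask in your log. The size is unchanged:
0x3fffffffffff - 0x3ffe00000000 + 1 = 0x200000000 = 8 GiB, the same as
the old 0xafe00000000-0xaffffffffff window, which only needed 44 bits.)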
Balbir Singh