回复: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list

Pan, Xinhui Xinhui.Pan at amd.com
Tue Nov 9 13:05:36 UTC 2021


[AMD Official Use Only]

Yes, a stable tag is needed. vulkan guys say 5.14 hit this issue too.

I think that amdgpu_bo_move() does support copy from sysMem to sysMem correctly.
maybe something below is needed.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index c83ef42ca702..aa63ae7ddf1e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -485,7 +485,8 @@ static int amdgpu_bo_move(struct ttm_buffer_object *bo, bool evict,
        }
        if (old_mem->mem_type == TTM_PL_SYSTEM &&
            (new_mem->mem_type == TTM_PL_TT ||
-            new_mem->mem_type == AMDGPU_PL_PREEMPT)) {
+            new_mem->mem_type == AMDGPU_PL_PREEMPT ||
+            new_mem->mem_type == TTM_PL_SYSTEM)) {
                ttm_bo_move_null(bo, new_mem);
                goto out;
        }

otherwise, amdgpu_move_blit() is called to do the system memory copy which use a wrong address.
 206         /* Map only what can't be accessed directly */
 207         if (!tmz && mem->start != AMDGPU_BO_INVALID_OFFSET) {
 208                 *addr = amdgpu_ttm_domain_start(adev, mem->mem_type) +
 209                         mm_cur->start;
 210                 return 0;
 211         }

line 208, *addr is zero. So when amdgpu_copy_buffer submit job with such addr, page fault happens.


________________________________________
发件人: Koenig, Christian <Christian.Koenig at amd.com>
发送时间: 2021年11月9日 20:35
收件人: Pan, Xinhui; amd-gfx at lists.freedesktop.org
抄送: dri-devel at lists.freedesktop.org
主题: Re: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list

Mhm, I'm not sure what the rational behind that is.

Not moving the BO would make things less efficient, but should never
cause a crash.

Maybe we should add a CC: stable tag and push it to -fixes instead?

Christian.

Am 09.11.21 um 13:28 schrieb Pan, Xinhui:
> [AMD Official Use Only]
>
> I hit vulkan cts test hang with navi23.
>
> dmesg says gmc page fault with address 0x0, 0x1000, 0x2000....
> And some debug log also says amdgu copy one BO from system Domain to system Domain which is really weird.
> ________________________________________
> 发件人: Koenig, Christian <Christian.Koenig at amd.com>
> 发送时间: 2021年11月9日 20:20
> 收件人: Pan, Xinhui; amd-gfx at lists.freedesktop.org
> 抄送: dri-devel at lists.freedesktop.org
> 主题: Re: [PATCH] drm/ttm: Put BO in its memory manager's lru list
>
> Am 09.11.21 um 12:19 schrieb xinhui pan:
>> After we move BO to a new memory region, we should put it to
>> the new memory manager's lru list regardless we unlock the resv or not.
>>
>> Signed-off-by: xinhui pan <xinhui.pan at amd.com>
> Interesting find, did you trigger that somehow or did you just stumbled
> over it by reading the code?
>
> Patch is Reviewed-by: Christian König <christian.koenig at amd.com>, I will
> pick that up for drm-misc-next.
>
> Thanks,
> Christian.
>
>> ---
>>    drivers/gpu/drm/ttm/ttm_bo.c | 2 ++
>>    1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>> index f1367107925b..e307004f0b28 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>> @@ -701,6 +701,8 @@ int ttm_mem_evict_first(struct ttm_device *bdev,
>>        ret = ttm_bo_evict(bo, ctx);
>>        if (locked)
>>                ttm_bo_unreserve(bo);
>> +     else
>> +             ttm_bo_move_to_lru_tail_unlocked(bo);
>>
>>        ttm_bo_put(bo);
>>        return ret;



More information about the dri-devel mailing list