Re: 回复: 回复: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list

Christian König christian.koenig at amd.com
Tue Nov 9 13:59:12 UTC 2021


In general the correct idea, but the wrong place to check that.

Calling amdgpu_ttm_bo_eviction_valuable() is only optional, but that 
check must be mandatory for correct operation.

This needs to be inside ttm_bo_evict_swapout_allowable().

Christian.

Am 09.11.21 um 14:41 schrieb Pan, Xinhui:
> [AMD Official Use Only]
>
> yes, a double check is needed.
> how about change below.
>
> As long as we detect such mismatch, it indicates another eviction is on going. return false here is reasonable.
>
> @@ -1335,6 +1336,8 @@ static bool amdgpu_ttm_bo_eviction_valuable(struct ttm_buffer_object *bo,
>          struct dma_fence *f;
>          int i;
>
> +       if (bo->resource->mem_type != place->mem_type)
> +               return false;
>          /* Swapout? */
>          if (bo->resource->mem_type == TTM_PL_SYSTEM)
>                  return true;
>
> ________________________________________
> 发件人: Koenig, Christian <Christian.Koenig at amd.com>
> 发送时间: 2021年11月9日 21:18
> 收件人: Pan, Xinhui; amd-gfx at lists.freedesktop.org
> 抄送: dri-devel at lists.freedesktop.org
> 主题: Re: 回复: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list
>
> Exactly that's the reason why we should have the double check in TTM
> I've mentioned in the other mail.
>
> Christian.
>
> Am 09.11.21 um 14:16 schrieb Pan, Xinhui:
>> [AMD Official Use Only]
>>
>> Actually this patch does not totally fix the mismatch of lru list with mem_type as mem_type is changed in ->move() and lru list is changed after that.
>>
>> During this small period, another eviction could still happed and evict this mismatched BO from sMam(say, its lru list is on vram domain) to sMem.
>> ________________________________________
>> 发件人: Pan, Xinhui <Xinhui.Pan at amd.com>
>> 发送时间: 2021年11月9日 21:05
>> 收件人: Koenig, Christian; amd-gfx at lists.freedesktop.org
>> 抄送: dri-devel at lists.freedesktop.org
>> 主题: 回复: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list
>>
>> Yes, a stable tag is needed. vulkan guys say 5.14 hit this issue too.
>>
>> I think that amdgpu_bo_move() does support copy from sysMem to sysMem correctly.
>> maybe something below is needed.
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index c83ef42ca702..aa63ae7ddf1e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -485,7 +485,8 @@ static int amdgpu_bo_move(struct ttm_buffer_object *bo, bool evict,
>>           }
>>           if (old_mem->mem_type == TTM_PL_SYSTEM &&
>>               (new_mem->mem_type == TTM_PL_TT ||
>> -            new_mem->mem_type == AMDGPU_PL_PREEMPT)) {
>> +            new_mem->mem_type == AMDGPU_PL_PREEMPT ||
>> +            new_mem->mem_type == TTM_PL_SYSTEM)) {
>>                   ttm_bo_move_null(bo, new_mem);
>>                   goto out;
>>           }
>>
>> otherwise, amdgpu_move_blit() is called to do the system memory copy which use a wrong address.
>>    206         /* Map only what can't be accessed directly */
>>    207         if (!tmz && mem->start != AMDGPU_BO_INVALID_OFFSET) {
>>    208                 *addr = amdgpu_ttm_domain_start(adev, mem->mem_type) +
>>    209                         mm_cur->start;
>>    210                 return 0;
>>    211         }
>>
>> line 208, *addr is zero. So when amdgpu_copy_buffer submit job with such addr, page fault happens.
>>
>>
>> ________________________________________
>> 发件人: Koenig, Christian <Christian.Koenig at amd.com>
>> 发送时间: 2021年11月9日 20:35
>> 收件人: Pan, Xinhui; amd-gfx at lists.freedesktop.org
>> 抄送: dri-devel at lists.freedesktop.org
>> 主题: Re: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list
>>
>> Mhm, I'm not sure what the rational behind that is.
>>
>> Not moving the BO would make things less efficient, but should never
>> cause a crash.
>>
>> Maybe we should add a CC: stable tag and push it to -fixes instead?
>>
>> Christian.
>>
>> Am 09.11.21 um 13:28 schrieb Pan, Xinhui:
>>> [AMD Official Use Only]
>>>
>>> I hit vulkan cts test hang with navi23.
>>>
>>> dmesg says gmc page fault with address 0x0, 0x1000, 0x2000....
>>> And some debug log also says amdgu copy one BO from system Domain to system Domain which is really weird.
>>> ________________________________________
>>> 发件人: Koenig, Christian <Christian.Koenig at amd.com>
>>> 发送时间: 2021年11月9日 20:20
>>> 收件人: Pan, Xinhui; amd-gfx at lists.freedesktop.org
>>> 抄送: dri-devel at lists.freedesktop.org
>>> 主题: Re: [PATCH] drm/ttm: Put BO in its memory manager's lru list
>>>
>>> Am 09.11.21 um 12:19 schrieb xinhui pan:
>>>> After we move BO to a new memory region, we should put it to
>>>> the new memory manager's lru list regardless we unlock the resv or not.
>>>>
>>>> Signed-off-by: xinhui pan <xinhui.pan at amd.com>
>>> Interesting find, did you trigger that somehow or did you just stumbled
>>> over it by reading the code?
>>>
>>> Patch is Reviewed-by: Christian König <christian.koenig at amd.com>, I will
>>> pick that up for drm-misc-next.
>>>
>>> Thanks,
>>> Christian.
>>>
>>>> ---
>>>>      drivers/gpu/drm/ttm/ttm_bo.c | 2 ++
>>>>      1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>>>> index f1367107925b..e307004f0b28 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>>>> @@ -701,6 +701,8 @@ int ttm_mem_evict_first(struct ttm_device *bdev,
>>>>          ret = ttm_bo_evict(bo, ctx);
>>>>          if (locked)
>>>>                  ttm_bo_unreserve(bo);
>>>> +     else
>>>> +             ttm_bo_move_to_lru_tail_unlocked(bo);
>>>>
>>>>          ttm_bo_put(bo);
>>>>          return ret;



More information about the amd-gfx mailing list