RE: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list
Chen, Guchun
Guchun.Chen at amd.com
Wed Jan 12 02:19:38 UTC 2022
[Public]
Hi Christian,
My BAD, I checked that discussion history of this just now. So If I read it correctly, the double check at a different place to skip evict is: " drm/ttm: Double check mem_type of BO while eviction"? It is in 5.16 kernel.
Regards,
Guchun
-----Original Message-----
From: Christian König <ckoenig.leichtzumerken at gmail.com>
Sent: Tuesday, January 11, 2022 7:27 PM
To: Chen, Guchun <Guchun.Chen at amd.com>; Pan, Xinhui <Xinhui.Pan at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
Cc: dri-devel at lists.freedesktop.org
Subject: Re: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list
IIRC we have completely dropped this patch in favor of a check at a different place.
Regards,
Christian.
Am 11.01.22 um 09:47 schrieb Chen, Guchun:
> [Public]
>
> Hi Christian,
>
> Looks this patch still missed in 5.16 kernel. Is it intentional?
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.
> kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2
> Ftree%2Fdrivers%2Fgpu%2Fdrm%2Fttm%2Fttm_bo.c%3Fh%3Dv5.16&data=04%7
> C01%7CGuchun.Chen%40amd.com%7Cf3b7f4971dc8405b0c2908d9d4f55547%7C3dd89
> 61fe4884e608e11a82d994e183d%7C0%7C0%7C637774972434004088%7CUnknown%7CT
> WFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI
> 6Mn0%3D%7C3000&sdata=vbuBPHO40J2HGt7abzfzC0nC1DQa62qal5S6TXBRj4w%3
> D&reserved=0
>
> Regards,
> Guchun
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
> Pan, Xinhui
> Sent: Tuesday, November 9, 2021 9:16 PM
> To: Koenig, Christian <Christian.Koenig at amd.com>;
> amd-gfx at lists.freedesktop.org
> Cc: dri-devel at lists.freedesktop.org
> Subject: 回复: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru
> list
>
> [AMD Official Use Only]
>
> [AMD Official Use Only]
>
> Actually this patch does not totally fix the mismatch of lru list with mem_type as mem_type is changed in ->move() and lru list is changed after that.
>
> During this small period, another eviction could still happed and evict this mismatched BO from sMam(say, its lru list is on vram domain) to sMem.
> ________________________________________
> 发件人: Pan, Xinhui <Xinhui.Pan at amd.com>
> 发送时间: 2021年11月9日 21:05
> 收件人: Koenig, Christian; amd-gfx at lists.freedesktop.org
> 抄送: dri-devel at lists.freedesktop.org
> 主题: 回复: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list
>
> Yes, a stable tag is needed. vulkan guys say 5.14 hit this issue too.
>
> I think that amdgpu_bo_move() does support copy from sysMem to sysMem correctly.
> maybe something below is needed.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index c83ef42ca702..aa63ae7ddf1e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -485,7 +485,8 @@ static int amdgpu_bo_move(struct ttm_buffer_object *bo, bool evict,
> }
> if (old_mem->mem_type == TTM_PL_SYSTEM &&
> (new_mem->mem_type == TTM_PL_TT ||
> - new_mem->mem_type == AMDGPU_PL_PREEMPT)) {
> + new_mem->mem_type == AMDGPU_PL_PREEMPT ||
> + new_mem->mem_type == TTM_PL_SYSTEM)) {
> ttm_bo_move_null(bo, new_mem);
> goto out;
> }
>
> otherwise, amdgpu_move_blit() is called to do the system memory copy which use a wrong address.
> 206 /* Map only what can't be accessed directly */
> 207 if (!tmz && mem->start != AMDGPU_BO_INVALID_OFFSET) {
> 208 *addr = amdgpu_ttm_domain_start(adev, mem->mem_type) +
> 209 mm_cur->start;
> 210 return 0;
> 211 }
>
> line 208, *addr is zero. So when amdgpu_copy_buffer submit job with such addr, page fault happens.
>
>
> ________________________________________
> 发件人: Koenig, Christian <Christian.Koenig at amd.com>
> 发送时间: 2021年11月9日 20:35
> 收件人: Pan, Xinhui; amd-gfx at lists.freedesktop.org
> 抄送: dri-devel at lists.freedesktop.org
> 主题: Re: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list
>
> Mhm, I'm not sure what the rational behind that is.
>
> Not moving the BO would make things less efficient, but should never cause a crash.
>
> Maybe we should add a CC: stable tag and push it to -fixes instead?
>
> Christian.
>
> Am 09.11.21 um 13:28 schrieb Pan, Xinhui:
>> [AMD Official Use Only]
>>
>> I hit vulkan cts test hang with navi23.
>>
>> dmesg says gmc page fault with address 0x0, 0x1000, 0x2000....
>> And some debug log also says amdgu copy one BO from system Domain to system Domain which is really weird.
>> ________________________________________
>> 发件人: Koenig, Christian <Christian.Koenig at amd.com>
>> 发送时间: 2021年11月9日 20:20
>> 收件人: Pan, Xinhui; amd-gfx at lists.freedesktop.org
>> 抄送: dri-devel at lists.freedesktop.org
>> 主题: Re: [PATCH] drm/ttm: Put BO in its memory manager's lru list
>>
>> Am 09.11.21 um 12:19 schrieb xinhui pan:
>>> After we move BO to a new memory region, we should put it to the new
>>> memory manager's lru list regardless we unlock the resv or not.
>>>
>>> Signed-off-by: xinhui pan <xinhui.pan at amd.com>
>> Interesting find, did you trigger that somehow or did you just
>> stumbled over it by reading the code?
>>
>> Patch is Reviewed-by: Christian König <christian.koenig at amd.com>, I
>> will pick that up for drm-misc-next.
>>
>> Thanks,
>> Christian.
>>
>>> ---
>>> drivers/gpu/drm/ttm/ttm_bo.c | 2 ++
>>> 1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c
>>> b/drivers/gpu/drm/ttm/ttm_bo.c index f1367107925b..e307004f0b28
>>> 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>>> @@ -701,6 +701,8 @@ int ttm_mem_evict_first(struct ttm_device *bdev,
>>> ret = ttm_bo_evict(bo, ctx);
>>> if (locked)
>>> ttm_bo_unreserve(bo);
>>> + else
>>> + ttm_bo_move_to_lru_tail_unlocked(bo);
>>>
>>> ttm_bo_put(bo);
>>> return ret;
More information about the amd-gfx
mailing list