[PATCH 1/3] drm/amdgpu: fix gtt mgr available statistics

Thu Apr 20 08:54:01 UTC 2017

On 20.04.2017 at 05:10, zhoucm1 wrote:
>
>
> On 2017-04-19 17:40, Christian König wrote:
>> On 19.04.2017 at 11:15, zhoucm1 wrote:
>>>
>>>
>>> On 2017-04-19 14:59, Christian König wrote:
>>>> On 19.04.2017 at 08:52, zhoucm1 wrote:
>>>>>
>>>>>
>>>>> On 2017-04-19 14:38, Christian König wrote:
>>>>>> On 19.04.2017 at 05:50, Chunming Zhou wrote:
>>>>>>> gtt_mgr_alloc is called from many places in the driver itself, while
>>>>>>> gtt_mgr_new is called by get_node in TTM.
>>>>>>
>>>>>> NAK, that can lead to over-allocating the address space, and we
>>>>>> can't handle that during suspend/resume.
>>>>> I don't understand what you mean here.
>>>>>
>>>>> Let me first describe it from my side. I found this issue on APUs like
>>>>> Carrizo, where VRAM is small, so GTT is used much more. We always found
>>>>> that GTT could not be used up completely, so the game failed to
>>>>> allocate memory, yet a dump of the mm hole table showed many
>>>>> free holes and plenty of free memory.
>>>>
>>>> In this case you need to increase the gartsize parameter of the
>>>> kernel module. What is the default size chosen for this? Maybe we
>>>> need to adjust that default.
>>>>
>>>>> The root cause is that the mgr->available statistic is incorrect;
>>>>> it doesn't match the dumped mm table.
>>>>
>>>> That's actually correct. 
>>> There will be a problem here: memory not mapped in the mm table
>>> cannot be evicted by TTM.
>>
>> Why do you think so? It's possible that we have a bug somewhere 
>> regarding this, but at least in theory that shouldn't be the case.
> After more thinking and experiments, I think you're right. With my
> patch, the gtt mgr can steal memory without a proper limit,
> so the game can allocate more memory than the gtt size, which is why it runs.
> I will just increase the gtt size to 3G by default and add a precise gtt
> printout; some of our script tools are based on it.

Wait a second with that. I had more time to think about it, and you might
indeed have stumbled over something:

1. In amdgpu_evict_flags() we set an upper limit to force direct
allocation of the BO.
2. In amdgpu_gtt_mgr_alloc() we set the node start to 
AMDGPU_BO_INVALID_OFFSET if we don't map the BO into GART.
3. In amdgpu_ttm_bo_eviction_valuable() we check if the BO in the LRU 
matches the requested limit.

So what happens is that unmapped BOs are never evicted correctly when
we run out of GART memory during eviction from VRAM!
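
To make the three points above concrete, here is a minimal standalone C
sketch, not driver code. All names in it (sketch_node, sketch_place,
sketch_eviction_valuable, sketch_count_evictable) and the value of
SKETCH_INVALID_OFFSET are made up for illustration, and the range check
is only an assumption about what the comparison roughly looks like; the
real check in amdgpu_ttm_bo_eviction_valuable() may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for AMDGPU_BO_INVALID_OFFSET (value is an assumption). */
#define SKETCH_INVALID_OFFSET UINT64_MAX

/* Simplified model of a BO's GTT node: start page and size in pages. */
struct sketch_node {
	uint64_t start;
	uint64_t num_pages;
};

/* Simplified placement limit [fpfn, lpfn); lpfn == 0 means "no upper limit". */
struct sketch_place {
	uint64_t fpfn;
	uint64_t lpfn;
};

/*
 * Assumed range check: a node is worth evicting only if it overlaps
 * the requested page range.
 */
static bool sketch_eviction_valuable(const struct sketch_node *node,
				     const struct sketch_place *place)
{
	/* An upper limit at or below the node start rules the node out. */
	if (place->lpfn != 0 && place->lpfn <= node->start)
		return false;

	/* A lower limit at or past the node end rules it out (overflow-safe). */
	if (node->start <= UINT64_MAX - node->num_pages &&
	    place->fpfn >= node->start + node->num_pages)
		return false;

	return true;
}

/* Count how many nodes on a (modelled) LRU list would be considered for eviction. */
static size_t sketch_count_evictable(const struct sketch_node *lru, size_t n,
				     const struct sketch_place *place)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (sketch_eviction_valuable(&lru[i], place))
			count++;

	return count;
}
```

In this model, a node whose start is SKETCH_INVALID_OFFSET fails the
check for every non-zero upper limit, so it is skipped on the LRU no
matter how much memory it holds. Only a request without an upper limit
(lpfn == 0) would consider it.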

Give me a second to hack a patch for this.

Regards,
Christian.

>
> Thanks,
> David Zhou
>
>>
>>> This is the root cause; we need to find a solution for it.
>>
>> Yeah, that sounds like the handling is buggy somehow.
>>
>> Regards,
>> Christian.
>>
>>> And yes, I agree with your other points.
>>>
>>> Regards,
>>> David Zhou
>>>> We don't map all BOs in the GART domain into the actual GART table 
>>>> to avoid all the table operations.
>>>>
>>>>> I think mgr->available should only be changed when a node is inserted or freed.
>>>>
>>>> No, on suspend/resume we need to be able to add all BOs into the 
>>>> GART table. So with your change suspend/resume can potentially fail.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> David Zhou
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Change-Id: Ia5a18a3b531a01ad7d47f40e08f778e7b94c048a
>>>>>>> Signed-off-by: Chunming Zhou <David1.Zhou at amd.com>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 11 +++++------
>>>>>>>   1 file changed, 5 insertions(+), 6 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
>>>>>>> index 69ab2ee..8a950a5 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
>>>>>>> @@ -124,6 +124,8 @@ int amdgpu_gtt_mgr_alloc(struct 
>>>>>>> ttm_mem_type_manager *man,
>>>>>>>       r = drm_mm_insert_node_in_range_generic(&mgr->mm, node, 
>>>>>>> mem->num_pages,
>>>>>>>                           mem->page_alignment, 0,
>>>>>>>                           fpfn, lpfn, sflags, aflags);
>>>>>>> +    if (!r)
>>>>>>> +        mgr->available -= mem->num_pages;
>>>>>>>       spin_unlock(&mgr->lock);
>>>>>>>         if (!r) {
>>>>>>> @@ -160,7 +162,6 @@ static int amdgpu_gtt_mgr_new(struct 
>>>>>>> ttm_mem_type_manager *man,
>>>>>>>           spin_unlock(&mgr->lock);
>>>>>>>           return 0;
>>>>>>>       }
>>>>>>> -    mgr->available -= mem->num_pages;
>>>>>>>       spin_unlock(&mgr->lock);
>>>>>>>         node = kzalloc(sizeof(*node), GFP_KERNEL);
>>>>>>> @@ -187,9 +188,6 @@ static int amdgpu_gtt_mgr_new(struct 
>>>>>>> ttm_mem_type_manager *man,
>>>>>>>         return 0;
>>>>>>>   err_out:
>>>>>>> -    spin_lock(&mgr->lock);
>>>>>>> -    mgr->available += mem->num_pages;
>>>>>>> -    spin_unlock(&mgr->lock);
>>>>>>>         return r;
>>>>>>>   }
>>>>>>> @@ -214,9 +212,10 @@ static void amdgpu_gtt_mgr_del(struct 
>>>>>>> ttm_mem_type_manager *man,
>>>>>>>           return;
>>>>>>>         spin_lock(&mgr->lock);
>>>>>>> -    if (node->start != AMDGPU_BO_INVALID_OFFSET)
>>>>>>> +    if (node->start != AMDGPU_BO_INVALID_OFFSET) {
>>>>>>>           drm_mm_remove_node(node);
>>>>>>> -    mgr->available += mem->num_pages;
>>>>>>> +        mgr->available += mem->num_pages;
>>>>>>> +    }
>>>>>>>       spin_unlock(&mgr->lock);
>>>>>>>         kfree(node);
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>