[PATCH] [RFC] drm/ttm: fix scheduling balance

Thu Jan 25 17:30:26 UTC 2018

On 25.01.2018 at 17:47, Thomas Hellstrom wrote:
> On 01/25/2018 03:57 PM, Thomas Hellstrom wrote:
>> On 01/25/2018 10:59 AM, Chunming Zhou wrote:
>>> There is a scheduling balance issue when getting a node:
>>> a. Process A allocates all available memory and uses it for submission.
>>> b. Process B tries to allocate memory and, during eviction, waits for
>>> process A's BO to become idle.
>>> c. Process A completes its job and process B's eviction frees process
>>> A's BO node, but in the meantime process C comes in to allocate a BO,
>>> gets a node directly and submits its job, so process B has to wait
>>> again, this time for process C's BO to become idle.
>>> d. As these steps repeat, process B can be delayed much longer.
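The starvation in steps a to d can be reduced to a toy model. The sketch below is hypothetical (neither the function nor the counting model comes from TTM): it assumes a single node, that every release is caused by the current holder's BO going idle, and that a newcomer such as process C always reaches the allocator before B's eviction path retries. It counts how many releases B waits for under greedy allocation versus first-come, first-served ordering.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical toy model of steps a-d, not TTM code.
 * One node exists; B is already waiting while A holds it.
 * `newcomers` processes (C, D, ...) each arrive just after a release.
 * Returns how many releases B must wait for before it owns the node.
 */
static int releases_until_b_served(bool fifo, int newcomers)
{
    assert(newcomers >= 0);
    int releases = 0;

    for (;;) {
        releases++;             /* the current holder's BO becomes idle */
        if (!fifo && newcomers > 0) {
            newcomers--;        /* greedy: the newcomer grabs the freed node */
            continue;           /* B has to evict and wait all over again */
        }
        return releases;        /* FIFO, or nobody left to jump ahead: B wins */
    }
}
```

With three newcomers, `releases_until_b_served(false, 3)` returns 4 while `releases_until_b_served(true, 3)` returns 1: under greedy allocation B's delay grows with every process that jumps ahead of it.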
>>>
>>> Add a mutex to guarantee the allocation order within the same domain.
>>> However, visible VRAM could be evicted to invisible VRAM, and the
>>> tricky part is that both are handled by the same domain manager, so
>>> this case needs special handling.
>>>
>>> Change-Id: I260e8eb704f7b4788b071d3f641f21b242912256
>>> Signed-off-by: Chunming Zhou <david1.zhou at amd.com>
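A plain mutex serializes allocations, but POSIX does not promise that a contended mutex is handed to waiters in arrival order, so the userspace sketch below uses a ticket gate to make "allocation sequence" explicit. It is an analogue under stated assumptions, not the patch itself: `struct alloc_gate`, `struct domain`, `alloc_node()`, `try_get_node()` and `evict_one()` are hypothetical names, and `evict_one()` merely stands in for waiting until some BO is idle and releasing its node.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical per-domain FIFO gate (ticket lock); not a TTM API. */
struct alloc_gate {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    unsigned long next_ticket;   /* ticket handed to the next arrival */
    unsigned long now_serving;   /* ticket currently allowed to allocate */
};

/* Hypothetical memory domain; a counter stands in for the node allocator. */
struct domain {
    struct alloc_gate gate;
    int free_nodes;
    int evictions;
};

static void domain_init(struct domain *d, int free_nodes)
{
    assert(free_nodes >= 0);
    pthread_mutex_init(&d->gate.lock, NULL);
    pthread_cond_init(&d->gate.cond, NULL);
    d->gate.next_ticket = 0;
    d->gate.now_serving = 0;
    d->free_nodes = free_nodes;
    d->evictions = 0;
}

static void gate_enter(struct alloc_gate *g)
{
    pthread_mutex_lock(&g->lock);
    unsigned long ticket = g->next_ticket++;
    while (ticket != g->now_serving)          /* strict arrival order */
        pthread_cond_wait(&g->cond, &g->lock);
    pthread_mutex_unlock(&g->lock);
}

static void gate_leave(struct alloc_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->now_serving++;
    pthread_cond_broadcast(&g->cond);         /* only the next ticket proceeds */
    pthread_mutex_unlock(&g->lock);
}

static int try_get_node(struct domain *d)
{
    if (d->free_nodes == 0)
        return -1;
    d->free_nodes--;
    return 0;
}

/* Stand-in for "wait until some BO is idle, then release its node". */
static void evict_one(struct domain *d)
{
    d->evictions++;
    d->free_nodes++;
}

/*
 * The whole "get node, else evict and retry" sequence runs inside the gate,
 * so a late arrival (process C) cannot take a node that B's eviction freed.
 * Note that the gate stays held while evict_one() would wait for the GPU.
 */
static void alloc_node(struct domain *d)
{
    gate_enter(&d->gate);
    while (try_get_node(d) != 0)
        evict_one(d);
    gate_leave(&d->gate);
}
```

Because the gate is not recursive, an allocation issued from inside `evict_one()` into the same domain (for example visible VRAM being evicted into invisible VRAM handled by the same manager) would draw a new ticket and wait forever for its own caller to leave. That is the kind of case the patch description says needs special handling. The sketch also holds the gate across a GPU wait, which is the pattern Thomas raises later in the thread.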
>>
>> I think this is a good approach; however, there are two things that
>> IMO need fixing. [...]
>
> Thinking a bit more about this, the end result would be that typical 
> "C" processes would get an unfair amount of GPU scheduling.
> Isn't it actually a scheduler's task outside of TTM to mitigate this?

Yes, that is exactly why I rejected this. I actually considered
moving the whole eviction process to a background work item.
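One way to read the background-eviction idea is that allocating processes stop evicting in their own context and instead ask a single worker to free space. The sketch below is a userspace analogue under stated assumptions: a pthread plays the role of a kernel work item (in-kernel code would typically go through the workqueue API, e.g. `queue_work()`), and `struct evict_worker`, `evict_thread()`, `worker_start()`, `worker_stop()` and `alloc_node_deferred()` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model: one worker thread performs all evictions for a domain. */
struct evict_worker {
    pthread_mutex_t lock;
    pthread_cond_t work;    /* worker waits here for eviction requests */
    pthread_cond_t done;    /* allocators wait here for freed nodes */
    int pending;            /* evictions requested but not yet performed */
    int free_nodes;
    int evicted;
    bool stop;
    pthread_t thread;
};

static void *evict_thread(void *arg)
{
    struct evict_worker *w = arg;

    pthread_mutex_lock(&w->lock);
    for (;;) {
        while (w->pending == 0 && !w->stop)
            pthread_cond_wait(&w->work, &w->lock);
        if (w->pending == 0)        /* stop requested and nothing left to do */
            break;
        w->pending--;
        pthread_mutex_unlock(&w->lock);
        /* A real implementation would wait for the victim BO to go idle here. */
        pthread_mutex_lock(&w->lock);
        w->free_nodes++;
        w->evicted++;
        pthread_cond_broadcast(&w->done);
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

static int worker_start(struct evict_worker *w, int free_nodes)
{
    assert(free_nodes >= 0);
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->work, NULL);
    pthread_cond_init(&w->done, NULL);
    w->pending = 0;
    w->free_nodes = free_nodes;
    w->evicted = 0;
    w->stop = false;
    return pthread_create(&w->thread, NULL, evict_thread, w);
}

static void worker_stop(struct evict_worker *w)
{
    pthread_mutex_lock(&w->lock);
    w->stop = true;
    pthread_cond_signal(&w->work);
    pthread_mutex_unlock(&w->lock);
    pthread_join(w->thread, NULL);
}

/* Allocators never evict themselves; they request an eviction and sleep. */
static void alloc_node_deferred(struct evict_worker *w)
{
    pthread_mutex_lock(&w->lock);
    while (w->free_nodes == 0) {
        if (w->pending == 0) {      /* make sure some eviction is in flight */
            w->pending++;
            pthread_cond_signal(&w->work);
        }
        pthread_cond_wait(&w->done, &w->lock);
    }
    w->free_nodes--;
    pthread_mutex_unlock(&w->lock);
}
```

Moving eviction into a worker does not by itself make allocation fair: a newcomer can still take a freed node before a sleeping allocator wakes up, so ordering would still have to be enforced somewhere, for instance by the scheduler Thomas mentions.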

> Further, TTM has had a design principle of avoiding holding locks while
> waiting for the GPU, with the exception of buffer object reservations.
> I think this would be the first violator, but a fairly harmless one.

At least amdgpu normally doesn't block waiting for GPU operations to finish
(with a few exceptions), but yes, I see the problem as well.
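The principle Thomas describes is usually applied by taking what is needed under the lock, dropping the lock, waiting, and revalidating afterwards. The userspace sketch below shows that pattern under stated assumptions: `struct fence` is a hypothetical stand-in for a GPU fence (in the kernel this would be a reference-counted `struct dma_fence`, kept alive across the unlocked wait), and `struct lru` with its `generation` counter is a hypothetical way to detect that the list changed while the lock was dropped.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a GPU fence. */
struct fence {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool signaled;
};

static void fence_init(struct fence *f)
{
    pthread_mutex_init(&f->lock, NULL);
    pthread_cond_init(&f->cond, NULL);
    f->signaled = false;
}

static void fence_signal(struct fence *f)
{
    pthread_mutex_lock(&f->lock);
    f->signaled = true;
    pthread_cond_broadcast(&f->cond);
    pthread_mutex_unlock(&f->lock);
}

static void fence_wait(struct fence *f)
{
    pthread_mutex_lock(&f->lock);
    while (!f->signaled)
        pthread_cond_wait(&f->cond, &f->lock);
    pthread_mutex_unlock(&f->lock);
}

/* Hypothetical LRU list head; generation is bumped on every list change. */
struct lru {
    pthread_mutex_t lock;
    struct fence *head_fence;   /* fence of the least recently used BO */
    unsigned long generation;
};

/*
 * Wait for the LRU head to go idle without holding lru->lock across the wait.
 * Returns true if the list did not change meanwhile, false if the caller
 * must look at the list again.
 */
static bool wait_lru_head_idle(struct lru *l)
{
    assert(l != NULL);
    pthread_mutex_lock(&l->lock);
    struct fence *f = l->head_fence;         /* real code would take a reference */
    unsigned long gen = l->generation;
    pthread_mutex_unlock(&l->lock);          /* never sleep on the GPU with it held */

    if (f)
        fence_wait(f);

    pthread_mutex_lock(&l->lock);
    bool unchanged = (l->generation == gen); /* revalidate after re-locking */
    pthread_mutex_unlock(&l->lock);
    return unchanged;
}
```

The cost of this pattern is the revalidation step: whatever was decided before the wait may be stale afterwards, which is precisely what a lock held across the wait (as in the proposed patch) avoids having to handle.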

>
> I can see the use for it though. It would also allow scanning the LRU 
> lists for a suitable set of buffer objects to evict, rather than 
> evicting in strict LRU order...

At least for amdgpu that won't be possible even then, because we don't
tell TTM everything about buffer placement. E.g. BOs are not necessarily
composed of contiguous allocations.
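To make the difference between strict LRU eviction and scanning concrete, here is a toy model under stated assumptions: memory is a hypothetical array of equal-sized slots, each BO's footprint is fully known to the allocator, and a request needs `need` contiguous free slots. `strict_lru_evictions()` evicts oldest-first until a large enough hole appears, while `scan_min_evictions()` tries every candidate window and picks the one that needs the fewest BOs evicted. Both names are hypothetical, and the "footprint fully known" assumption is exactly what Christian says does not hold for amdgpu.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BOS 16   /* BO indices must be below this bound */

/* slot[i] holds the index of the BO occupying slot i, or -1 if it is free. */
static int longest_free_run(const int *slot, int n)
{
    int best = 0, run = 0;
    for (int i = 0; i < n; i++) {
        run = (slot[i] < 0) ? run + 1 : 0;
        if (run > best)
            best = run;
    }
    return best;
}

/* Evict BOs strictly oldest-first (lowest age) until `need` contiguous slots
 * are free. Modifies slot[]. Returns the number evicted, or -1 on failure. */
static int strict_lru_evictions(int *slot, int n, const int *age, int nbos, int need)
{
    bool evicted[MAX_BOS] = { false };
    int count = 0;

    assert(nbos <= MAX_BOS);
    while (longest_free_run(slot, n) < need) {
        int victim = -1;
        for (int b = 0; b < nbos; b++)
            if (!evicted[b] && (victim < 0 || age[b] < age[victim]))
                victim = b;
        if (victim < 0)
            return -1;              /* everything evicted, still no hole */
        evicted[victim] = true;
        count++;
        for (int i = 0; i < n; i++)
            if (slot[i] == victim)
                slot[i] = -1;
    }
    return count;
}

/* Try every window of `need` slots and return the fewest distinct BOs that
 * must be evicted to clear one of them (-1 if need > n). Ignores LRU age. */
static int scan_min_evictions(const int *slot, int n, int need)
{
    int best = -1;

    for (int start = 0; start + need <= n; start++) {
        bool hit[MAX_BOS] = { false };
        int cost = 0;
        for (int i = start; i < start + need; i++) {
            int b = slot[i];
            assert(b < MAX_BOS);
            if (b >= 0 && !hit[b]) {
                hit[b] = true;
                cost++;
            }
        }
        if (best < 0 || cost < best)
            best = cost;
    }
    return best;
}
```

For the layout `{A, B, C, D, E, E}` with LRU ages A=0, B=2, C=1, D=3, E=4 (lower is older) and `need = 2`, strict LRU evicts A, C and then B before a two-slot hole appears, while the scan finds that evicting E alone is enough. A real policy would weigh LRU age as well as the number of victims; this model only counts victims.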

Christian.

>
> /Thomas
>