[RFC PATCH] drm/amdgpu: allocate entities on demand

Mon Dec 2 14:43:39 UTC 2019

On 11/29/19 7:42 PM, Christian KÃ¶nig wrote:
> Am 29.11.19 um 15:29 schrieb Nirmoy:
>> Hi Christian,
>>
>> On 11/26/19 10:45 AM, Christian KÃ¶nig wrote:
>>> It looks like a start, but there numerous things which needs to be 
>>> fixed.
>>>
>>> Question number one is: What's that good for? Entities are not the 
>>> problem here. The real issue is the fence ring and the rq_list.
>>>
>>> The rq_list could actually be made constant since it should never be 
>>> changed by the entity. It is only changed for backward compatibility 
>>> in drm_sched_entity_set_priority().
>>>
>>> So I would start there and cleanup the 
>>> drm_sched_entity_set_priority() to actually just set a new constant 
>>> rq list instead.
>>
>> I am missing some context here. Can you please explain bit more? I 
>> looked over and over again but I still don't understand what do you 
>> mean byÂ  new constant rq list :/
>
> Ok that needs a bit wider explanation.
>
> The GPU scheduler consists mainly of drm_gpu_scheduler instances. Each 
> of those instances contain multiple runqueues with different 
> priorities (5 IIRC).
>
> Now for each entity we give a list of runqueues where this entity can 
> be served on, e.g. where the jobs which are pushed to the entities are 
> executed.
>
> The entity itself keeps a copy of that runqueue list because we have 
> the drm_sched_entity_set_priority() which modifies this runqueue list.
>
> But essentially that is complete overkill, the runqueue lists are 
> constant for each amdgpu device, e.g. all contexts should use SDMA0 
> and SDMA1 in the same way.
>
> In other words building the list on runqueues should happen only once 
> and not for each contexts.
Okay I understand now the real problem. Thanks for detail explanation.
>
> Multiple approach to fix this would be possible. One rather elegant 
> solution would be to change the rq list into a scheduler instances 
> list + priority.

Do you mean something like

diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 684692a8ed76..ac67f8f098fa 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -81,7 +81,7 @@ enum drm_sched_priority {
 Â struct drm_sched_entity {
 Â Â Â Â Â Â Â  struct list_headÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â  list;
 Â Â Â Â Â Â Â  struct drm_sched_rqÂ Â Â Â Â Â Â Â Â Â Â Â  *rq;
-Â Â Â Â Â Â  struct drm_sched_rqÂ Â Â Â Â Â Â Â Â Â Â Â  **rq_list;
+Â Â Â Â Â  struct drm_gpu_schedulerÂ Â Â Â Â Â Â  **sched;
 Â Â Â Â Â Â Â  unsigned intÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  num_rq_list;
 Â Â Â Â Â Â Â  spinlock_tÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  rq_lock;

>
>
> This way we would also fix the age old bug that changing the priority 
> of a context could actually mess up already scheduled jobs.
>
> The alternative I noted before would be to drop 
> drm_sched_entity_set_priority() or change it into 
> drm_sched_entity_set_runqueues().
I was working on it but then I got stuck by aÂ  "BUG: kernel NULL pointer 
dereference, address:" which I am trying to figure out why
>
> Regards,
> Christian.
>
>>
>>>
>>> Then we could embed the fences in amdgpu_ctx_entity as dynamic array 
>>> at the end of the structure.
>>>
>>> And last we can start to dynamic allocate and initialize the 
>>> amdgpu_ctx_entity() structures.
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>