[RFC PATCH] drm/amdgpu: allocate entities on demand

Mon Dec 2 14:59:09 UTC 2019

Am 02.12.19 um 15:43 schrieb Nirmoy:
>
> On 11/29/19 7:42 PM, Christian KÃ¶nig wrote:
>> Am 29.11.19 um 15:29 schrieb Nirmoy:
>>> Hi Christian,
>>>
>>> On 11/26/19 10:45 AM, Christian KÃ¶nig wrote:
>>>> It looks like a start, but there numerous things which needs to be 
>>>> fixed.
>>>>
>>>> Question number one is: What's that good for? Entities are not the 
>>>> problem here. The real issue is the fence ring and the rq_list.
>>>>
>>>> The rq_list could actually be made constant since it should never 
>>>> be changed by the entity. It is only changed for backward 
>>>> compatibility in drm_sched_entity_set_priority().
>>>>
>>>> So I would start there and cleanup the 
>>>> drm_sched_entity_set_priority() to actually just set a new constant 
>>>> rq list instead.
>>>
>>> I am missing some context here. Can you please explain bit more? I 
>>> looked over and over again but I still don't understand what do you 
>>> mean byÂ  new constant rq list :/
>>
>> Ok that needs a bit wider explanation.
>>
>> The GPU scheduler consists mainly of drm_gpu_scheduler instances. 
>> Each of those instances contain multiple runqueues with different 
>> priorities (5 IIRC).
>>
>> Now for each entity we give a list of runqueues where this entity can 
>> be served on, e.g. where the jobs which are pushed to the entities 
>> are executed.
>>
>> The entity itself keeps a copy of that runqueue list because we have 
>> the drm_sched_entity_set_priority() which modifies this runqueue list.
>>
>> But essentially that is complete overkill, the runqueue lists are 
>> constant for each amdgpu device, e.g. all contexts should use SDMA0 
>> and SDMA1 in the same way.
>>
>> In other words building the list on runqueues should happen only once 
>> and not for each contexts.
> Okay I understand now the real problem. Thanks for detail explanation.
>>
>> Multiple approach to fix this would be possible. One rather elegant 
>> solution would be to change the rq list into a scheduler instances 
>> list + priority.
>
> Do you mean something like
>
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 684692a8ed76..ac67f8f098fa 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -81,7 +81,7 @@ enum drm_sched_priority {
> Â struct drm_sched_entity {
> Â Â Â Â Â Â Â  struct list_headÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â  list;
> Â Â Â Â Â Â Â  struct drm_sched_rqÂ Â Â Â Â Â Â Â Â Â Â Â  *rq;
> -Â Â Â Â Â Â  struct drm_sched_rqÂ Â Â Â Â Â Â Â Â Â Â Â  **rq_list;
> +Â Â Â Â Â  struct drm_gpu_schedulerÂ Â Â Â Â Â Â  **sched;
> Â Â Â Â Â Â Â  unsigned intÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  num_rq_list;
> Â Â Â Â Â Â Â  spinlock_tÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  rq_lock;

Yes, exactly. Problem is that I'm not 100% sure if that really works 
with all users of the rq_list.

Regards,
Christian.

>
>>
>>
>> This way we would also fix the age old bug that changing the priority 
>> of a context could actually mess up already scheduled jobs.
>>
>> The alternative I noted before would be to drop 
>> drm_sched_entity_set_priority() or change it into 
>> drm_sched_entity_set_runqueues().
> I was working on it but then I got stuck by aÂ  "BUG: kernel NULL 
> pointer dereference, address:" which I am trying to figure out why
>>
>> Regards,
>> Christian.
>>
>>>
>>>>
>>>> Then we could embed the fences in amdgpu_ctx_entity as dynamic 
>>>> array at the end of the structure.
>>>>
>>>> And last we can start to dynamic allocate and initialize the 
>>>> amdgpu_ctx_entity() structures.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>
>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx