[RFC PATCH 2/4] drm/scheduler: implement a function to modify sched list

Fri Feb 28 07:47:10 UTC 2020

Am 28.02.20 um 06:08 schrieb Luben Tuikov:
> On 2020-02-27 4:40 p.m., Nirmoy Das wrote:
>> implement drm_sched_entity_modify_sched() which can modify existing
>> sched_list with a different one. This is going to be helpful when
>> userspace changes priority of a ctx/entity then driver can switch to
>> corresponding hw shced list for that priority
>>
>> Signed-off-by: Nirmoy Das <nirmoy.das at amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_entity.c | 24 ++++++++++++++++++++++++
>>   include/drm/gpu_scheduler.h              |  4 ++++
>>   2 files changed, 28 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 63bccd201b97..711e9d504bcb 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -83,6 +83,30 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>   }
>>   EXPORT_SYMBOL(drm_sched_entity_init);
>>   
>> +/**
>> + * drm_sched_entity_modify_sched - Modify sched of an entity
>> + *
>> + * @entity: scheduler entity to init
>> + * @sched_list: the list of new drm scheds which will replace
>> + *		existing entity->sched_list
>> + * @num_sched_list: number of drm sched in sched_list
>> + *
>> + * Returns 0 on success or a negative error code on failure.
>> + */
>> +int drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>> +				  struct drm_gpu_scheduler **sched_list,
>> +				  unsigned int num_sched_list)
>> +{
>> +	if (!(entity && sched_list && (num_sched_list == 0 || sched_list[0])))
>> +		return -EINVAL;
> This seems unmaintainable. I'd write it in its natural form:

This is probably just copy & pasted from the init function and complete 
overkill here.

>
> int drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> 				  struct drm_gpu_scheduler **sched_list,
> 				  unsigned int num_sched_list)
> {
> 	if (entity && sched_list && (num_sched_list == 0 || sched_list[0] != NULL)) {
> 		entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
> 		entity->num_sched_list = num_sched_list;
> 		return 0;
> 	} else {
> 		return -EINVAL;
> 	}
> }

Actually that's a rather bad idea. Error handling should always be in 
the form of:

if (check_error || missing_prerequisite)
     return_or_goto_cleanup;

> That's too heavy. Can we improve the architecture
> so we don't have to check for this in leaf functions like this one?
> We can just return a parameterization.
>
> Why would this be called with entity being NULL?
> Or with sched_list being NULL? Or num_sched_list being not zero
> yet sched_list[0] being NULL? Why not make sure that sched_list[0] is
> never NULL and that num_sched_list is greater than 0 always?
>
> Does this make it to user space?
> Why would the state of execution be one such that this is true/false
> for the code to return -EINVAL?
>  From patch 3/4 it seems that an error is printed inside amdgpu_ctx_priority_override()
> and the result is not propagated up the stack.
>
> I think we should improve the code where here this condition above
> is never true. Then we can use parameterization for those two
> statements below:
>
>> +
>> +	entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
> So if we're here, we know from the top conditional that
> either num_sched_list is 0 or that sched_list[0] not NULL
> or both.
>
> So if num_sched_list is 0 or 1 we return NULL?
> And if num_sched_list is 2 or greater we return sched_list
> of which sched_list[0] could be NULL?

This is also copy&pasted from the init code and completely wrong here.

What we should do instead is just: WARN_ON(!num_sched_list || !sched_list);

And to the checking for not keeping around the scheduler list in the 
init function.

> Why not fix the architecture so that this is simply copied?

We had that and moved away from it because the scheduler list is 
actually const and shouldn't be allocated with each entity (which we can 
easily have thousands of).

Regards,
Christian.

>   (in which case
> we can simply index-parameterize it and simply copy it.
> Why are there so many checks everywhere?
>
>> +	entity->num_sched_list = num_sched_list;
>> +
> I mean, all we're doing in this function is initializing
> entity->sched_list and entity->num_sched_list. Why does this
> function have to be so complex and do so many checks?
> Can we improve/fix the architecture instead?
>
> Regards,
> Luben
>
>
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(drm_sched_entity_modify_sched);
>> +
>>   /**
>>    * drm_sched_entity_is_idle - Check if entity is idle
>>    *
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 589be851f8a1..0c164a96d51b 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -297,6 +297,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched);
>>   int drm_sched_job_init(struct drm_sched_job *job,
>>   		       struct drm_sched_entity *entity,
>>   		       void *owner);
>> +int drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>> +				  struct drm_gpu_scheduler **sched_list,
>> +                                  unsigned int num_sched_list);
>> +
>>   void drm_sched_job_cleanup(struct drm_sched_job *job);
>>   void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>>   void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>>