[PATCH] drm/amdgpu:fix gpu recover missing skipping(v2)

Wed Nov 8 09:45:45 UTC 2017

Am 08.11.2017 um 08:08 schrieb Monk Liu:
> if app close CTX right after IB submit, gpu recover
> will fail to find out the entity behind this guilty
> job thus lead to no job skipping for this guilty job.
>
> to fix this corner case just move the increasement of
> job->karma out of the entity iteration.
>
> v2:
> only do karma increasment if bad->s_priority != KERNEL
> because we always consider KERNEL job be correct and always
> want to recover an unfinished kernel job (sometimes kernel
> job is interrupted by VF FLR or other GPU hang event)

Good point, my rb still stands on that version.

Christian.

>
> Change-Id: I33e9e959e182d7e002a2108e565cb898acac4f9c
> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> ---
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> index 7aa6455..c999026 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> @@ -463,7 +463,8 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched, struct amd_sched_jo
>   	}
>   	spin_unlock(&sched->job_list_lock);
>   
> -	if (bad) {
> +	if (bad && bad->s_priority != AMD_SCHED_PRIORITY_KERNEL) {
> +		atomic_inc(&bad->karma);
>   		/* don't increase @bad's karma if it's from KERNEL RQ,
>   		 * becuase sometimes GPU hang would cause kernel jobs (like VM updating jobs)
>   		 * corrupt but keep in mind that kernel jobs always considered good.
> @@ -474,7 +475,7 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched, struct amd_sched_jo
>   			spin_lock(&rq->lock);
>   			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>   				if (bad->s_fence->scheduled.context == entity->fence_context) {
> -				    if (atomic_inc_return(&bad->karma) > bad->sched->hang_limit)
> +				    if (atomic_read(&bad->karma) > bad->sched->hang_limit)
>   						if (entity->guilty)
>   							atomic_set(entity->guilty, 1);
>   					break;