[PATCH v2 6/8] drm/etnaviv: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset

Maíra Canal mcanal at igalia.com
Mon Jun 2 11:36:24 UTC 2025


Hi Philipp,

On 02/06/25 04:28, Philipp Stanner wrote:
> On Fri, 2025-05-30 at 11:01 -0300, Maíra Canal wrote:

[...]

>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> index
>> 7146069a98492f5fab2a49d96e2054f649e1fe3d..46f5391e84a12232b247886cf13
>> 11f8e09f42f04 100644
>> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> @@ -40,11 +40,11 @@ static enum drm_gpu_sched_stat
>> etnaviv_sched_timedout_job(struct drm_sched_job
>>   	int change;
>>   
>>   	/*
>> -	 * If the GPU managed to complete this jobs fence, the
>> timout is
>> -	 * spurious. Bail out.
>> +	 * If the GPU managed to complete this jobs fence, the
>> timeout has
>> +	 * fired before free-job worker. The timeout is spurious, so
>> bail out.
>>   	 */
>>   	if (dma_fence_is_signaled(submit->out_fence))
>> -		goto out_no_timeout;
>> +		return DRM_GPU_SCHED_STAT_NO_HANG;
>>   
>>   	/*
>>   	 * If the GPU is still making forward progress on the front-
>> end (which
>> @@ -70,7 +70,7 @@ static enum drm_gpu_sched_stat
>> etnaviv_sched_timedout_job(struct drm_sched_job
>>   		gpu->hangcheck_dma_addr = dma_addr;
>>   		gpu->hangcheck_primid = primid;
>>   		gpu->hangcheck_fence = gpu->completed_fence;
>> -		goto out_no_timeout;
>> +		return DRM_GPU_SCHED_STAT_NO_HANG;
>>   	}
>>   
>>   	/* block scheduler */
>> @@ -86,10 +86,7 @@ static enum drm_gpu_sched_stat
>> etnaviv_sched_timedout_job(struct drm_sched_job
>>   	drm_sched_resubmit_jobs(&gpu->sched);
>>   
>>   	drm_sched_start(&gpu->sched, 0);
>> -	return DRM_GPU_SCHED_STAT_RESET;
>>   
>> -out_no_timeout:
>> -	list_add(&sched_job->list, &sched_job->sched->pending_list);
> 
> Here you actually remove the manipulation of the scheduler internals,
> but you didn't in v3d. Just to point that out.
> 
> 
> And BTW I'm just seeing that the pending_list gets manipulated here
> with the scheduler's workqueues running and no locks being hold.
> 
> Oh man :(
> 
> That is most certainly a bug, and I recommend that the etnaviv
> maintainers at least add the appropriate lock here and backport that
> since it can race any time.
> 
> 
> But thx for working on that, Maíra. Good that we can remove the stuff
> this way.
> 
> Thinking about it, this patch even fixes a bug. So could contain a
> Fixes: tag. But I'm not sure if it's worth it to mark the entire series
> for Stable. Opinions?

I believe the best way to fix this bug would be doing something similar
to what I did to v3d: send a temporary fix adding the locks, which will
be backported to stable. I'll send a fix to Etnaviv today.

Thanks for the review, Phillip!

Best Regards,
- Maíra

> 
> 
> P.
> 
> 
>>   	return DRM_GPU_SCHED_STAT_RESET;
>>   }
>>   
>>
> 



More information about the Intel-xe mailing list