[PATCH] drm/xe: skip error capture when exec queue is killed

Matthew Brost matthew.brost at intel.com
Thu Apr 25 16:23:27 UTC 2024


On Thu, Apr 25, 2024 at 05:59:31PM +0530, Tejas Upadhyay wrote:
> When user closes exec queue soon after job submission,
> we are generating error coredump. Instead check if
> exec queue is killed during job timeout then skip
> error coredump capture, just free the job and return
> proper scheduler state.
> 
> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> ---
>  drivers/gpu/drm/xe/xe_guc_submit.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 93e1ee183e4a..376a2c04e899 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -971,7 +971,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>  	 * TDR has fired before free job worker. Common if exec queue
>  	 * immediately closed after last fence signaled.
>  	 */
> -	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags)) {
> +	if (exec_queue_killed(q) || 

You still need to time out the job if DMA_FENCE_FLAG_SIGNALED_BIT is
clear, otherwise the fence will never signal.

So it should be something like this:

-       simple_error_capture(q);
-       xe_devcoredump(job);
+       if (!exec_queue_killed(q)) {
+               simple_error_capture(q);
+               xe_devcoredump(job);
+       }

I think I've convinced myself that skipping the error capture is correct
in this case, e.g. if a user hits ctrl-c on an app, we shouldn't do a job
capture on the jobs which the KMD kills.
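
To make the intended control flow concrete, here is a rough standalone
sketch. It is not driver code: the toy_* types, fields and return values
are made up for illustration and only model the ordering discussed above
(early free when the fence has already signaled, capture only when the
queue was not killed, and the timeout always completing the fence).

```c
#include <stdbool.h>

/* Toy stand-ins, not the real xe / drm_sched structures. */
struct toy_queue {
	bool killed;	/* set when the KMD kills the queue */
	int captures;	/* counts error captures taken */
};

struct toy_job {
	struct toy_queue *q;
	bool fence_signaled;
	bool freed;
};

/* Hypothetical result codes, only used by this sketch. */
enum toy_result { TOY_FREED_EARLY, TOY_TIMED_OUT };

static enum toy_result toy_timedout_job(struct toy_job *job)
{
	/* Fence already signaled: nothing to time out, just free the job. */
	if (job->fence_signaled) {
		job->freed = true;
		return TOY_FREED_EARLY;
	}

	/* Skip the capture only when the queue was killed. */
	if (!job->q->killed)
		job->q->captures++;

	/* The timeout path still runs, so the fence always signals. */
	job->fence_signaled = true;
	return TOY_TIMED_OUT;
}
```

With a killed queue and an unsignaled fence, this takes no capture but
still signals the fence, which is the case the original hunk would have
short-circuited.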

@Rodrigo, @Jose, thoughts? I know both of you have done a bit of work here.

Matt

> +	    test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags)) {
>  		guc_exec_queue_free_job(drm_job);
>  
>  		return DRM_GPU_SCHED_STAT_NOMINAL;
> -- 
> 2.25.1
> 

