[PATCH] drm/xe: skip error capture when exec queue is killed
Matthew Brost
matthew.brost at intel.com
Thu Apr 25 16:23:27 UTC 2024
On Thu, Apr 25, 2024 at 05:59:31PM +0530, Tejas Upadhyay wrote:
> When a user closes an exec queue soon after job submission,
> we generate an error coredump. Instead, check whether the
> exec queue has been killed when the job times out; if so, skip
> the error coredump capture, just free the job and return
> the proper scheduler state.
>
> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> ---
> drivers/gpu/drm/xe/xe_guc_submit.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 93e1ee183e4a..376a2c04e899 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -971,7 +971,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> * TDR has fired before free job worker. Common if exec queue
> * immediately closed after last fence signaled.
> */
> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags)) {
> + if (exec_queue_killed(q) ||

You still need to time out the job if DMA_FENCE_FLAG_SIGNALED_BIT is
clear; otherwise the job's fence will never signal.

So it should be something like this:

- simple_error_capture(q);
- xe_devcoredump(job);
+ if (!exec_queue_killed(q)) {
+ simple_error_capture(q);
+ xe_devcoredump(job);
+ }

I think I've convinced myself that skipping the error capture is correct in
this case, e.g. if a user hits Ctrl-C on an app, we shouldn't do a job
capture on the jobs which the KMD kills.

@Rodrigo, @Jose, thoughts? I know both of you have done a bit of work here.

Matt

> + test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags)) {
> guc_exec_queue_free_job(drm_job);
>
> return DRM_GPU_SCHED_STAT_NOMINAL;
> --
> 2.25.1
>
More information about the Intel-xe mailing list