[PATCH 1/8] drm/scheduler: properly forward fence errors
Christian König
ckoenig.leichtzumerken at gmail.com
Wed Aug 23 08:26:38 UTC 2023
This was fixed here:
commit 03877d621db082610c9b7602c6e8cd6ebcb75a8f
Author: Christian König <christian.koenig at amd.com>
Date: Thu Apr 27 14:05:43 2023 +0200
drm/scheduler: mark jobs without fence as canceled
When no hw fence is provided for a job that means that the job
didn't execute.
Signed-off-by: Christian König <christian.koenig at amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov at amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230427122726.1290170-1-christian.koenig@amd.com
It could be that the patch hasn't been merged into the internal branches yet.
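As a rough illustration of the behaviour discussed in this thread, here is a small userspace sketch, not the kernel code. struct toy_fence, toy_fence_finished() and toy_job_restart() are hypothetical stand-ins for dma_fence, drm_sched_fence_finished() and the restart decision in drm_sched_start(). It assumes the hw fence, if present, has already signaled, and that a job without a hw fence is reported as -ECANCELED, which is the error code mentioned further down in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct dma_fence: only the fields this sketch needs. */
struct toy_fence {
	bool signaled;
	int error;	/* 0 or a negative errno, set before signaling */
};

/* Models drm_sched_fence_finished(fence, result): record error, then signal. */
static void toy_fence_finished(struct toy_fence *finished, int result)
{
	if (result)
		finished->error = result;
	finished->signaled = true;
}

/*
 * Models the per-job decision when the scheduler is restarted.
 * hw_fence == NULL stands for a job whose parent fence was dropped,
 * i.e. a job that never ran on the hardware.
 */
static void toy_job_restart(struct toy_fence *finished,
			    const struct toy_fence *hw_fence)
{
	if (hw_fence)
		/* Forward whatever error the hw fence carried. */
		toy_fence_finished(finished, hw_fence->error);
	else
		/* Assumption: no hw fence means the job is marked canceled. */
		toy_fence_finished(finished, -ECANCELED);
}
```

With this model, a job restarted without a hw fence ends up with a signaled finished fence carrying -ECANCELED instead of 0, so a waiter can tell that the job did not run.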
Regards,
Christian.
On 23.08.23 at 10:12, Yin, ZhenGuo (Chris) wrote:
> [AMD Official Use Only - General]
>
> Ping..
>
> Actually, I have prepared a patch that aims to fix this issue.
> But I'm not sure whether it is appropriate for drm/scheduler.
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 9654e8942382..35dc0b86a18e 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -463,6 +463,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> &s_job->cb)) {
> dma_fence_put(s_job->s_fence->parent);
> s_job->s_fence->parent = NULL;
> + dma_fence_set_error(&s_job->s_fence->finished, -EHWPOISON);
> atomic_dec(&sched->hw_rq_count);
> } else {
> /*
> Best,
> Zhenguo
> Cloud-GPU Core team, SRDC
>
> -----Original Message-----
> From: Yin, ZhenGuo (Chris)
> Sent: Thursday, August 17, 2023 4:17 PM
> To: Christian König <ckoenig.leichtzumerken at gmail.com>; amd-gfx at lists.freedesktop.org
> Cc: Tuikov, Luben <Luben.Tuikov at amd.com>; Chen, JingWen (Wayne) <JingWen.Chen2 at amd.com>; Liu, Monk <Monk.Liu at amd.com>; Li, Chong(Alan) <chong.li at amd.com>; cao, lin <lin.cao at amd.com>
> Subject: RE: [PATCH 1/8] drm/scheduler: properly forward fence errors
>
> Hi, @Christian König
>
> Any updates for the fix?
> Recently we found that a page fault occurs after FLR, because an SDMA job in the pending list was dropped without its fence error being forwarded.
>
>
> Best,
> Zhenguo
> Cloud-GPU Core team, SRDC
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> Sent: Thursday, April 27, 2023 8:05 PM
> To: Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Tuikov, Luben <Luben.Tuikov at amd.com>; Chen, JingWen (Wayne) <JingWen.Chen2 at amd.com>; Liu, Monk <Monk.Liu at amd.com>
> Subject: Re: [PATCH 1/8] drm/scheduler: properly forward fence errors
>
> Well good point, but as part of the effort of the Intel team to move the scheduler over to a work item based design those two functions are probably about to be removed.
>
> Since we will probably have that in the internal package for a bit longer I'm going to send a fix for this.
>
> Regards,
> Christian.
>
> On 27.04.23 at 12:35, Yin, ZhenGuo (Chris) wrote:
>> [AMD Official Use Only - General]
>>
>> Hi, Christian
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index fcd4bfef7415..649fac2e1ccb 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -533,12 +533,12 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>> r = dma_fence_add_callback(fence, &s_job->cb,
>> drm_sched_job_done_cb);
>> if (r == -ENOENT)
>> - drm_sched_job_done(s_job);
>> + drm_sched_job_done(s_job, fence->error);
>> else if (r)
>> DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
>> r);
>> } else
>> - drm_sched_job_done(s_job);
>> + drm_sched_job_done(s_job, 0);
>> }
>>
>> if (full_recovery) {
>>
>> I believe that the finished fence of some jobs skipped during FLR has NOT been set to -ECANCELED.
>> In drm_sched_stop, the callback is removed from the hw fence and s_fence->parent is set to NULL; see commit 45ecaea738830b9d521c93520c8f201359dcbd95 (drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer').
>> In drm_sched_start, jobs in the pending list are then treated as done without any error (drm_sched_job_done(s_job, 0)).
>>
>>
>> Best,
>> Zhenguo
>> Cloud-GPU Core team, SRDC
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Christian König
>> Sent: Thursday, April 20, 2023 7:58 PM
>> To: amd-gfx at lists.freedesktop.org
>> Cc: Tuikov, Luben <Luben.Tuikov at amd.com>
>> Subject: [PATCH 1/8] drm/scheduler: properly forward fence errors
>>
>> When a hw fence is signaled with an error, properly forward that to the finished fence.
>>
>> Signed-off-by: Christian König <christian.koenig at amd.com>
>> ---
>> drivers/gpu/drm/scheduler/sched_entity.c | 4 +---
>> drivers/gpu/drm/scheduler/sched_fence.c | 4 +++-
>> drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++----------
>> include/drm/gpu_scheduler.h | 2 +-
>> 4 files changed, 13 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 15d04a0ec623..eaf71fe15ed3 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -144,7 +144,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>> {
>> struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>>
>> - drm_sched_fence_finished(job->s_fence);
>> + drm_sched_fence_finished(job->s_fence, -ESRCH);
>> WARN_ON(job->s_fence->parent);
>> job->sched->ops->free_job(job);
>> }
>> @@ -195,8 +195,6 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>> while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>> struct drm_sched_fence *s_fence = job->s_fence;
>>
>> - dma_fence_set_error(&s_fence->finished, -ESRCH);
>> -
>> dma_fence_get(&s_fence->finished);
>> if (!prev || dma_fence_add_callback(prev, &job->finish_cb,
>> drm_sched_entity_kill_jobs_cb))
>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
>> index 7fd869520ef2..1a6bea98c5cc 100644
>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>> @@ -53,8 +53,10 @@ void drm_sched_fence_scheduled(struct drm_sched_fence *fence)
>> dma_fence_signal(&fence->scheduled);
>> }
>>
>> -void drm_sched_fence_finished(struct drm_sched_fence *fence)
>> +void drm_sched_fence_finished(struct drm_sched_fence *fence, int result)
>> {
>> + if (result)
>> + dma_fence_set_error(&fence->finished, result);
>> dma_fence_signal(&fence->finished);
>> }
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index fcd4bfef7415..649fac2e1ccb 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -257,7 +257,7 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>> *
>> * Finish the job's fence and wake up the worker thread.
>> */
>> -static void drm_sched_job_done(struct drm_sched_job *s_job)
>> +static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>> {
>> struct drm_sched_fence *s_fence = s_job->s_fence;
>> struct drm_gpu_scheduler *sched = s_fence->sched;
>> @@ -268,7 +268,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job)
>> trace_drm_sched_process_job(s_fence);
>>
>> dma_fence_get(&s_fence->finished);
>> - drm_sched_fence_finished(s_fence);
>> + drm_sched_fence_finished(s_fence, result);
>> dma_fence_put(&s_fence->finished);
>> wake_up_interruptible(&sched->wake_up_worker);
>> }
>> @@ -282,7 +282,7 @@ static void drm_sched_job_done_cb(struct dma_fence *f, struct dma_fence_cb *cb)
>> {
>> struct drm_sched_job *s_job = container_of(cb, struct drm_sched_job, cb);
>>
>> - drm_sched_job_done(s_job);
>> + drm_sched_job_done(s_job, f->error);
>> }
>>
>> /**
>> @@ -533,12 +533,12 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>> r = dma_fence_add_callback(fence, &s_job->cb,
>> drm_sched_job_done_cb);
>> if (r == -ENOENT)
>> - drm_sched_job_done(s_job);
>> + drm_sched_job_done(s_job, fence->error);
>> else if (r)
>> DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
>> r);
>> } else
>> - drm_sched_job_done(s_job);
>> + drm_sched_job_done(s_job, 0);
>> }
>>
>> if (full_recovery) {
>> @@ -1010,15 +1010,13 @@ static int drm_sched_main(void *param)
>> r = dma_fence_add_callback(fence, &sched_job->cb,
>> drm_sched_job_done_cb);
>> if (r == -ENOENT)
>> - drm_sched_job_done(sched_job);
>> + drm_sched_job_done(sched_job, fence->error);
>> else if (r)
>> DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
>> r);
>> } else {
>> - if (IS_ERR(fence))
>> - dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
>> -
>> - drm_sched_job_done(sched_job);
>> + drm_sched_job_done(sched_job, IS_ERR(fence) ?
>> + PTR_ERR(fence) : 0);
>> }
>>
>> wake_up(&sched->job_scheduled);
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index ca857ec9e7eb..5c1df6b12ced 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -569,7 +569,7 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
>> void drm_sched_fence_free(struct drm_sched_fence *fence);
>>
>> void drm_sched_fence_scheduled(struct drm_sched_fence *fence);
>> -void drm_sched_fence_finished(struct drm_sched_fence *fence);
>> +void drm_sched_fence_finished(struct drm_sched_fence *fence, int result);
>>
>> unsigned long drm_sched_suspend_timeout(struct drm_gpu_scheduler *sched);
>> void drm_sched_resume_timeout(struct drm_gpu_scheduler *sched,
>> --
>> 2.34.1