[PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

Nicolai Hähnle nicolai.haehnle at amd.com
Mon Oct 9 10:49:19 UTC 2017


Hi Monk,

Yes, you're right, we're only using ECANCELED internally. But as a 
consequence, Mesa would already handle a kernel error of ECANCELED on 
context loss correctly :)
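
For reference, here is a minimal sketch of what that handling could look
like on the UMD side. All names here are illustrative, not radeonsi's
actual code:

    /* Hypothetical UMD-side handling of -ECANCELED from the CS submit
     * path.  submit_cs() stands in for the DRM_IOCTL_AMDGPU_CS wrapper
     * and returns the negative errno reported by the kernel. */
    #include <errno.h>
    #include <stdbool.h>

    struct umd_ctx {
        bool device_lost;  /* surfaced via the robustness extension */
    };

    static int submit_cs(struct umd_ctx *ctx)
    {
        (void)ctx;
        return -ECANCELED;  /* stub: pretend the kernel canceled the CS */
    }

    static int submit_and_check(struct umd_ctx *ctx)
    {
        int r = submit_cs(ctx);

        if (r == -ECANCELED) {
            /* Context lost: stop submitting with this context and let
             * the application observe the loss (e.g. via
             * GL_ARB_robustness) and recreate its context. */
            ctx->device_lost = true;
        }
        return r;
    }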

Cheers,
Nicolai

On 09.10.2017 12:35, Liu, Monk wrote:
> Hi Christian
> 
> You rejected some of my patches that return -ENODEV, on the grounds that Mesa doesn't handle -ENODEV.
> 
> But if Nicolai can confirm that Mesa does handle -ECANCELED, then we need to align our error codes overall. In detail, the IOCTLs below can return an error code:
> 
> amdgpu_cs_ioctl
> amdgpu_cs_wait_ioctl
> amdgpu_cs_wait_fences_ioctl
> amdgpu_info_ioctl
> 
> 
> My patches are:
> return -ENODEV from cs_ioctl if the context is detected as guilty,
> also return -ENODEV from cs_wait|cs_wait_fences if the fence is signaled but with the error -ETIME,
> also return -ENODEV from info_ioctl so the UMD can query whether a gpu reset happened after the process was created (because in strict mode we block the process instead of the context)
> 
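As a rough sketch of the cs_wait part of that proposal (illustrative only,
with a hypothetical helper name; this is not the code from the actual
patches):

    /* Sketch: convert a fence that signaled with an error into -ENODEV,
     * so the UMD takes its robustness/reset path. */
    #include <linux/dma-fence.h>
    #include <linux/errno.h>

    static int amdgpu_wait_fence_sketch(struct dma_fence *fence,
                                        signed long timeout)
    {
            long r = dma_fence_wait_timeout(fence, true, timeout);

            if (r < 0)
                    return r;       /* interrupted or other wait error */
            if (r == 0)
                    return -ETIME;  /* plain timeout, no reset involved */
            if (fence->error)       /* signaled, but carrying an error */
                    return -ENODEV; /* report the device/context as gone */
            return 0;
    }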
> 
> according to Nicolai:
> 
> amdgpu_cs_ioctl *can* return -ECANCELED, but frankly speaking, the kernel side doesn't return "-ECANCELED" anywhere, so that solution on the Mesa side doesn't align with the *current* amdgpu driver,
> which only returns 0 on success or -EINVAL on other errors, and definitely never "-ECANCELED",
> 
> so if we are talking about community rules, we shouldn't let Mesa handle -ECANCELED; we should have a unified error code
> 
> + Marek
> 
> BR Monk
> 
> -----Original Message-----
> From: Haehnle, Nicolai
> Sent: October 9, 2017 18:14
> To: Koenig, Christian <Christian.Koenig at amd.com>; Liu, Monk <Monk.Liu at amd.com>; Nicolai Hähnle <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini
> 
> On 09.10.2017 10:02, Christian König wrote:
>>> For the gpu reset patches (already submitted to pub) I would make the
>>> kernel return -ENODEV if the awaited fence (in the cs_wait or wait_fences
>>> IOCTL) is found to be in an error state; that way the UMD would take the
>>> robustness extension path and conclude that a GPU hang occurred.
>> Well, that is only closed-source behavior, which is completely
>> irrelevant for upstream development.
>>
>> As far as I know we haven't pushed the change to return -ENODEV upstream.
> 
> FWIW, radeonsi currently expects -ECANCELED on CS submissions and treats those as context lost. Perhaps we could use the same error on fences?
> That makes more sense to me than -ENODEV.
> 
> Cheers,
> Nicolai
> 
>>
>> Regards,
>> Christian.
>>
>> On 09.10.2017 at 08:42, Liu, Monk wrote:
>>> Christian
>>>
>>>> It would be really nice to have an error code set on
>>>> s_fence->finished before it is signaled, use dma_fence_set_error()
>>>> for this.
>>> For the gpu reset patches (already submitted to pub) I would make the
>>> kernel return -ENODEV if the awaited fence (in the cs_wait or wait_fences
>>> IOCTL) is found to be in an error state; that way the UMD would take the
>>> robustness extension path and conclude that a GPU hang occurred.
>>>
>>> I don't know if this is expected for the case of a normal process being
>>> killed or crashing, like the one Nicolai hit ... since no GPU hang occurred
>>> there.
>>>
>>>
>>> BR Monk
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On
>>> Behalf Of Christian König
>>> Sent: September 28, 2017 23:01
>>> To: Nicolai Hähnle <nhaehnle at gmail.com>;
>>> amd-gfx at lists.freedesktop.org
>>> Cc: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>
>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining
>>> fences in amd_sched_entity_fini
>>>
>>> On 28.09.2017 at 16:55, Nicolai Hähnle wrote:
>>>> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>>
>>>> Highly concurrent Piglit runs can trigger a race condition where a
>>>> pending SDMA job on a buffer object is never executed because the
>>>> corresponding process is killed (perhaps due to a crash). Since the
>>>> job's fences were never signaled, the buffer object was effectively
>>>> leaked. Worse, the buffer was stuck wherever it happened to be at
>>>> the time, possibly in VRAM.
>>>>
>>>> The symptom was user space processes stuck in interruptible waits
>>>> with kernel stacks like:
>>>>
>>>>        [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
>>>>        [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
>>>>        [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
>>>>        [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
>>>>        [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>>>>        [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>>>>        [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
>>>>        [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>>>>        [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
>>>>        [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
>>>>        [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
>>>>        [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
>>>>        [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
>>>>        [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>>>>        [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
>>>>        [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
>>>>        [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
>>>>        [<ffffffffffffffff>] 0xffffffffffffffff
>>>>
>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>> Acked-by: Christian König <christian.koenig at amd.com>
>>>> ---
>>>>     drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++-
>>>>     1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> index 54eb77cffd9b..32a99e980d78 100644
>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>>                         amd_sched_entity_is_idle(entity));
>>>>         amd_sched_rq_remove_entity(rq, entity);
>>>>         if (r) {
>>>>             struct amd_sched_job *job;
>>>>             /* Park the kernel for a moment to make sure it isn't processing
>>>>              * our enity.
>>>>              */
>>>>             kthread_park(sched->thread);
>>>>             kthread_unpark(sched->thread);
>>>> -        while (kfifo_out(&entity->job_queue, &job, sizeof(job)))
>>>> +        while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
>>>> +            struct amd_sched_fence *s_fence = job->s_fence;
>>>> +            amd_sched_fence_scheduled(s_fence);
>>> It would be really nice to have an error code set on
>>> s_fence->finished before it is signaled, use dma_fence_set_error() for this.
>>>
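A sketch of the loop with that suggestion applied (assuming -ECANCELED as
the error code, per the discussion above; the actual choice was still open
at this point in the thread):

    while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
            struct amd_sched_fence *s_fence = job->s_fence;

            amd_sched_fence_scheduled(s_fence);
            /* Record why the fence signaled before signaling it, so
             * waiters can distinguish cancellation from success. */
            dma_fence_set_error(&s_fence->finished, -ECANCELED);
            amd_sched_fence_finished(s_fence);
            dma_fence_put(&s_fence->finished);
            sched->ops->free_job(job);
    }
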
>>> Additional to that it would be nice to note in the subject line that
>>> this is a rather important bug fix.
>>>
>>> With that fixed the whole series is Reviewed-by: Christian König
>>> <christian.koenig at amd.com>.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> +            amd_sched_fence_finished(s_fence);
>>>> +            dma_fence_put(&s_fence->finished);
>>>>                 sched->ops->free_job(job);
>>>> +        }
>>>>         }
>>>>         kfifo_free(&entity->job_queue);
>>>>     }
>>>>     static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
>>>>     {
>>>>         struct amd_sched_entity *entity =
>>>>             container_of(cb, struct amd_sched_entity, cb);
>>>>         entity->dependency = NULL;
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
> 


