[PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

Mon Oct 9 10:59:09 UTC 2017

Hi Nicolai & Monk,

well that sounds like exactly what I wanted to hear.

When Mesa already handles ECANCELED gracefully as return code for a lost 
context then I think we can move forward with that. Monk, would that be 
ok with you?

Nicolai, how hard would it be to handle ENODEV as failure for all 
currently existing contexts?

Monk, would it be ok with you when we return ENODEV only for existing 
context when VRAM is lost and/or we have a strict mode GPU reset? E.g. 
newly created contexts would continue work as they should.

Regards,
Christian.

Am 09.10.2017 um 12:49 schrieb Nicolai Hähnle:
> Hi Monk,
>
> Yes, you're right, we're only using ECANCELED internally. But as a 
> consequence, Mesa would already handle a kernel error of ECANCELED on 
> context loss correctly :)
>
> Cheers,
> Nicolai
>
> On 09.10.2017 12:35, Liu, Monk wrote:
>> Hi Christian
>>
>> You reject some of my patches that returns -ENODEV, with the cause 
>> that MESA doesn't do the handling on -ENODEV
>>
>> But if Nicolai can confirm that MESA do have a handling on 
>> -ECANCELED, then we need to overall align our error code, on detail 
>> below IOCTL can return error code:
>>
>> Amdgpu_cs_ioctl
>> Amdgpu_cs_wait_ioctl
>> Amdgpu_cs_wait_fences_ioctl
>> Amdgpu_info_ioctl
>>
>>
>> My patches is:
>> return -ENODEV on cs_ioctl if the context is detected guilty,
>> also return -ENODEV on cs_wait|cs_wait_fences if the fence is 
>> signaled but with error -ETIME,
>> also return -ENODEV on info_ioctl so UMD can query if gpu reset 
>> happened after the process created (because for strict mode we block 
>> process instead of context)
>>
>>
>> according to Nicolai:
>>
>> amdgpu_cs_ioctl *can* return -ECANCELED, but to be frankly speaking, 
>> kernel part doesn't have any place with "-ECANCELED" so this solution 
>> on MESA side doesn't align with *current* amdgpu driver,
>> which only return 0 on success or -EINVALID on other error but 
>> definitely no "-ECANCELED" error code,
>>
>> so if we talking about community rules we shouldn't let MESA handle 
>> -ECANCELED ,  we should have a unified error code
>>
>> + Marek
>>
>> BR Monk
>>
>>
>>
>>
>> -----Original Message-----
>> From: Haehnle, Nicolai
>> Sent: 2017年10月9日 18:14
>> To: Koenig, Christian <Christian.Koenig at amd.com>; Liu, Monk 
>> <Monk.Liu at amd.com>; Nicolai Hähnle <nhaehnle at gmail.com>; 
>> amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining 
>> fences in amd_sched_entity_fini
>>
>> On 09.10.2017 10:02, Christian König wrote:
>>>> For gpu reset patches (already submitted to pub) I would make kernel
>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL)
>>>> founded as error, that way UMD would run into robust extension path
>>>> and considering the GPU hang occurred,
>>> Well that is only closed source behavior which is completely
>>> irrelevant for upstream development.
>>>
>>> As far as I know we haven't pushed the change to return -ENODEV 
>>> upstream.
>>
>> FWIW, radeonsi currently expects -ECANCELED on CS submissions and 
>> treats those as context lost. Perhaps we could use the same error on 
>> fences?
>> That makes more sense to me than -ENODEV.
>>
>> Cheers,
>> Nicolai
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 09.10.2017 um 08:42 schrieb Liu, Monk:
>>>> Christian
>>>>
>>>>> It would be really nice to have an error code set on
>>>>> s_fence->finished before it is signaled, use dma_fence_set_error()
>>>>> for this.
>>>> For gpu reset patches (already submitted to pub) I would make kernel
>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL)
>>>> founded as error, that way UMD would run into robust extension path
>>>> and considering the GPU hang occurred,
>>>>
>>>> Don't know if this is expected for the case of normal process being
>>>> killed or crashed like Nicolai hit ... since there is no gpu hang hit
>>>>
>>>>
>>>> BR Monk
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On
>>>> Behalf Of Christian K?nig
>>>> Sent: 2017年9月28日 23:01
>>>> To: Nicolai Hähnle <nhaehnle at gmail.com>;
>>>> amd-gfx at lists.freedesktop.org
>>>> Cc: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>
>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining
>>>> fences in amd_sched_entity_fini
>>>>
>>>> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle:
>>>>> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>>>
>>>>> Highly concurrent Piglit runs can trigger a race condition where a
>>>>> pending SDMA job on a buffer object is never executed because the
>>>>> corresponding process is killed (perhaps due to a crash). Since the
>>>>> job's fences were never signaled, the buffer object was effectively
>>>>> leaked. Worse, the buffer was stuck wherever it happened to be at
>>>>> the time, possibly in VRAM.
>>>>>
>>>>> The symptom was user space processes stuck in interruptible waits
>>>>> with kernel stacks like:
>>>>>
>>>>>        [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
>>>>>        [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
>>>>>        [<ffffffffbc5e82d2>]
>>>>> reservation_object_wait_timeout_rcu+0x1c2/0x300
>>>>>        [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0
>>>>> [ttm]
>>>>>        [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>>>>>        [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>>>>>        [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
>>>>>        [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>>>>>        [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470
>>>>> [amdgpu]
>>>>>        [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
>>>>>        [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140
>>>>> [amdgpu]
>>>>>        [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120
>>>>> [amdgpu]
>>>>>        [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
>>>>>        [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>>>>>        [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
>>>>>        [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
>>>>>        [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
>>>>>        [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>
>>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>>> Acked-by: Christian König <christian.koenig at amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++-
>>>>>     1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>> index 54eb77cffd9b..32a99e980d78 100644
>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct
>>>>> amd_gpu_scheduler *sched,
>>>>>                         amd_sched_entity_is_idle(entity));
>>>>>         amd_sched_rq_remove_entity(rq, entity);
>>>>>         if (r) {
>>>>>             struct amd_sched_job *job;
>>>>>             /* Park the kernel for a moment to make sure it isn't
>>>>> processing
>>>>>              * our enity.
>>>>>              */
>>>>>             kthread_park(sched->thread);
>>>>>             kthread_unpark(sched->thread);
>>>>> -        while (kfifo_out(&entity->job_queue, &job, sizeof(job)))
>>>>> +        while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
>>>>> +            struct amd_sched_fence *s_fence = job->s_fence;
>>>>> +            amd_sched_fence_scheduled(s_fence);
>>>> It would be really nice to have an error code set on
>>>> s_fence->finished before it is signaled, use dma_fence_set_error() 
>>>> for this.
>>>>
>>>> Additional to that it would be nice to note in the subject line that
>>>> this is a rather important bug fix.
>>>>
>>>> With that fixed the whole series is Reviewed-by: Christian König
>>>> <christian.koenig at amd.com>.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> + amd_sched_fence_finished(s_fence);
>>>>> +            dma_fence_put(&s_fence->finished);
>>>>>                 sched->ops->free_job(job);
>>>>> +        }
>>>>>         }
>>>>>         kfifo_free(&entity->job_queue);
>>>>>     }
>>>>>     static void amd_sched_entity_wakeup(struct dma_fence *f, struct
>>>>> dma_fence_cb *cb)
>>>>>     {
>>>>>         struct amd_sched_entity *entity =
>>>>>             container_of(cb, struct amd_sched_entity, cb);
>>>>>         entity->dependency = NULL;
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>>
>>
>