[PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

Liu, Monk Monk.Liu at amd.com
Tue Oct 10 04:00:26 UTC 2017


What about the plan below?

1. ECANCELED is returned by the kernel when a guilty context submits jobs, in order to prevent the guilty context from submitting anything more ... or to prevent all contexts from submitting if VRAM loss happened.
2. ENODEV is returned by the kernel from the cs_wait/wait_fences IOCTLs, so the UMD is notified that the fence it is waiting on will never really complete but was only fake-signaled due to a GPU hang.
   If you are not happy with ENODEV here, maybe ETIME/EBADFD?
3. ENODEV is returned by the kernel from amdgpu_info_ioctl, so the UMD can call the info IOCTL to learn that the GPU no longer works; after the GPU reset is done, the info IOCTL works as expected again.
   The info IOCTL is used here because it doesn't need a context as an input parameter: the VK UMD sometimes wants to query GPU health without a context at hand, only with an FD/device. (A rough UMD-side sketch follows below.)
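
Something like this on the UMD side (hedged sketch only; the umd_* helpers and the ctx fields are made-up stand-ins for the cs, cs_wait_fences and info IOCTL paths above, not real libdrm entry points):

    int r = umd_submit_ib(ctx, ib);
    if (r == -ECANCELED)
            ctx->lost = true;               /* guilty context / VRAM lost: stop submitting */

    r = umd_wait_fence(ctx, fence, timeout_ns);
    if (r == -ENODEV)                       /* or ETIME/EBADFD, see point 2 */
            ctx->lost = true;               /* fence was only fake-signaled after a hang */

    r = umd_query_gpu_state(fd);            /* no context needed, only the FD/device */
    if (r == -ENODEV)
            device_needs_recovery = true;   /* works again once the GPU reset is done */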


BR Monk



-----Original Message-----
From: Koenig, Christian 
Sent: 9 October 2017 20:34
To: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Liu, Monk <Monk.Liu at amd.com>; Nicolai Hähnle <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org; Olsak, Marek <Marek.Olsak at amd.com>
Cc: Li, Bingley <Bingley.Li at amd.com>
Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

Am 09.10.2017 um 13:27 schrieb Nicolai Hähnle:
> On 09.10.2017 13:12, Christian König wrote:
>>>
>>>> Nicolai, how hard would it be to handle ENODEV as failure for all 
>>>> currently existing contexts?
>>>
>>> Impossible? "All currently existing contexts" is not a well-defined 
>>> concept when multiple drivers co-exist in the same process.
>>
>> Ok, let me refine the question: I assume there are resources "shared" 
>> between contexts like binary shader code for example which needs to 
>> be reuploaded when VRAM is lost.
>>
>> How hard would it be to handle that correctly?
>
> Okay, that makes more sense :)
>
> With the current interface it's still pretty difficult, but if we 
> could get a new per-device query ioctl which returns a "VRAM loss 
> counter", it would be reasonably straight-forward.

The problem with the VRAM lost counter is that this isn't safe either. 
E.g. you could have an application which uploads shaders, then a GPU reset happens and VRAM is lost, and then the application creates a new context and makes a submission with broken shader binaries.

So I would still vote for a separate IOCTL to reset the VRAM lost state, which is called *before* user space starts to re-upload shaders/descriptors etc.

This way you also catch the case where another reset happens while you re-upload things.
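
Roughly the sequence I have in mind (pure sketch; the ioctl and the helper names below are made up, they don't exist today):

    /* Sketch only: DRM_IOCTL_AMDGPU_VRAM_LOST_ACK and
     * reupload_shaders_and_descriptors() are hypothetical names. */
    void recover_from_vram_lost(int fd, struct screen *screen)
    {
            /* Acknowledge the loss *before* re-uploading, so a reset that
             * happens during the re-upload shows up as a new loss. */
            drmIoctl(fd, DRM_IOCTL_AMDGPU_VRAM_LOST_ACK, NULL);

            reupload_shaders_and_descriptors(screen);

            /* If submissions still fail with the VRAM-lost error afterwards,
             * another reset happened in between and we simply do this again. */
    }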

>
>>> And what would be the purpose of this? If it's to support VRAM loss, 
>>> having a per-context VRAM loss counter would enable each context to 
>>> signal ECANCELED separately.
>>
>> I thought of that on top of the -ENODEV handling.
>>
>> In other words when we see -ENODEV we call an IOCTL to let the kernel 
>> know we noticed that something is wrong and then reinit all shared 
>> resources in userspace.
>>
>> All existing contexts will still see -ECANCELED when we drop their 
>> command submission, but new contexts would at least not cause a new 
>> lockup immediately because their shader binaries are corrupted.
>
> I don't think we need -ENODEV for this. We just need -ECANCELED to be 
> returned when a submission is rejected due to reset (hang or VRAM loss).
>
> Mesa would keep a shadow of the VRAM loss counter in pipe_screen and 
> pipe_context, and query the kernel's counter when it encounters 
> -ECANCELED. Each context would then know to drop the CS it's built up 
> so far and restart based on comparing the VRAM loss counter of 
> pipe_screen to that of pipe_context, and similarly we could keep a 
> copy of the VRAM loss counter for important buffer objects like shader 
> binaries, descriptors, etc.
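>
> Very roughly, as a sketch (the kernel counter query and the helper names
> are hypothetical; pipe_screen/pipe_context are the real gallium structs,
> but the counter fields would be additions):
>
>     if (r == -ECANCELED) {
>             uint32_t count = query_vram_lost_counter(screen);
>
>             /* Drop the CS we built up so far and restart it. */
>             discard_current_cs(ctx);
>
>             if (count != ctx->vram_lost_counter) {
>                     /* VRAM was lost: re-upload shaders, descriptors, etc.
>                      * whose stored counter is older than 'count'. */
>                     reupload_stale_resources(screen, count);
>                     ctx->vram_lost_counter = count;
>             }
>     }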
>
> This seems more robust to me than relying only on an ENODEV. We'd most 
> likely keep some kind of VRAM loss counter in Mesa *anyway* (we don't 
> maintain a list of all shaders, for example, and we can't nuke 
> important per-context state across threads), and synthesizing such a counter 
> from ENODEVs is not particularly robust (what if multiple ENODEVs 
> occur for the same loss event?).
>
> BTW, I still don't like ENODEV. It seems more like the kind of error 
> code you'd return with hot-pluggable GPUs where the device can 
> physically disappear...

Yeah, ECANCELED sounds like a better alternative. But I think we should still somehow communicate the fatality of losing VRAM to userspace.

How about ENODATA or EBADFD?

Regards,
Christian.

>
> Cheers,
> Nicolai
>
>
>>
>> Regards,
>> Christian.
>>
>> Am 09.10.2017 um 13:04 schrieb Nicolai Hähnle:
>>> On 09.10.2017 12:59, Christian König wrote:
>>>> Nicolai, how hard would it be to handle ENODEV as failure for all 
>>>> currently existing contexts?
>>>
>>> Impossible? "All currently existing contexts" is not a well-defined 
>>> concept when multiple drivers co-exist in the same process.
>>>
>>> And what would be the purpose of this? If it's to support VRAM loss, 
>>> having a per-context VRAM loss counter would enable each context to 
>>> signal ECANCELED separately.
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>
>>>>
>>>> Monk, would it be ok with you if we return ENODEV only for existing 
>>>> contexts when VRAM is lost and/or we have a strict mode GPU reset? 
>>>> E.g. newly created contexts would continue to work as they should.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 09.10.2017 um 12:49 schrieb Nicolai Hähnle:
>>>>> Hi Monk,
>>>>>
>>>>> Yes, you're right, we're only using ECANCELED internally. But as a 
>>>>> consequence, Mesa would already handle a kernel error of ECANCELED 
>>>>> on context loss correctly :)
>>>>>
>>>>> Cheers,
>>>>> Nicolai
>>>>>
>>>>> On 09.10.2017 12:35, Liu, Monk wrote:
>>>>>> Hi Christian
>>>>>>
>>>>>> You rejected some of my patches that return -ENODEV, on the grounds 
>>>>>> that Mesa doesn't handle -ENODEV.
>>>>>>
>>>>>> But if Nicolai can confirm that Mesa does handle -ECANCELED, then we 
>>>>>> need to align our error codes overall. In detail, the following 
>>>>>> IOCTLs can return error codes:
>>>>>>
>>>>>> amdgpu_cs_ioctl
>>>>>> amdgpu_cs_wait_ioctl
>>>>>> amdgpu_cs_wait_fences_ioctl
>>>>>> amdgpu_info_ioctl
>>>>>>
>>>>>>
>>>>>> My patches do the following:
>>>>>> return -ENODEV from cs_ioctl if the context is detected as guilty; 
>>>>>> also return -ENODEV from cs_wait|cs_wait_fences if the fence is 
>>>>>> signaled but carries the error -ETIME; also return -ENODEV from 
>>>>>> info_ioctl so the UMD can query whether a GPU reset happened after 
>>>>>> the process was created (because in strict mode we block the process 
>>>>>> instead of the context).
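>>>>>>
>>>>>> Roughly, on the cs_ioctl side it looks like this (simplified sketch; 
>>>>>> the helper name is made up, not the exact code in my patches):
>>>>>>
>>>>>>     /* sketch: amdgpu_ctx_is_guilty() stands in for the real check */
>>>>>>     if (amdgpu_ctx_is_guilty(ctx))
>>>>>>             return -ENODEV;    /* or -ECANCELED, depending on what we agree on */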
>>>>>>
>>>>>>
>>>>>> According to Nicolai:
>>>>>>
>>>>>> amdgpu_cs_ioctl *can* return -ECANCELED, but frankly speaking, the 
>>>>>> kernel side doesn't return "-ECANCELED" anywhere today, so this 
>>>>>> solution on the Mesa side doesn't align with the *current* amdgpu 
>>>>>> driver, which only returns 0 on success or -EINVAL on error, but 
>>>>>> definitely never "-ECANCELED".
>>>>>>
>>>>>> So if we are talking about community rules, we shouldn't let Mesa 
>>>>>> handle -ECANCELED; we should have a unified error code.
>>>>>>
>>>>>> + Marek
>>>>>>
>>>>>> BR Monk
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Haehnle, Nicolai
>>>>>> Sent: 9 October 2017 18:14
>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Liu, Monk 
>>>>>> <Monk.Liu at amd.com>; Nicolai Hähnle <nhaehnle at gmail.com>; 
>>>>>> amd-gfx at lists.freedesktop.org
>>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining 
>>>>>> fences in amd_sched_entity_fini
>>>>>>
>>>>>> On 09.10.2017 10:02, Christian König wrote:
>>>>>>>> For the GPU reset patches (already submitted publicly) I would make 
>>>>>>>> the kernel return -ENODEV if the fence being waited on (in the 
>>>>>>>> cs_wait or wait_fences IOCTL) is found to carry an error; that way 
>>>>>>>> the UMD would take the robustness extension path and assume a GPU 
>>>>>>>> hang occurred.
>>>>>>> Well that is only closed source behavior which is completely 
>>>>>>> irrelevant for upstream development.
>>>>>>>
>>>>>>> As far as I know we haven't pushed the change to return -ENODEV 
>>>>>>> upstream.
>>>>>>
>>>>>> FWIW, radeonsi currently expects -ECANCELED on CS submissions and 
>>>>>> treats those as context lost. Perhaps we could use the same error 
>>>>>> on fences?
>>>>>> That makes more sense to me than -ENODEV.
>>>>>>
>>>>>> Cheers,
>>>>>> Nicolai
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 09.10.2017 um 08:42 schrieb Liu, Monk:
>>>>>>>> Christian
>>>>>>>>
>>>>>>>>> It would be really nice to have an error code set on 
>>>>>>>>> s_fence->finished before it is signaled, use
>>>>>>>>> dma_fence_set_error()
>>>>>>>>> for this.
>>>>>>>> For the GPU reset patches (already submitted publicly) I would make 
>>>>>>>> the kernel return -ENODEV if the fence being waited on (in the 
>>>>>>>> cs_wait or wait_fences IOCTL) is found to carry an error; that way 
>>>>>>>> the UMD would take the robustness extension path and assume a GPU 
>>>>>>>> hang occurred.
>>>>>>>>
>>>>>>>> I don't know if this is expected for the case of a normal process 
>>>>>>>> being killed or crashing, like Nicolai hit ... since no GPU hang 
>>>>>>>> occurred there.
>>>>>>>>
>>>>>>>>
>>>>>>>> BR Monk
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On
>>>>>>>> Behalf Of Christian König
>>>>>>>> Sent: 28 September 2017 23:01
>>>>>>>> To: Nicolai Hähnle <nhaehnle at gmail.com>;
>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>> Cc: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>
>>>>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining
>>>>>>>> fences in amd_sched_entity_fini
>>>>>>>>
>>>>>>>> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle:
>>>>>>>>> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>>>>>>>
>>>>>>>>> Highly concurrent Piglit runs can trigger a race condition 
>>>>>>>>> where a
>>>>>>>>> pending SDMA job on a buffer object is never executed because the
>>>>>>>>> corresponding process is killed (perhaps due to a crash). 
>>>>>>>>> Since the
>>>>>>>>> job's fences were never signaled, the buffer object was 
>>>>>>>>> effectively
>>>>>>>>> leaked. Worse, the buffer was stuck wherever it happened to be at
>>>>>>>>> the time, possibly in VRAM.
>>>>>>>>>
>>>>>>>>> The symptom was user space processes stuck in interruptible waits
>>>>>>>>> with kernel stacks like:
>>>>>>>>>
>>>>>>>>>        [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
>>>>>>>>>        [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
>>>>>>>>>        [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
>>>>>>>>>        [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
>>>>>>>>>        [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>>>>>>>>>        [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>>>>>>>>>        [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
>>>>>>>>>        [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>>>>>>>>>        [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
>>>>>>>>>        [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
>>>>>>>>>        [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
>>>>>>>>>        [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
>>>>>>>>>        [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
>>>>>>>>>        [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>>>>>>>>>        [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
>>>>>>>>>        [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
>>>>>>>>>        [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
>>>>>>>>>        [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>>>
>>>>>>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>>>>>>> Acked-by: Christian König <christian.koenig at amd.com>
>>>>>>>>> ---
>>>>>>>>>     drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++-
>>>>>>>>>     1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>>>>> index 54eb77cffd9b..32a99e980d78 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>>>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>>>>>>>                 amd_sched_entity_is_idle(entity));
>>>>>>>>>         amd_sched_rq_remove_entity(rq, entity);
>>>>>>>>>         if (r) {
>>>>>>>>>             struct amd_sched_job *job;
>>>>>>>>>             /* Park the kernel for a moment to make sure it isn't processing
>>>>>>>>>              * our enity.
>>>>>>>>>              */
>>>>>>>>>             kthread_park(sched->thread);
>>>>>>>>>             kthread_unpark(sched->thread);
>>>>>>>>> -        while (kfifo_out(&entity->job_queue, &job, sizeof(job)))
>>>>>>>>> +        while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
>>>>>>>>> +            struct amd_sched_fence *s_fence = job->s_fence;
>>>>>>>>> +            amd_sched_fence_scheduled(s_fence);
>>>>>>>> It would be really nice to have an error code set on
>>>>>>>> s_fence->finished before it is signaled, use 
>>>>>>>> dma_fence_set_error() for this.
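>>>>>>>> E.g. something like this in the new loop (untested, and which error
>>>>>>>> code we use there is still up for discussion):
>>>>>>>>
>>>>>>>>     dma_fence_set_error(&s_fence->finished, -ECANCELED);
>>>>>>>>     amd_sched_fence_scheduled(s_fence);
>>>>>>>>     amd_sched_fence_finished(s_fence);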
>>>>>>>>
>>>>>>>> In addition to that, it would be nice to note in the subject line 
>>>>>>>> that this is a rather important bug fix.
>>>>>>>>
>>>>>>>> With that fixed the whole series is Reviewed-by: Christian König
>>>>>>>> <christian.koenig at amd.com>.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> +            amd_sched_fence_finished(s_fence);
>>>>>>>>> +            dma_fence_put(&s_fence->finished);
>>>>>>>>>                 sched->ops->free_job(job);
>>>>>>>>> +        }
>>>>>>>>>         }
>>>>>>>>>     kfifo_free(&entity->job_queue);
>>>>>>>>>     }
>>>>>>>>>     static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
>>>>>>>>>     {
>>>>>>>>>         struct amd_sched_entity *entity =
>>>>>>>>>             container_of(cb, struct amd_sched_entity, cb);
>>>>>>>>>         entity->dependency = NULL;
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> amd-gfx mailing list
>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


