force app kill patch

Wed Apr 18 08:36:07 UTC 2018

> See that in “sched_entity_fini”, we only call 
> dma_fence_put(entity->last_scheduled”under the condition of “If 
> (entity->fini_status)”, so
>
> This way there is memory leak for the case of “entity->fini_stats ==0”
>
Good catch, we indeed should fix that.

> 1.Drm_sched_entity_fini(): it exit right after entity->job_queue 
> empty, [ but that time scheduler is not fast enough to deal with this 
> entity now ] 
That should never happen.

The last job from the entity->job_queue is only removed after the 
scheduler is done with the entity (at least that was the original idea, 
not sure if that still works as expected).

Regards,
Christian.

Am 18.04.2018 um 09:20 schrieb Liu, Monk:
>
> *Correctio for the scenario *
>
> After we move fence_put(entity->last_sched) out of the fini_status check:
>
> A potential race issue for the scenario:
>
> 1.Drm_sched_entity_fini(): it exit right after entity->job_queue 
> empty, [ but that time scheduler is not fast enough to deal with this 
> entity now ]
>
> 2.Drm_sched_entity_cleanup() : it call 
> dma_fence_put(entity->last_scheduled)  [ but this time 
> entity->last_scheduled  actually points to the fence prior to the real 
> last one ]
>
> 3.Scheduler_main() now dealing with this entity: it call 
> dma_fence_put(entity->last_scheduled)    [   Now this fence get double 
> put !!!  ]
>
> 4.Scheduler_main() now call dma_fence_get() on the *real* last one !
>
> So eventually the real last one fence triggers memory leak and more 
> critical the double put fence cause NULL pointer access
>
> /Monk
>
> *From:*Liu, Monk
> *Sent:* 2018年4月18日15:11
> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily 
> <Emily.Deng at amd.com>
> *Cc:* amd-gfx at lists.freedesktop.org
> *Subject:* force app kill patch
>
> Hi Christian & Emily
>
> I think the v4 fix for “fix force app kill hang”is still not good enough:
>
> First:
>
> See that in “sched_entity_fini”, we only call 
> dma_fence_put(entity->last_scheduled”under the condition of “If 
> (entity->fini_status)”, so
>
> This way there is memory leak for the case of “entity->fini_stats ==0”
>
> Second:
>
> If we move dma_fence_put(entity->last_scheduled) out of the condition 
> of “if (entity->fini_status)”, the memory leak issue can be fixed
>
> But there will be kernel NULL pointer access, I think the time you 
> call dma_fence_put(entity->last_scheduled”) may actually executed **not**
>
> On the last scheduled fence of this entity, because it is run without 
> “thread_park/unpark”pair which to make sure scheduler not dealing this 
> entity
>
> So with certain race issue, here is the scenario:
>
> 1.scheduler is doing the dma_fence_put() on the 1^st fence,
>
> 2.scheduler set entity->last_scheduled to 1^st fence
>
> 3.now sched_entity_fini() run, and it call dma_fence_put() on 
> entity->last_scheduled
>
> 4.now this 1^st fence is actually put double time and the real last 
> fence won’t get put by expected
>
> any idea?
>
> /Monk
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20180418/29157b21/attachment-0001.html>