force app kill patch
Christian König
ckoenig.leichtzumerken at gmail.com
Wed Apr 18 09:03:56 UTC 2018
On 18.04.2018 at 11:00, Liu, Monk wrote:
>
> 1. drm_sched_entity_fini(): it exits right after entity->job_queue is
> empty [ but at that time the scheduler is not fast enough to have
> dealt with this entity yet ]
>
> That should never happen.
>
> The last job from the entity->job_queue is only removed after the
> scheduler is done with the entity (at least that was the original
> idea, not sure if that still works as expected).
>
> [ML] No, that’s not true; we already caught a kernel NULL pointer
> issue where an entity->last_scheduled fence got a double put, exactly
> the way I described in the scenario …
>
> Pixel already fixed it by moving the put/get pair on
> entity->last_scheduled prior to spsc_queue_pop(), so the race is
> avoided (see the sketch below).
>
Yeah, already seen and reviewed that. That's a good catch, please make
sure that this gets pushed to amd-staging-drm-next ASAP.
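
For reference, a minimal sketch of that reordering, assuming the
drm_sched_entity_pop_job() shape of the scheduler at the time (helper
and field names follow the upstream code of that period, and dependency
handling is elided, so treat this as an illustration rather than the
actual patch):

    static struct drm_sched_job *
    drm_sched_entity_pop_job(struct drm_sched_entity *entity)
    {
            struct drm_sched_job *sched_job =
                    to_drm_sched_job(spsc_queue_peek(&entity->job_queue));

            if (!sched_job)
                    return NULL;

            /* ... dependency handling elided ... */

            /* Put/get pair moved prior to spsc_queue_pop(): the
             * entity's job_queue only becomes empty after
             * last_scheduled already points at the real last fence,
             * so drm_sched_entity_fini() can no longer observe the
             * stale fence once the queue is drained.
             */
            dma_fence_put(entity->last_scheduled);
            entity->last_scheduled =
                    dma_fence_get(&sched_job->s_fence->finished);

            spsc_queue_pop(&entity->job_queue);

            return sched_job;
    }
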
Christian.
> /Monk
>
> *From:* Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
> *Sent:* April 18, 2018 16:36
> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>; Deng, Emily <Emily.Deng at amd.com>
> *Cc:* amd-gfx at lists.freedesktop.org
> *Subject:* Re: force app kill patch
>
> See that in “drm_sched_entity_fini()” we only call
> dma_fence_put(entity->last_scheduled) under the condition of “if
> (entity->fini_status)”.
>
> This way there is a memory leak for the case of “entity->fini_status == 0”.
>
> Good catch, we indeed should fix that.
>
>
> 1. drm_sched_entity_fini(): it exits right after entity->job_queue is
> empty [ but at that time the scheduler is not fast enough to have
> dealt with this entity yet ]
>
> That should never happen.
>
> The last job from the entity->job_queue is only removed after the
> scheduler is done with the entity (at least that was the original
> idea, not sure if that still works as expected).
>
> Regards,
> Christian.
>
> On 18.04.2018 at 09:20, Liu, Monk wrote:
>
> *Correction to the scenario*
>
> After we move dma_fence_put(entity->last_scheduled) out of the
> fini_status check:
>
> A potential race scenario:
>
> 1. drm_sched_entity_fini(): it exits right after entity->job_queue is
> empty [ but at that time the scheduler is not fast enough to have
> dealt with this entity yet ]
>
> 2. drm_sched_entity_cleanup(): it calls
> dma_fence_put(entity->last_scheduled) [ but at this time
> entity->last_scheduled actually points to the fence prior to the
> real last one ]
>
> 3. drm_sched_main(), now dealing with this entity, calls
> dma_fence_put(entity->last_scheduled) [ now this fence gets a
> double put !!! ]
>
> 4. drm_sched_main() now calls dma_fence_get() on the *real* last one!
>
> So eventually the real last fence is leaked and, more critically, the
> double-put fence causes a NULL pointer access.
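>
> To make the interleaving concrete, a hypothetical timeline written as
> a C comment (fence numbering follows the steps above; this is an
> illustration, not code from the driver):
>
>     /*
>      * drm_sched_entity_fini()/cleanup()    scheduler main loop
>      * ---------------------------------    -------------------
>      * sees job_queue empty, returns
>      * dma_fence_put(last_scheduled)
>      *   puts the fence *prior* to the
>      *   real last one
>      *                                      dma_fence_put(last_scheduled)
>      *                                        same fence put again:
>      *                                        double put -> NULL deref
>      *                                      last_scheduled =
>      *                                          dma_fence_get(real last)
>      *                                        real last fence is never
>      *                                        put: leaked
>      */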
>
> /Monk
>
> *From:* Liu, Monk
> *Sent:* April 18, 2018 15:11
> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Deng, Emily
> <Emily.Deng at amd.com>
> *Cc:* amd-gfx at lists.freedesktop.org
> *Subject:* force app kill patch
>
> Hi Christian & Emily
>
> I think the v4 fix for “fix force app kill hang” is still not good
> enough:
>
> First:
>
> See that in “drm_sched_entity_fini()” we only call
> dma_fence_put(entity->last_scheduled) under the condition of “if
> (entity->fini_status)”.
>
> This way there is a memory leak for the case of “entity->fini_status == 0”.
>
> Second:
>
> If we move dma_fence_put(entity->last_scheduled) out of the
> condition of “if (entity->fini_status)”, the memory leak issue can
> be fixed (a rough sketch of that change follows below).
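>
> A rough sketch of the change under discussion, assuming the
> drm_sched_entity_fini() layout of the time (the surrounding job
> cleanup is elided; hypothetical in detail, an illustration rather
> than the final code):
>
>     void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
>                                struct drm_sched_entity *entity)
>     {
>             /* ... wait for the entity to idle; when fini_status is
>              * set, kill the jobs still sitting in the queue ... */
>
>             /* Moved out of the “if (entity->fini_status)” check so
>              * the reference held on last_scheduled is always dropped,
>              * fixing the leak for the fini_status == 0 case. Without
>              * further synchronization, though, this is exactly what
>              * opens the race described next.
>              */
>             dma_fence_put(entity->last_scheduled);
>             entity->last_scheduled = NULL;
>     }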
>
> But there will be a kernel NULL pointer access: I think the
> dma_fence_put(entity->last_scheduled) call may actually be executed
> **not** on the last scheduled fence of this entity, because it runs
> without the “thread_park/unpark” pair which makes sure the scheduler
> is not dealing with this entity.
>
> So with a certain race, here is the scenario:
>
> 1. The scheduler is doing the dma_fence_put() on the 1st fence.
>
> 2. The scheduler sets entity->last_scheduled to the 1st fence.
>
> 3. Now drm_sched_entity_fini() runs, and it calls dma_fence_put() on
> entity->last_scheduled.
>
> 4. Now this 1st fence is actually put twice, and the real last
> fence won't get put as expected.
>
> Any idea?
>
> /Monk
>