System error

1577332900 1577332900 at qq.com
Tue Sep 3 12:31:55 UTC 2019


Hi all,
       Some processes are stuck in D (uninterruptible sleep) state. The stack of one of them is:
 
#0 [ffff0001343e3a40] __switch_to at ffff000008088870
    /usr/src/linux-4.19.36-1.2.159.aarch64/arch/arm64/kernel/process.c: 491
#1 [ffff0001343e3a60] __schedule at ffff000008bf8508
    /usr/src/linux-4.19.36-1.2.159.aarch64/kernel/sched/core.c: 2851
#2 [ffff0001343e3af0] schedule at ffff000008bf8be8
    /usr/src/linux-4.19.36-1.2.159.aarch64/kernel/sched/core.c: 3543
#3 [ffff0001343e3b00] drm_sched_entity_flush at ffff000000ce6054 [gpu_sched]
    /usr/src/linux-4.19.36-1.2.159.aarch64/drivers/gpu/drm/scheduler/sched_entity.c: 187     ----3 
#4 [ffff0001343e3b70] drm_sched_entity_destroy at ffff000000ce6430 [gpu_sched]
    /usr/src/linux-4.19.36-1.2.159.aarch64/drivers/gpu/drm/scheduler/sched_entity.c: 317
#5 [ffff0001343e3b90] amdgpu_vm_fini at ffff0000019f8054 [amdgpu]
    /usr/src/linux-4.19.36-1.2.159.aarch64/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c: 2883
#6 [ffff0001343e3c20] amdgpu_driver_postclose_kms at ffff0000019c856c [amdgpu]
    /usr/src/linux-4.19.36-1.2.159.aarch64/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c: 993
#7 [ffff0001343e3c90] drm_file_free at ffff000000fe44dc [drm]
    /usr/src/linux-4.19.36-1.2.159.aarch64/drivers/gpu/drm/drm_file.c: 254
#8 [ffff0001343e3cf0] drm_release at ffff000000fe4bd4 [drm]
    /usr/src/linux-4.19.36-1.2.159.aarch64/drivers/gpu/drm/drm_file.c: 215
#9 [ffff0001343e3d40] __fput at ffff000008338368    ----2
    /usr/src/linux-4.19.36-1.2.159.aarch64/fs/file_table.c: 278
#10 [ffff0001343e3d90] delayed_fput at ffff000008338514
    /usr/src/linux-4.19.36-1.2.159.aarch64/fs/file_table.c: 304
#11 [ffff0001343e3db0] process_one_work at ffff00000810e7e0
    /usr/src/linux-4.19.36-1.2.159.aarch64/kernel/workqueue.c: 2153
#12 [ffff0001343e3e00] worker_thread at ffff00000810ec60  ----1
    /usr/src/linux-4.19.36-1.2.159.aarch64/kernel/workqueue.c: 2212
#13 [ffff0001343e3e70] kthread at ffff000008115e60


A kernel worker running delayed_fput calls drm_release() once the DRM file is no longer in use (points 1 and 2 in the stack above).
But at point 3, drm_sched_entity_flush() does not take the wait_event_timeout() path.
The code is:
long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
{
        struct drm_gpu_scheduler *sched;
        struct task_struct *last_user;
        long ret = timeout;

        if (!entity->rq)
                return 0;

        sched = entity->rq->sched;
        /**
         * The client will not queue more IBs during this fini, consume existing
         * queued IBs or discard them on SIGKILL
         */
        /*
         * My note: when the caller is a kernel task such as the delayed-fput
         * worker, PF_EXITING is not set on it, so this branch is not taken,
         * even though the application process has already exited.
         */
        if (current->flags & PF_EXITING) {
                if (timeout)
                        ret = wait_event_timeout(
                                        sched->job_scheduled,
                                        drm_sched_entity_is_idle(entity),
                                        timeout);
        } else {
                wait_event_killable(sched->job_scheduled,
                                    drm_sched_entity_is_idle(entity));
        }


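So when the current task is an asynchronous kernel task such as the delayed-fput worker, the PF_EXITING branch is not taken, even though the application process has already exited. The worker falls through to wait_event_killable() instead and sleeps there.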
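
To make the branch selection concrete, here is a small user-space model of the check quoted above. It is only a sketch: struct task_model, enum wait_mode and current_wait_mode() are hypothetical names invented for illustration and are not kernel APIs. Only the two PF_* values are taken from include/linux/sched.h.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as defined in include/linux/sched.h (Linux 4.19). */
#define PF_EXITING 0x00000004u /* task is being shut down */
#define PF_KTHREAD 0x00200000u /* task is a kernel thread */

/* Hypothetical stand-in for the single task_struct field the check reads. */
struct task_model {
    unsigned int flags;
};

/* Hypothetical labels for the paths in the quoted excerpt. */
enum wait_mode {
    WAIT_BOUNDED,  /* wait_event_timeout() would be called */
    WAIT_SKIPPED,  /* PF_EXITING is set but timeout == 0, no wait at all */
    WAIT_KILLABLE  /* wait_event_killable() would be called */
};

/* Mirrors the if/else on current->flags in the quoted excerpt. */
enum wait_mode current_wait_mode(const struct task_model *cur, long timeout)
{
    assert(cur != NULL);
    if (cur->flags & PF_EXITING)
        return timeout ? WAIT_BOUNDED : WAIT_SKIPPED;
    return WAIT_KILLABLE;
}
```

With flags = PF_KTHREAD (a workqueue worker, no PF_EXITING) the model returns WAIT_KILLABLE, which matches the stack above. As far as I understand, kernel threads ignore signals unless they opt in with allow_signal(), so nothing can end that wait early for a worker.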


Why does drm_sched_entity_flush() not handle the case where an asynchronous kernel thread calls it after the application has already exited?
Can I add a check for asynchronous kernel threads, and in that case call wait_event_timeout() and drm_sched_rq_remove_entity()?
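
A rough user-space sketch of the check I have in mind, under stated assumptions: it treats PF_KTHREAD like PF_EXITING when choosing the bounded wait, and removes the entity only if that wait timed out. The names task_model, use_bounded_wait() and should_remove_entity() are hypothetical and exist only for this illustration. Whether this is safe for every kernel-thread caller is exactly the open question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values as defined in include/linux/sched.h (Linux 4.19). */
#define PF_EXITING 0x00000004u /* task is being shut down */
#define PF_KTHREAD 0x00200000u /* task is a kernel thread */

/* Hypothetical stand-in for the single task_struct field the check reads. */
struct task_model {
    unsigned int flags;
};

/*
 * Assumed change: a kernel thread flushing on behalf of a client that is
 * already gone also takes the wait_event_timeout() path.
 */
bool use_bounded_wait(const struct task_model *cur)
{
    assert(cur != NULL);
    return (cur->flags & (PF_EXITING | PF_KTHREAD)) != 0;
}

/*
 * Assumed follow-up: wait_event_timeout() returns 0 when the timeout
 * elapsed and the condition is still false. In that case a kernel-thread
 * caller would go on to call drm_sched_rq_remove_entity().
 */
bool should_remove_entity(const struct task_model *cur, long wait_ret)
{
    assert(cur != NULL);
    return (cur->flags & PF_KTHREAD) != 0 && wait_ret == 0;
}
```

In this model a worker with PF_KTHREAD set gets the bounded wait, and the entity is dropped only when that wait returns 0.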
Thanks.


Remarks:
Stack of a process blocked in the same function:
# cat /proc/336121/stack
[<0>] __switch_to+0x94/0xe8
[<0>] drm_sched_entity_flush+0xf8/0x248 [gpu_sched]
[<0>] amdgpu_ctx_mgr_entity_flush+0xac/0x148 [amdgpu]
[<0>] amdgpu_flush+0x2c/0x50 [amdgpu]
[<0>] filp_close+0x40/0xa0
[<0>] put_files_struct+0x118/0x120
[<0>] put_files_struct+0x30/0x68 [binder_linux]
[<0>] binder_deferred_func+0x4d4/0x658 [binder_linux]
[<0>] process_one_work+0x1b4/0x3f8
[<0>] worker_thread+0x54/0x470
[<0>] kthread+0x134/0x138
[<0>] ret_from_fork+0x10/0x18
[<0>] 0xffffffffffffffff

More information about the amd-gfx mailing list