[PATCH 0/5] Patch series to implement guilty ctx/entity for SRIOV TDR
Monk Liu
Monk.Liu at amd.com
Mon May 1 07:22:46 UTC 2017
Sometimes user space submits a bad command stream to the kernel. With the current
scheme, the gpu-scheduler always resubmits all unsignaled jobs to the hw ring after a
GPU reset, so the bad submission triggers a GPU hang again and again, indefinitely.
This patch series implements a scheme called guilty context, which avoids resubmitting
malicious jobs and invalidates the context behind them. That way regular
applications can continue to run, and other VFs lose less GPU time.
The guilty criterion is simple: if a job hangs more times than the threshold allows, we
consider it guilty, invalidate the context behind it, and pop all jobs out of
its entities on each scheduler. The next IOCTL on this CTX handle gets an -ENODEV
error, so the UMD knows the context was released by the driver because of its malicious
command submission.
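
As a rough illustration of the accounting described above, here is a minimal
user-space sketch. It is not the code from the patches: the names sketch_ctx,
sketch_ctx_note_hang, sketch_ctx_submit and the HANG_LIMIT value are all
hypothetical, and the real series works on amdgpu_ctx and the scheduler entities.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold; the real value is not stated in this cover letter. */
#define HANG_LIMIT 2

/* Hypothetical stand-in for a context and the jobs queued in its entities. */
struct sketch_ctx {
	int  hang_count;   /* how many times a job of this context has hung */
	bool guilty;       /* set once hang_count exceeds HANG_LIMIT */
	int  queued_jobs;  /* jobs still waiting in the context's entities */
};

/* Called from the reset path for the context that owns the hung job. */
static void sketch_ctx_note_hang(struct sketch_ctx *ctx)
{
	assert(ctx != NULL);
	if (++ctx->hang_count > HANG_LIMIT) {
		ctx->guilty = true;
		ctx->queued_jobs = 0;  /* pop every pending job; none is resubmitted */
	}
}

/* Stands in for the submission IOCTL: a guilty context is refused. */
static int sketch_ctx_submit(struct sketch_ctx *ctx)
{
	assert(ctx != NULL);
	if (ctx->guilty)
		return -ENODEV;
	ctx->queued_jobs++;
	return 0;
}
```

With these assumptions, a context survives HANG_LIMIT hangs; on the next hang it
is marked guilty, its queue is emptied, and every later submit returns -ENODEV.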
Monk Liu (5):
drm/amdgpu:keep ctx alive till all job finished
drm/amdgpu:some modifications in amdgpu_ctx
drm/amdgpu:Impl guilty ctx feature for sriov TDR
drm/amdgpu:change sriov_gpu_reset interface
drm/amdgpu:sriov TDR only recover hang ring
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 12 +++-
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 26 ++++----
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 39 ++++++++++--
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 43 ++++++++++---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 +
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 6 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 30 +++++++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 2 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 2 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 2 +-
drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 87 ++++++++++++++++++++++++---
drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 3 +
13 files changed, 209 insertions(+), 47 deletions(-)
--
2.7.4