[PATCH 0/5] Patch series to implement guilty ctx/entity for SRIOV TDR

Mon May 1 07:22:46 UTC 2017

Sometimes user space submits a bad command stream to the kernel. With the
current scheme, the gpu-scheduler always resubmits all unsignaled jobs to the
hw ring after a GPU reset, so the bad submission triggers GPU hangs indefinitely.

This patch series implements a mechanism called guilty context, which avoids
resubmitting malicious jobs and invalidates the context behind them. That way
regular applications can continue to run, and other VFs lose less GPU time.

The guilty rule is simple: if the number of times a job hangs exceeds the
threshold, we consider it guilty, invalidate the context behind it, and pop all
jobs out of its entities on each scheduler. The next IOCTL on this CTX handle
gets an -ENODEV error, so the UMD knows the driver released the context because
of its malicious command submission.
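
As a rough illustration of the rule described above, here is a minimal,
standalone sketch. It is not the code from these patches: the names
sketch_ctx, sketch_job, sketch_handle_hang, sketch_submit, HANG_LIMIT and
MAX_JOBS are all hypothetical, the threshold value is an assumption, and the
per-scheduler entities are collapsed into a single pending-job counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define HANG_LIMIT 2	/* hypothetical threshold, not the driver's value */
#define MAX_JOBS   8	/* hypothetical queue depth for this sketch */

/* Simplified stand-in for a context and its scheduler entities. */
struct sketch_ctx {
	bool guilty;	/* set once a job of this context is found guilty */
	int queued;	/* jobs still pending in this context's entities */
};

/* Simplified stand-in for a job that tracks how often it hung. */
struct sketch_job {
	struct sketch_ctx *ctx;
	int hang_count;
};

/*
 * Called from the (imagined) reset path for the job that hung.
 * Returns true if the job may be resubmitted, false if it is guilty.
 */
static bool sketch_handle_hang(struct sketch_job *job)
{
	assert(job != NULL && job->ctx != NULL);

	job->hang_count++;
	if (job->hang_count > HANG_LIMIT) {
		job->ctx->guilty = true;	/* invalidate the context */
		job->ctx->queued = 0;		/* pop out all pending jobs */
		return false;
	}
	return true;
}

/* Submission path: a guilty context is rejected with -ENODEV. */
static int sketch_submit(struct sketch_ctx *ctx)
{
	if (ctx->guilty)
		return -ENODEV;
	if (ctx->queued >= MAX_JOBS)
		return -EBUSY;
	ctx->queued++;
	return 0;
}
```

In this sketch a job is resubmitted for its first HANG_LIMIT hangs; on the next
hang its context is marked guilty, its pending jobs are dropped, and every
later sketch_submit() on that context returns -ENODEV.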

Monk Liu (5):
  drm/amdgpu:keep ctx alive till all job finished
  drm/amdgpu:some modifications in amdgpu_ctx
  drm/amdgpu:Impl guilty ctx feature for sriov TDR
  drm/amdgpu:change sriov_gpu_reset interface
  drm/amdgpu:sriov TDR only recover hang ring

 drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 12 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 26 ++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 39 ++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 43 ++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |  6 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 30 +++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |  2 +-
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 87 ++++++++++++++++++++++++---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  3 +
 13 files changed, 209 insertions(+), 47 deletions(-)

-- 
2.7.4