[PATCH 0/5] Patch series to implement guilty ctx/entity for SRIOV TDR

Christian König deathsimple at vodafone.de
Mon May 1 14:53:54 UTC 2017


On 01.05.2017 at 09:22, Monk Liu wrote:
> Sometimes user space submits a bad command stream to the kernel, and with the current
> scheme the gpu-scheduler always resubmits all unsignaled jobs to the hw ring after a GPU
> reset, so this bad submission will trigger GPU hangs indefinitely.
>
> This patch series implements a mechanism called guilty context, which avoids resubmitting
> malicious jobs and invalidates the context behind them. That way regular
> applications can continue to run, and other VFs lose less GPU time.
>
> The criterion for guilt is simple: if a job hangs more times than the threshold allows,
> we consider it guilty, invalidate the context behind it, and pop all jobs from
> its entities on each scheduler. The next IOCTL on this CTX handle will get an -ENODEV
> error, so the UMD knows the context was released by the driver because of its malicious
> command submission.

NAK to the whole approach. That would require contexts to be kept alive
until all of their jobs have finished, which is a NO-GO for resource
management.

A process which is killed should release all of its resources as fast
as possible and not block waiting for its last GPU command to finish
(only what the last GPU command is using should be kept alive).

Instead, build the whole thing around the fence status. If a job is
guilty, we note that in the fence's status field.

Then on the context query we check the status of the pending fences and
can judge whether the context is guilty or not.
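
The idea can be sketched roughly as follows. This is a standalone
userspace C illustration, not driver code: the names sketch_fence,
sketch_ctx, sketch_fence_mark_guilty and sketch_ctx_is_guilty are
hypothetical, and using -ECANCELED as the error value is an assumption
made only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_FENCES 16

/* Hypothetical stand-in for a fence. status == 0 means "no error";
 * a negative errno value marks the job behind the fence as guilty. */
struct sketch_fence {
	int status;
};

/* Hypothetical stand-in for a context: it only remembers the fences
 * of its recent submissions and holds no reference to the jobs. */
struct sketch_ctx {
	struct sketch_fence *fences[SKETCH_MAX_FENCES];
	size_t num_fences;
};

/* Reset path: record the guilt in the fence itself.
 * -ECANCELED is an assumed error code for this sketch. */
static void sketch_fence_mark_guilty(struct sketch_fence *fence)
{
	assert(fence != NULL);
	fence->status = -ECANCELED;
}

/* Context query: the context is judged guilty if any of the fences
 * it still tracks carries an error status. */
static bool sketch_ctx_is_guilty(const struct sketch_ctx *ctx)
{
	assert(ctx != NULL);
	for (size_t i = 0; i < ctx->num_fences; i++) {
		if (ctx->fences[i] && ctx->fences[i]->status < 0)
			return true;
	}
	return false;
}
```

In this sketch the guilt lives in the fence, so the context does not
have to outlive its jobs; the query only inspects whatever fences the
context still tracks at that moment.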

Regards,
Christian.

>
> Monk Liu (5):
>    drm/amdgpu:keep ctx alive till all job finished
>    drm/amdgpu:some modifications in amdgpu_ctx
>    drm/amdgpu:Impl guilty ctx feature for sriov TDR
>    drm/amdgpu:change sriov_gpu_reset interface
>    drm/amdgpu:sriov TDR only recover hang ring
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 12 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 26 ++++----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 39 ++++++++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 43 ++++++++++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  3 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |  6 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 30 +++++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |  2 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |  2 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |  2 +-
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 87 ++++++++++++++++++++++++---
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  3 +
>   13 files changed, 209 insertions(+), 47 deletions(-)
>


