[PATCH 3/3] drm/amdgpu: Refresh rq selection for job after ASIC reset
Koenig, Christian
Christian.Koenig at amd.com
Fri Oct 19 07:08:51 UTC 2018
Am 18.10.18 um 20:44 schrieb Andrey Grodzovsky:
> A ring might become unusable after reset, if that the case
> drm_sched_entity_select_rq will choose another, working rq
> to run the job if there is one.
> Also, skip recovery of ring which is not ready.
Well that is not even remotely sufficient.
If we can't bring up a ring any more after a reset we would need to move
all jobs which where previously scheduled on it and all of its entities
to a different instance.
What you do here breaks dependencies between jobs and can result in
unforeseen consequences including random memory writes.
As far as I can see that can't be done correctly with the current
scheduler design.
Additional to that when we can't restart one instance of a ring we
usually can't restart all others either. So the whole approach is rather
pointless.
Regards,
Christian.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index d11489e..3124ca1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> else
> r = amdgpu_device_reset(adev);
>
> + /*
> + * After reboot a ring might fail in which case this will
> + * move the job to different rq if possible
> + */
> + if (job) {
> + drm_sched_entity_select_rq(job->base.entity);
> + if (job->base.entity->rq) {
> + job->base.sched = job->base.entity->rq->sched;
> + } else {
> + job->base.sched = NULL;
> + r = -ENOENT;
> + }
> + }
> +
> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> struct amdgpu_ring *ring = adev->rings[i];
>
> - if (!ring || !ring->sched.thread)
> + if (!ring || !ring->ready || !ring->sched.thread)
> continue;
>
> /* only need recovery sched of the given job's ring
More information about the amd-gfx
mailing list