[PATCH 3/3] drm/amdgpu: Refresh rq selection for job after ASIC reset
Grodzovsky, Andrey
Andrey.Grodzovsky at amd.com
Fri Oct 19 15:06:36 UTC 2018
On 10/19/2018 03:08 AM, Koenig, Christian wrote:
> Am 18.10.18 um 20:44 schrieb Andrey Grodzovsky:
>> A ring might become unusable after reset, if that the case
>> drm_sched_entity_select_rq will choose another, working rq
>> to run the job if there is one.
>> Also, skip recovery of ring which is not ready.
> Well that is not even remotely sufficient.
>
> If we can't bring up a ring any more after a reset we would need to move
> all jobs which where previously scheduled on it and all of its entities
> to a different instance.
That one should be easy to add inside amdgpu_device_gpu_recover in case
ring is dead, we just do the same for all the jobs in
sched->ring_mirror_list of the dead ring
before doing recovery for them, no?
>
> What you do here breaks dependencies between jobs and can result in
> unforeseen consequences including random memory writes.
Can you explain this a bit more ? AFAIK any job dependent on this job
will still wait for it's completion
before running regardless of did this job moved to a different ring.
What am I missing ?
>
> As far as I can see that can't be done correctly with the current
> scheduler design.
>
> Additional to that when we can't restart one instance of a ring we
> usually can't restart all others either. So the whole approach is rather
> pointless.
From my testing looks like we can, compute ring 0 is dead but IB tests
pass on other compute rings.
Andrey
>
> Regards,
> Christian.
>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index d11489e..3124ca1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>> else
>> r = amdgpu_device_reset(adev);
>>
>> + /*
>> + * After reboot a ring might fail in which case this will
>> + * move the job to different rq if possible
>> + */
>> + if (job) {
>> + drm_sched_entity_select_rq(job->base.entity);
>> + if (job->base.entity->rq) {
>> + job->base.sched = job->base.entity->rq->sched;
>> + } else {
>> + job->base.sched = NULL;
>> + r = -ENOENT;
>> + }
>> + }
>> +
>> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> struct amdgpu_ring *ring = adev->rings[i];
>>
>> - if (!ring || !ring->sched.thread)
>> + if (!ring || !ring->ready || !ring->sched.thread)
>> continue;
>>
>> /* only need recovery sched of the given job's ring
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
More information about the amd-gfx
mailing list