[PATCH 3/3] drm/amdgpu: Refresh rq selection for job after ASIC reset
Christian König
ckoenig.leichtzumerken at gmail.com
Fri Oct 19 16:28:25 UTC 2018
On 19.10.18 at 17:06, Grodzovsky, Andrey wrote:
>
> On 10/19/2018 03:08 AM, Koenig, Christian wrote:
>> On 18.10.18 at 20:44, Andrey Grodzovsky wrote:
>>> A ring might become unusable after a reset; if that is the case,
>>> drm_sched_entity_select_rq will choose another, working rq
>>> to run the job on, if there is one.
>>> Also, skip recovery of rings which are not ready.
>> Well that is not even remotely sufficient.
>>
>> If we can't bring up a ring any more after a reset we would need to move
>> all jobs which were previously scheduled on it and all of its entities
>> to a different instance.
> That one should be easy to add inside amdgpu_device_gpu_recover: in case
> the ring is dead, we just do the same for all the jobs in
> sched->ring_mirror_list of the dead ring
> before doing recovery for them, no?
Correct, you need to execute all jobs from the ring_mirror_list as well
as move all entities to other rqs.
But there are some problems with that, see below.
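
Very roughly, the mirror-list half of that could look like the sketch
below. It is only an illustration, not a tested patch: the field names
(ring_mirror_list, drm_sched_job.node, job->sched) assume the current
scheduler, both schedulers are assumed to be parked during recovery so
no locking is shown, and the per-ring state baked into the jobs is
exactly what it does not handle.

#include <linux/list.h>
#include <drm/gpu_scheduler.h>

static void move_dead_ring_jobs_sketch(struct drm_gpu_scheduler *dead,
				       struct drm_gpu_scheduler *target)
{
	struct drm_sched_job *s_job, *tmp;

	list_for_each_entry_safe(s_job, tmp, &dead->ring_mirror_list, node) {
		/* Re-target the job and move it to the other scheduler's
		 * mirror list.  The job's finished fence is unchanged, so
		 * dependent jobs keep waiting correctly; the hardware
		 * specific preparation of the job is NOT redone here. */
		s_job->sched = target;
		list_move_tail(&s_job->node, &target->ring_mirror_list);
	}
}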
>> What you do here breaks dependencies between jobs and can result in
>> unforeseen consequences including random memory writes.
> Can you explain this a bit more? AFAIK any job dependent on this job
> will still wait for its completion
> before running, regardless of whether this job moved to a different ring.
> What am I missing?
The stuff dependent on the moved jobs should indeed work correctly.
But you don't necessarily have space to execute the ring_mirror_list on
another scheduler. Added to that, the jobs are prepared to run on a
specific ring, e.g. UVD jobs are patched, VMIDs assigned etc.
So that most likely won't work correctly.
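
To spell out why the dependencies themselves are fine: consumers wait on
the producer job's finished fence, and that stays the same dma_fence
object no matter which scheduler ends up running the job. A minimal
sketch, assuming the current drm_sched_fence layout (s_fence->finished),
illustrative only:

#include <linux/dma-fence.h>
#include <drm/gpu_scheduler.h>

/* Sketch: a consumer blocking on a producer job, wherever that job runs. */
static long wait_for_producer_sketch(struct drm_sched_job *producer)
{
	struct dma_fence *finished = &producer->s_fence->finished;

	/* Returns 0 once the producer has completed, or a negative error. */
	return dma_fence_wait(finished, true);
}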
>> As far as I can see that can't be done correctly with the current
>> scheduler design.
>>
>> Additionally, when we can't restart one instance of a ring we
>> usually can't restart the others either. So the whole approach is rather
>> pointless.
> From my testing it looks like we can: compute ring 0 is dead but IB tests
> pass on other compute rings.
Interesting, but I would rather investigate why compute ring 0 is dead
while the others still work.
At least some compute rings should be handled by the same engine, so if
one is dead all the others should be dead as well.
Christian.
>
> Andrey
>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index d11489e..3124ca1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>> else
>>> r = amdgpu_device_reset(adev);
>>>
>>> + /*
>>> + * After a reset a ring might fail, in which case this will
>>> + * move the job to a different rq if possible
>>> + */
>>> + if (job) {
>>> + drm_sched_entity_select_rq(job->base.entity);
>>> + if (job->base.entity->rq) {
>>> + job->base.sched = job->base.entity->rq->sched;
>>> + } else {
>>> + job->base.sched = NULL;
>>> + r = -ENOENT;
>>> + }
>>> + }
>>> +
>>> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> struct amdgpu_ring *ring = adev->rings[i];
>>>
>>> - if (!ring || !ring->sched.thread)
>>> + if (!ring || !ring->ready || !ring->sched.thread)
>>> continue;
>>>
>>> /* only need recovery sched of the given job's ring