[PATCH] drm/amdkfd: Restore all processes on post reset

Felix Kuehling felix.kuehling at amd.com
Tue Aug 3 15:38:56 UTC 2021


On 2021-08-03 11:02 a.m., Eric Huang wrote:
>
>
> On 2021-07-30 5:26 p.m., Felix Kuehling wrote:
>> On 2021-07-28 1:31 p.m., Eric Huang wrote:
>>> It fixes a bug in gpu_recovery on multiple GPUs:
>>> when one GPU is reset, applications running on the other
>>> GPUs hang, because KFD post reset doesn't restore the
>>> running processes.
>> This will resume all processes, even those that were affected by the GPU
>> reset.
>>
>> The assumption we made here is, that KFD processes can use all GPUs. So
>> when one GPU is reset, all processes are affected. If we want to refine
>> that, we'll need some more complex logic to identify the processes
>> affected by a GPU reset and keep only those processes suspended. This
>> could be based on the GPUs that a process has created queues on, or
>> allocated memory on.
>>
>> What we don't want, is processes continuing with bad data or
>> inconsistent state after a GPU reset.
>>
> The current code doesn't take care of this assumption. When a GPU hangs,
> evicting queues fails on it, we roll back and restore all the processes'
> queues on the other GPUs, and they continue running with unclear state and
> data after the GPU reset.

That's a bug.


>
> The original thought behind this patch is to call
> kfd_suspend_all_processes and kfd_restore_all_processes as a pair in
> pre_reset and post_reset. That keeps the behavior consistent for
> both amdgpu_gpu_recover and hang_hws.

This is true for suspend/resume, but not for GPU reset. After a GPU
reset we want queues of affected processes to stay evicted.
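
To make "affected processes" more concrete, the check could be as simple as
the sketch below. kfd_process_affected_by_reset is a hypothetical helper,
based on the idea above of looking at which GPUs a process has created
queues on; memory allocated on the GPU would need a similar check and is
left out here:

static bool kfd_process_affected_by_reset(struct kfd_process *p,
                                          struct kfd_dev *reset_dev)
{
    int i;

    for (i = 0; i < p->n_pdds; i++) {
        struct kfd_process_device *pdd = p->pdds[i];

        /* Affected if the process has queues on the GPU being reset */
        if (pdd->dev == reset_dev && !list_empty(&pdd->qpd.queues_list))
            return true;
    }

    return false;
}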


>>>
>>> And it also fixes a bug in the function
>>> kfd_process_evict_queues: when one GPU hangs, processes
>>> running on other GPUs can't be evicted.
>> This should be a separate patch. The code you're removing was there as
>> an attempt to make kfd_process_evict_queues transactional. That means
>> it either succeeds completely or fails completely. I wanted to avoid
>> putting the system into an unknown state where some queues are suspended
>> and others are not and the caller has no idea how to proceed. So I roll
>> back a partial queue eviction if something failed.
> Can we let the caller decide if a roll-back is needed? Because not all
> callers need to roll back, e.g. kfd_suspend_all_processes and
> kgd2kfd_quiesce_mm.

I don't think so. If suspend partially succeeds, how can the caller do a
partial roll back? The caller doesn't have the information about what
succeeded and what failed.

But you have a good point. It may depend on the caller whether a
rollback is needed or not. In the case of GPU reset, we don't care so
much about the HW state, but we do care about the queue eviction state.
That must be updated in order to prevent queues from resuming after the
reset, no matter what else fails. This could be done by adding a "bool
force" flag to kfd_suspend_all_processes and kfd_process_evict_queues.
If the flag is set we don't roll back and keep going, suspending
everything even after partial failures.

I think kfd_suspend_all_processes should return an error if force=false
and some error occurred so that kgd2kfd_suspend can return a failure. In
the case of suspend/resume that can stop the suspend process.
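
Roughly what I have in mind, as a sketch only; the "force" parameter does
not exist today and the exact plumbing may look different on your branch:

int kfd_process_evict_queues(struct kfd_process *p, bool force)
{
    unsigned int n_evicted = 0;
    int ret = 0;
    int i;

    for (i = 0; i < p->n_pdds; i++) {
        struct kfd_process_device *pdd = p->pdds[i];
        int r;

        r = pdd->dev->dqm->ops.evict_process_queues(pdd->dev->dqm,
                                                    &pdd->qpd);
        if (r) {
            pr_err("Failed to evict process queues\n");
            ret = r;
            if (!force)
                goto fail;      /* keep today's roll-back */
            continue;           /* force: keep evicting the rest */
        }
        n_evicted++;
    }

    return ret;

fail:
    /* Roll back the partial eviction so the caller gets an
     * all-or-nothing result, same as the current code.
     */
    for (i = 0; i < p->n_pdds; i++) {
        struct kfd_process_device *pdd = p->pdds[i];

        if (n_evicted == 0)
            break;
        if (pdd->dev->dqm->ops.restore_process_queues(pdd->dev->dqm,
                                                      &pdd->qpd))
            pr_err("Failed to restore queues\n");
        n_evicted--;
    }

    return ret;
}

kgd2kfd_pre_reset would then suspend with force=true so the eviction state
is updated no matter what, while kgd2kfd_quiesce_mm and the normal suspend
path keep force=false and get the roll-back behavior.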


>>
>> You're changing this to "try to evict as much as possible". Then a failure
>> does not mean "eviction failed" but "eviction completed but something
>> hung". Then the GPU reset can take care of the hanging part. I think
>> that's a reasonable strategy. But we need to be sure that there are no
>> other error conditions (other than hangs) that could cause a queue
>> eviction to fail.
>>
>> There were some recent changes in pqm_destroy_queue that check the
>> return value of dqm->ops.destroy_queue, which indirectly comes from
>> execute_queues (same as in the eviction case). -ETIME means a hang. Other
>> errors are possible. Those other errors may still need the roll-back.
>> Otherwise we'd leave stuff running on a non-hanging GPU after we
>> promised that we evicted everything.
> I think it depends on the scenario. GPU reset doesn't need to know
> the return state. Memory eviction may need it. Does the MMU notifier
> invalidate range need it?

You're right in the sense that for a GPU reset we don't care about the
HW state because we're about to reset it anyway. But we do care about
the queue eviction state of the queues (q->properties.is_evicted). After
the reset completes we don't want the queues of affected processes to be
resumed on the hardware, so they have to stay in the evicted state.

In the MMU notifier case (kgd2kfd_quiesce_mm) we do care about the HW
state. When the MMU notifier returns to the caller, it promises that the
HW will no longer access the memory. If the GPU is hanging, that's
probably OK as far as the MMU notifier cares. So this is another case
where we don't want to roll back. Other error cases (not GPU hangs)
would be problematic.
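
For illustration, with the hypothetical force parameter from above, the two
callers could end up looking like this (sketch only, not existing code):

/* GPU reset path: force eviction, never roll back, ignore the HW error. */
void suspend_all_for_reset_sketch(void)
{
    kfd_suspend_all_processes(true /* force */);
}

/* MMU notifier path: no force, roll back on real errors, but tolerate a
 * hang (-ETIME) because the pending reset will clean up the HW and the
 * memory is no longer accessed.
 */
int quiesce_mm_sketch(struct mm_struct *mm)
{
    struct kfd_process *p = kfd_lookup_process_by_mm(mm);
    int r;

    if (!p)
        return -ESRCH;

    r = kfd_process_evict_queues(p, false /* no force */);
    if (r == -ETIME)
        r = 0;

    kfd_unref_process(p);
    return r;
}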

Regards,
  Felix


>>
>> See one more comment inline.
>>
>>
>>> Signed-off-by: Eric Huang <jinhuieric.huang at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c  |  2 +-
>>>   drivers/gpu/drm/amd/amdkfd/kfd_process.c | 24 +-----------------------
>>>   2 files changed, 2 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index 24b5e0aa1eac..daf1c19bd799 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -984,7 +984,7 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
>>>      if (!kfd->init_complete)
>>>          return 0;
>>>
>>> -    ret = kfd_resume(kfd);
>>> +    ret = kgd2kfd_resume(kfd, false, true);
>> Which branch is this for? On amd-staging-drm-next kgd2kfd_resume only
>> has two parameters.
> Sorry, it is based on dkms staging 5.11. I didn't notice there was a
> difference between the two branches.
>
> Regards,
> Eric
>>
>> Regards,
>>    Felix
>>
>>
>>>       if (ret)
>>>           return ret;
>>>       atomic_dec(&kfd_locked);
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> index 38a9dee40785..9272a12c1db8 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> @@ -1879,36 +1879,14 @@ int kfd_process_evict_queues(struct kfd_process *p)
>>>  {
>>>      int r = 0;
>>>      int i;
>>> -    unsigned int n_evicted = 0;
>>>
>>>      for (i = 0; i < p->n_pdds; i++) {
>>>          struct kfd_process_device *pdd = p->pdds[i];
>>>
>>>          r = pdd->dev->dqm->ops.evict_process_queues(pdd->dev->dqm,
>>>                                  &pdd->qpd);
>>> -        if (r) {
>>> +        if (r)
>>>              pr_err("Failed to evict process queues\n");
>>> -            goto fail;
>>> -        }
>>> -        n_evicted++;
>>> -    }
>>> -
>>> -    return r;
>>> -
>>> -fail:
>>> -    /* To keep state consistent, roll back partial eviction by
>>> -     * restoring queues
>>> -     */
>>> -    for (i = 0; i < p->n_pdds; i++) {
>>> -        struct kfd_process_device *pdd = p->pdds[i];
>>> -
>>> -        if (n_evicted == 0)
>>> -            break;
>>> -        if (pdd->dev->dqm->ops.restore_process_queues(pdd->dev->dqm,
>>> -                                  &pdd->qpd))
>>> -            pr_err("Failed to restore queues\n");
>>> -
>>> -        n_evicted--;
>>>      }
>>>
>>>      return r;
>

