TDR and VRAM lost handling in KMD (v2)
Nicolai Hähnle
nicolai.haehnle at amd.com
Thu Oct 12 09:44:42 UTC 2017
On 12.10.2017 11:35, Michel Dänzer wrote:
> On 12/10/17 11:23 AM, Christian König wrote:
>> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:
>>> On 12.10.2017 10:49, Christian König wrote:
>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter does
>>>>> not imply that the context was lost.
>>>>>
>>>>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET
>>>>> if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>>>>
>>>>> As far as I understand it, the case of !guilty && ctx->reset_counter
>>>>> != adev->reset_counter && ctx->vram_lost_counter ==
>>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a
>>>>> GPU reset occurred, but it didn't affect our context.
>>>> I disagree on that.
>>>>
>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
>>>> was a reset but we haven't been causing it.
>>>>
>>>> That the OpenGL extension is specified otherwise is unfortunate, but
>>>> I think we shouldn't use that for the kernel interface here.
>>> Two counterpoints:
>>>
>>> 1. Why should any application care that there was a reset while it was
>>> idle? The OpenGL behavior is what makes sense.
>>
>> The application is certainly not interest if a reset happened or not,
>> but I though that the driver stack might be.
>>
>>>
>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
>>> because we never return it :)
>>>
>>
>> Good point.
>>
>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is
>>> in line with the OpenGL spec: we're conservatively returning that a
>>> reset happened because we don't know whether the context was affected,
>>> and we return UNKNOWN because we also don't know whether the context
>>> was guilty or not.
>>>
>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
>>> refinement and improvement of the current, overly conservative behavior.
>>
>> Ok let's reenumerate what I think the different return values should mean:
>>
>> * AMDGPU_CTX_GUILTY_RESET
>>
>> guilty is set to true for this context.
>>
>> * AMDGPU_CTX_INNOCENT_RESET
>>
>> guilty is not set and vram_lost_counter has changed.
>>
>> * AMDGPU_CTX_UNKNOWN_RESET
>>
>> guilty is not set and vram_lost_counter has not changed, but
>> gpu_reset_counter has changed.
>
> I don't understand the distinction you're proposing between
> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both
> cases you're describing should return either AMDGPU_CTX_INNOCENT_RESET,
> if the value of guilty is reliable, or AMDGPU_CTX_UNKNOWN_RESET if it's not.
I think it'd make more sense if it was called "AMDGPU_CTX_UNAFFECTED_RESET".
So:
- AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and
we know that it's the context's fault
- AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset, and
we know that it *wasn't* the context's fault (no context job active)
- AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset, and
we don't know who's responsible (this could be returned in the unlikely
case where context A's gfx job has not yet finished, but context B's gfx
job has already started; it could be the fault of A, it could be the
fault of B -- which somehow manages to hang a part of the hardware that
then prevents A's job from finishing -- or it could be both; but it's a
bit academic)
- AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context
wasn't affected
This last value would currently just be discarded by Mesa (because we
should only disturb applications when we have to), but perhaps somebody
else could find it useful?
Cheers,
Nicolai
More information about the amd-gfx
mailing list