TDR and VRAM lost handling in KMD (v2)

Thu Oct 12 09:44:42 UTC 2017

On 12.10.2017 11:35, Michel Dänzer wrote:
> On 12/10/17 11:23 AM, Christian König wrote:
>> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:
>>> On 12.10.2017 10:49, Christian König wrote:
>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter does
>>>>> not imply that the context was lost.
>>>>>
>>>>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET
>>>>> if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>>>>
>>>>> As far as I understand it, the case of !guilty && ctx->reset_counter
>>>>> != adev->reset_counter && ctx->vram_lost_counter ==
>>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a
>>>>> GPU reset occurred, but it didn't affect our context.
>>>> I disagree on that.
>>>>
>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
>>>> was a reset but we haven't been causing it.
>>>>
>>>> That the OpenGL extension is specified otherwise is unfortunate, but
>>>> I think we shouldn't use that for the kernel interface here.
>>> Two counterpoints:
>>>
>>> 1. Why should any application care that there was a reset while it was
>>> idle? The OpenGL behavior is what makes sense.
>>
>> The application is certainly not interest if a reset happened or not,
>> but I though that the driver stack might be.
>>
>>>
>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
>>> because we never return it :)
>>>
>>
>> Good point.
>>
>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is
>>> in line with the OpenGL spec: we're conservatively returning that a
>>> reset happened because we don't know whether the context was affected,
>>> and we return UNKNOWN because we also don't know whether the context
>>> was guilty or not.
>>>
>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
>>> refinement and improvement of the current, overly conservative behavior.
>>
>> Ok let's reenumerate what I think the different return values should mean:
>>
>> * AMDGPU_CTX_GUILTY_RESET
>>
>> guilty is set to true for this context.
>>
>> * AMDGPU_CTX_INNOCENT_RESET
>>
>> guilty is not set and vram_lost_counter has changed.
>>
>> * AMDGPU_CTX_UNKNOWN_RESET
>>
>> guilty is not set and vram_lost_counter has not changed, but
>> gpu_reset_counter has changed.
> 
> I don't understand the distinction you're proposing between
> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both
> cases you're describing should return either AMDGPU_CTX_INNOCENT_RESET,
> if the value of guilty is reliable, or AMDGPU_CTX_UNKNOWN_RESET if it's not.

I think it'd make more sense if it was called "AMDGPU_CTX_UNAFFECTED_RESET".

So:
- AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and 
we know that it's the context's fault
- AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset, and 
we know that it *wasn't* the context's fault (no context job active)
- AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset, and 
we don't know who's responsible (this could be returned in the unlikely 
case where context A's gfx job has not yet finished, but context B's gfx 
job has already started; it could be the fault of A, it could be the 
fault of B -- which somehow manages to hang a part of the hardware that 
then prevents A's job from finishing -- or it could be both; but it's a 
bit academic)
- AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context 
wasn't affected

This last value would currently just be discarded by Mesa (because we 
should only disturb applications when we have to), but perhaps somebody 
else could find it useful?

Cheers,
Nicolai