TDR and VRAM lost handling in KMD (v2)

Michel Dänzer michel at daenzer.net
Thu Oct 12 09:35:42 UTC 2017


On 12/10/17 11:23 AM, Christian König wrote:
> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:
>> On 12.10.2017 10:49, Christian König wrote:
>>>> However, !guilty && ctx->reset_counter != adev->reset_counter does
>>>> not imply that the context was lost.
>>>>
>>>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET
>>>> if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>>>
>>>> As far as I understand it, the case of !guilty && ctx->reset_counter
>>>> != adev->reset_counter && ctx->vram_lost_counter ==
>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a
>>>> GPU reset occurred, but it didn't affect our context.
>>> I disagree on that.
>>>
>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
>>> was a reset but we haven't been causing it.
>>>
>>> That the OpenGL extension is specified otherwise is unfortunate, but
>>> I think we shouldn't use that for the kernel interface here.
>> Two counterpoints:
>>
>> 1. Why should any application care that there was a reset while it was
>> idle? The OpenGL behavior is what makes sense.
> 
> The application is certainly not interested in whether a reset happened
> or not, but I thought that the driver stack might be.
> 
>>
>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
>> because we never return it :)
>>
> 
> Good point.
> 
>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is
>> in line with the OpenGL spec: we're conservatively returning that a
>> reset happened because we don't know whether the context was affected,
>> and we return UNKNOWN because we also don't know whether the context
>> was guilty or not.
>>
>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
>> refinement and improvement of the current, overly conservative behavior.
> 
> OK, let's re-enumerate what I think the different return values should mean:
> 
> * AMDGPU_CTX_GUILTY_RESET
> 
> guilty is set to true for this context.
> 
> * AMDGPU_CTX_INNOCENT_RESET
> 
> guilty is not set and vram_lost_counter has changed.
> 
> * AMDGPU_CTX_UNKNOWN_RESET
> 
> guilty is not set and vram_lost_counter has not changed, but
> gpu_reset_counter has changed.

I don't understand the distinction you're proposing between
AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both
cases you're describing should return either AMDGPU_CTX_INNOCENT_RESET,
if the value of guilty is reliable, or AMDGPU_CTX_UNKNOWN_RESET if it's not.
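
To make the difference concrete, here is a rough, standalone sketch of the
two mappings. It is not the amdgpu code: the sketch_* names, the enum values
and the guilty_reliable flag are all made up for illustration, and the
counter comparisons are reduced to two booleans. Returning "no reset" when
neither counter changed, and "guilty" whenever guilty is set, are assumptions
that the discussion above does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the AMDGPU_CTX_*_RESET values (hypothetical names). */
enum sketch_reset_status {
	SKETCH_NO_RESET,
	SKETCH_GUILTY_RESET,
	SKETCH_INNOCENT_RESET,
	SKETCH_UNKNOWN_RESET,
};

/* Hypothetical per-context snapshot of the state being discussed. */
struct sketch_ctx_state {
	bool guilty;         /* context was blamed for a hang */
	bool reset_seen;     /* ctx->reset_counter != adev->reset_counter */
	bool vram_lost_seen; /* ctx->vram_lost_counter != adev->vram_lost_counter */
};

/* Mapping as enumerated in the quoted proposal. */
enum sketch_reset_status sketch_query_quoted(const struct sketch_ctx_state *s)
{
	if (s->guilty)
		return SKETCH_GUILTY_RESET;
	if (s->vram_lost_seen)
		return SKETCH_INNOCENT_RESET;
	if (s->reset_seen)
		return SKETCH_UNKNOWN_RESET;
	return SKETCH_NO_RESET; /* assumption: nothing changed */
}

/*
 * Mapping as suggested in the reply: both non-guilty cases get the same
 * answer, and which answer depends only on whether guilty can be trusted.
 */
enum sketch_reset_status sketch_query_reply(const struct sketch_ctx_state *s,
					    bool guilty_reliable)
{
	if (s->guilty)
		return SKETCH_GUILTY_RESET; /* assumption: same as quoted proposal */
	if (s->vram_lost_seen || s->reset_seen)
		return guilty_reliable ? SKETCH_INNOCENT_RESET
				       : SKETCH_UNKNOWN_RESET;
	return SKETCH_NO_RESET; /* assumption: nothing changed */
}

/* Usage example: a reset happened while this context was idle, VRAM kept. */
void sketch_self_check(void)
{
	struct sketch_ctx_state idle = {
		.guilty = false, .reset_seen = true, .vram_lost_seen = false,
	};

	assert(sketch_query_quoted(&idle) == SKETCH_UNKNOWN_RESET);
	assert(sketch_query_reply(&idle, true) == SKETCH_INNOCENT_RESET);
	assert(sketch_query_reply(&idle, false) == SKETCH_UNKNOWN_RESET);
}
```

The only place the two functions disagree is a non-guilty context: the first
picks INNOCENT or UNKNOWN based on whether VRAM was lost, the second based on
whether the guilty flag is trustworthy.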


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer

