TDR and VRAM lost handling in KMD (v2)
michel at daenzer.net
Thu Oct 12 09:35:42 UTC 2017
On 12/10/17 11:23 AM, Christian König wrote:
> On 12.10.2017 at 11:10, Nicolai Hähnle wrote:
>> On 12.10.2017 10:49, Christian König wrote:
>>>> However, !guilty && ctx->reset_counter != adev->reset_counter does
>>>> not imply that the context was lost.
>>>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET
>>>> if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>>> As far as I understand it, the case of !guilty && ctx->reset_counter
>>>> != adev->reset_counter && ctx->vram_lost_counter ==
>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a
>>>> GPU reset occurred, but it didn't affect our context.
>>> I disagree on that.
>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
>>> was a reset but we haven't been causing it.
>>> That the OpenGL extension is specified otherwise is unfortunate, but
>>> I think we shouldn't use that for the kernel interface here.
>> Two counterpoints:
>> 1. Why should any application care that there was a reset while it was
>> idle? The OpenGL behavior is what makes sense.
> The application is certainly not interested in whether a reset happened
> or not, but I thought that the driver stack might be.
>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
>> because we never return it :)
> Good point.
>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is
>> in line with the OpenGL spec: we're conservatively returning that a
>> reset happened because we don't know whether the context was affected,
>> and we return UNKNOWN because we also don't know whether the context
>> was guilty or not.
>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
>> refinement and improvement of the current, overly conservative behavior.
> OK, let's re-enumerate what I think the different return values should mean:
> * AMDGPU_CTX_GUILTY_RESET
> guilty is set to true for this context.
> * AMDGPU_CTX_INNOCENT_RESET
> guilty is not set and vram_lost_counter has changed.
> * AMDGPU_CTX_UNKNOWN_RESET
> guilty is not set and vram_lost_counter has not changed, but
> gpu_reset_counter has changed.
I don't understand the distinction you're proposing between
AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both
cases you're describing should return either AMDGPU_CTX_INNOCENT_RESET,
if the value of guilty is reliable, or AMDGPU_CTX_UNKNOWN_RESET if it's not.
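To make the difference concrete, here is a minimal sketch of the two mappings as I read them. It is not the amdgpu_ctx_query code: the enum, struct ctx_state, the guilty_reliable flag and both function names are hypothetical stand-ins, and returning "no reset" when neither counter changed is my assumption, since the list above doesn't cover that case.

```c
#include <stdbool.h>

/* Local stand-ins for the reset status codes discussed in this thread. */
enum ctx_reset_status {
    CTX_NO_RESET,
    CTX_GUILTY_RESET,
    CTX_INNOCENT_RESET,
    CTX_UNKNOWN_RESET,
};

/* Hypothetical snapshot of the per-context and per-device state. */
struct ctx_state {
    bool guilty;
    bool guilty_reliable;            /* assumption: can 'guilty' be trusted? */
    unsigned ctx_reset_counter;
    unsigned dev_reset_counter;
    unsigned ctx_vram_lost_counter;
    unsigned dev_vram_lost_counter;
};

/* The enumeration quoted above: INNOCENT vs. UNKNOWN depends on VRAM loss. */
static enum ctx_reset_status query_enumerated(const struct ctx_state *s)
{
    if (s->guilty)
        return CTX_GUILTY_RESET;
    if (s->ctx_vram_lost_counter != s->dev_vram_lost_counter)
        return CTX_INNOCENT_RESET;
    if (s->ctx_reset_counter != s->dev_reset_counter)
        return CTX_UNKNOWN_RESET;
    return CTX_NO_RESET;             /* assumption: nothing changed */
}

/* The alternative: INNOCENT vs. UNKNOWN depends only on trusting 'guilty'. */
static enum ctx_reset_status query_alternative(const struct ctx_state *s)
{
    bool vram_lost = s->ctx_vram_lost_counter != s->dev_vram_lost_counter;
    bool reset_seen = s->ctx_reset_counter != s->dev_reset_counter;

    if (s->guilty_reliable && s->guilty)
        return CTX_GUILTY_RESET;
    if (!vram_lost && !reset_seen)
        return CTX_NO_RESET;         /* assumption: nothing changed */
    return s->guilty_reliable ? CTX_INNOCENT_RESET : CTX_UNKNOWN_RESET;
}
```

The two functions only disagree for a non-guilty context that saw a reset without VRAM loss: the first reports UNKNOWN, the second reports INNOCENT as long as guilty can be trusted.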
Earthling Michel Dänzer | http://www.amd.com
Libre software enthusiast | Mesa and X developer