TDR and VRAM lost handling in KMD:
Nicolai Hähnle
nicolai.haehnle at amd.com
Wed Oct 11 09:16:57 UTC 2017
On 11.10.2017 11:02, Christian König wrote:
>> 1. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>> their fence status to “*ECANCELED*”
>>
>> Setting ECANCELED should be ok. But I think we should do this when we
>> try to run the jobs and not during GPU reset.
>>
>> [ML] Without deep thought and experiment, I’m not sure of the difference
>> between them, but kicking them out in the gpu_reset routine is more efficient.
>>
> I really don't think so. Kicking them out during gpu_reset sounds racy
> to me once more.
>
> And marking them canceled when we try to run them has the clear
> advantage that all dependencies are met first.

This makes sense to me as well.

It raises a vaguely related question: What happens to jobs whose
dependencies were canceled? I believe we currently don't check those
errors, so we might execute them anyway if their contexts were
unaffected by the reset. There's a risk that the job will hang due to
stale data.

I don't think it's a huge risk in practice today because we don't have a
lot of buffer sharing between applications, but it's something to think
through at some point. In a way, canceling out of an abundance of
caution may be a bad idea because it could kill a compositor's task by
being overly conservative.

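
To make the trade-off concrete, here is a minimal sketch in C of the run-time
check discussed above. It is not the amdgpu or scheduler code: every type and
function name (sketch_fence, sketch_ctx, sketch_job, sketch_prepare_run,
dep_policy) is hypothetical and stands in for the real structures. It assumes
the check runs when the scheduler picks a job, i.e. after its dependencies
have signaled, and it shows both possible policies for dependency errors.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real kernel types. */
struct sketch_fence {
    int error;       /* 0 on success, negative errno such as -ECANCELED */
    bool signaled;
};

struct sketch_ctx {
    bool guilty;     /* assumed to be set by the reset path for the hung ctx */
};

struct sketch_job {
    struct sketch_ctx *ctx;
    struct sketch_fence **deps;  /* fences this job waits on */
    size_t num_deps;
    struct sketch_fence done;    /* fence signaled when the job finishes */
};

/* Policy for jobs from unaffected contexts whose dependencies failed. */
enum dep_policy {
    DEP_IGNORE_ERRORS,     /* run anyway (what the mail believes happens today) */
    DEP_PROPAGATE_ERRORS   /* cancel the dependent job as well */
};

/* Return the first error found among the job's dependencies, or 0. */
static int sketch_first_dep_error(const struct sketch_job *job)
{
    for (size_t i = 0; i < job->num_deps; i++) {
        if (job->deps[i]->error)
            return job->deps[i]->error;
    }
    return 0;
}

/*
 * Decide at run time whether the job may go to the hardware.
 * Returns true if it should be submitted; otherwise the job's fence is
 * completed with an error and the job is skipped.
 */
static bool sketch_prepare_run(struct sketch_job *job, enum dep_policy policy)
{
    int err = 0;

    if (job->ctx->guilty)
        err = -ECANCELED;
    else if (policy == DEP_PROPAGATE_ERRORS)
        err = sketch_first_dep_error(job);

    if (err) {
        job->done.error = err;
        job->done.signaled = true;
        return false;
    }
    return true;
}
```

With DEP_PROPAGATE_ERRORS, a compositor job that merely consumes a buffer
from a guilty context would be canceled too, which is the overly conservative
outcome mentioned above. With DEP_IGNORE_ERRORS it runs, possibly on stale
data.
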
Cheers,
Nicolai