TDR and VRAM lost handling in KMD:

Wed Oct 11 09:16:57 UTC 2017

On 11.10.2017 11:02, Christian König wrote:
>> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all 
>> their fence status to “*ECANCELED*”
>>
>> Setting ECANCELED should be ok. But I think we should do this when we 
>> try to run the jobs and not during GPU reset.
>>
>> [ML] Without deeper thought and experiments, I’m not sure of the 
>> difference between them, but kicking them out in the gpu_reset 
>> routine is more efficient,
>>
> I really don't think so. Kicking them out during gpu_reset sounds racy 
> to me once more.
> 
> And marking them canceled when we try to run them has the clear 
> advantage that all dependencies are met first.

This makes sense to me as well.

It raises a vaguely related question: What happens to jobs whose 
dependencies were canceled? I believe we currently don't check those 
errors, so we might execute them anyway if their contexts were 
unaffected by the reset. There's a risk that the job will hang due to 
stale data.

I don't think it's a huge risk in practice today because we don't have a 
lot of buffer sharing between applications, but it's something to think 
through at some point. In a way, canceling such dependent jobs out of an 
abundance of caution may be a bad idea, because being overly 
conservative could kill a compositor's task.
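
To make the two options concrete, here is a rough sketch of cancel-at-run-time 
with an optional check of dependency errors. Everything in it is hypothetical: 
the sketch_* types and functions are simplified stand-ins I made up for 
illustration, not the actual amdgpu or scheduler structures, and the policy 
switch only exists to show the trade-off.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real driver types. */
struct sketch_fence {
    bool signaled;
    int error;              /* 0, or a negative errno such as -ECANCELED */
};

struct sketch_ctx {
    bool guilty;            /* set by the reset path for the hanging ctx */
};

#define SKETCH_MAX_DEPS 4

struct sketch_job {
    struct sketch_ctx *ctx;
    struct sketch_fence *deps[SKETCH_MAX_DEPS];
    size_t num_deps;
    struct sketch_fence done;   /* fence signaled when the job finishes */
};

enum sketch_dep_policy {
    SKETCH_DEP_IGNORE_ERRORS,     /* run even if a dependency was canceled */
    SKETCH_DEP_PROPAGATE_ERRORS,  /* cancel if any dependency carries an error */
};

static bool sketch_deps_signaled(const struct sketch_job *job)
{
    for (size_t i = 0; i < job->num_deps; i++)
        if (!job->deps[i]->signaled)
            return false;
    return true;
}

static int sketch_first_dep_error(const struct sketch_job *job)
{
    for (size_t i = 0; i < job->num_deps; i++)
        if (job->deps[i]->error)
            return job->deps[i]->error;
    return 0;
}

/*
 * Called when the job reaches the head of its queue.
 * Returns 1 if the job would be handed to the hardware, 0 if it was
 * canceled instead, and -EAGAIN if its dependencies are still pending.
 */
static int sketch_try_run_job(struct sketch_job *job,
                              enum sketch_dep_policy policy)
{
    int err = 0;

    assert(job && job->ctx);

    /* Cancel only after all dependencies are met, never earlier. */
    if (!sketch_deps_signaled(job))
        return -EAGAIN;

    if (job->ctx->guilty)
        err = -ECANCELED;
    else if (policy == SKETCH_DEP_PROPAGATE_ERRORS)
        err = sketch_first_dep_error(job);

    if (err) {
        /* Record the error before signaling so waiters can see it. */
        job->done.error = err;
        job->done.signaled = true;
        return 0;
    }

    /* Hardware submission would happen here; done is signaled later. */
    return 1;
}
```

With SKETCH_DEP_IGNORE_ERRORS, a job from an unaffected context still runs 
after a canceled dependency, which is what I believe happens today. With 
SKETCH_DEP_PROPAGATE_ERRORS the cancellation cascades, which is the overly 
conservative case that could take down a compositor's job.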

Cheers,
Nicolai