[PATCH] drm/sched: fix the duplicated TMO message for one IB

Thu May 9 13:26:28 UTC 2019

Am 09.05.19 um 14:49 schrieb Liu, Monk:
> Christian
>
> I believe even yourself would agree that keep reporting TMO for the same IB is ugly (need put a "gpu_recovery=0" as option ), I can also argue with you that this is a bad design ...

Well exactly that's what I disagree on. A second timeout on the same job 
is perfectly possible and desired, we just don't need to necessarily 
report it once more.

> Besides you didn't on technique persuade me to believe there will be  bad things upcoming with my patch, you even didn't give me a sequence to prove my patch is wrong.
>   
> Since you are the maintainer I would have no choice but keep my patch the customer branch given the case you won't review it, it's only used for AMDGPU virtualization customer  anyway

Well, I didn't say your patch is wrong. A NAK doesn't mean general 
rejection, but that you should start from the beginning and reiterate 
what you are trying to archive here. I mean what does a computer do when 
it receives the NAK? It retries the transmission.

But the general problem here is that you are quite often coming up with 
individual patches ripped out of the context without explaining the 
general intention.

> For the design of that feature, your feedback is welcomed:
>
> Backgrounds:
> Customer want sw and hw related information (pretty much infos we can collect by UMR) automatically dumped upon Any IB hang.

Yeah, in general that is a really good idea and discussed multiple times 
now.

> Design:
> - there would be an APP running in background as a daemon utility
> - in the beginning of this daemon it would write "register" to a debugfs file like : /sys/kernel/debug/dri/0/amdgpu_dump_register
> - kernel driver would remember the PID/task of this daemon upon it write "register" to that debugfs file

That is a rather bad idea because it doesn't handle killed daemon 
processes correctly. A better approach is let the daemon open up the 
virtual file and keep the connection open until the process exits.

> - since GIM driver would also detect GPU hang (which lead to a VF FLR), kernel driver should interacted with GIM to block GIM's FLR until the DUMP finished by that daemon.
> - upon an IB hang, job_timeout() would kick off and inside it KMD would send an signal "USR1" to the registered task,
> - upon "USR1" signal received by the daemon app, it would call "UMR" to dump whatever interested for the offline debugging

Signals are a bad idea for kernel->userspace communication, better use a 
pipe where data becomes available on a GPU reset.

> With current design if I pass "gpu_recover=0" to KMD there will be infinite job_timeout() triggered for one IB, thus the auto DUMP would also be crazy about its duplication

But that's not because of the scheduler but rather because your design 
is missing another important piece: How does the daemon process signals 
the kernel driver that it is done with dumping the state? E.g. How do 
you signal the GIM that it can do a FLR again somehow?

What you can do is to either block the timeout handler until the 
userspace process signals that it is done with dumping, or if that's not 
possible because of locks (etc) you could use the 
drm_sched_suspend_timeout() and drm_sched_resume_timeout() helpers which 
allow us to disable the scheduler timeout for a moment.

Regards,
Christian.

>
> /Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Thursday, May 9, 2019 7:45 PM
> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/sched: fix the duplicated TMO message for one IB
>
> Well putting the other drivers aside for moment, I would also say that it is bad design to not restart the timeout. So even then I would not review that patch.
>
> Question is rather what are you actually trying to do and why don't you want to change your design?
>
> Designs should be discussed on the public mailing list and not done internally.
>
> Regards,
> Christian.
>