[PATCH] drm/sched: fix the duplicated TMO message for one IB
ckoenig.leichtzumerken at gmail.com
Thu May 9 13:26:28 UTC 2019
Am 09.05.19 um 14:49 schrieb Liu, Monk:
> I believe even yourself would agree that keep reporting TMO for the same IB is ugly (need put a "gpu_recovery=0" as option ), I can also argue with you that this is a bad design ...
Well exactly that's what I disagree on. A second timeout on the same job
is perfectly possible and desired, we just don't need to necessarily
report it once more.
> Besides you didn't on technique persuade me to believe there will be bad things upcoming with my patch, you even didn't give me a sequence to prove my patch is wrong.
> Since you are the maintainer I would have no choice but keep my patch the customer branch given the case you won't review it, it's only used for AMDGPU virtualization customer anyway
Well, I didn't say your patch is wrong. A NAK doesn't mean general
rejection, but that you should start from the beginning and reiterate
what you are trying to archive here. I mean what does a computer do when
it receives the NAK? It retries the transmission.
But the general problem here is that you are quite often coming up with
individual patches ripped out of the context without explaining the
> For the design of that feature, your feedback is welcomed:
> Customer want sw and hw related information (pretty much infos we can collect by UMR) automatically dumped upon Any IB hang.
Yeah, in general that is a really good idea and discussed multiple times
> - there would be an APP running in background as a daemon utility
> - in the beginning of this daemon it would write "register" to a debugfs file like : /sys/kernel/debug/dri/0/amdgpu_dump_register
> - kernel driver would remember the PID/task of this daemon upon it write "register" to that debugfs file
That is a rather bad idea because it doesn't handle killed daemon
processes correctly. A better approach is let the daemon open up the
virtual file and keep the connection open until the process exits.
> - since GIM driver would also detect GPU hang (which lead to a VF FLR), kernel driver should interacted with GIM to block GIM's FLR until the DUMP finished by that daemon.
> - upon an IB hang, job_timeout() would kick off and inside it KMD would send an signal "USR1" to the registered task,
> - upon "USR1" signal received by the daemon app, it would call "UMR" to dump whatever interested for the offline debugging
Signals are a bad idea for kernel->userspace communication, better use a
pipe where data becomes available on a GPU reset.
> With current design if I pass "gpu_recover=0" to KMD there will be infinite job_timeout() triggered for one IB, thus the auto DUMP would also be crazy about its duplication
But that's not because of the scheduler but rather because your design
is missing another important piece: How does the daemon process signals
the kernel driver that it is done with dumping the state? E.g. How do
you signal the GIM that it can do a FLR again somehow?
What you can do is to either block the timeout handler until the
userspace process signals that it is done with dumping, or if that's not
possible because of locks (etc) you could use the
drm_sched_suspend_timeout() and drm_sched_resume_timeout() helpers which
allow us to disable the scheduler timeout for a moment.
> -----Original Message-----
> From: Koenig, Christian
> Sent: Thursday, May 9, 2019 7:45 PM
> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/sched: fix the duplicated TMO message for one IB
> Well putting the other drivers aside for moment, I would also say that it is bad design to not restart the timeout. So even then I would not review that patch.
> Question is rather what are you actually trying to do and why don't you want to change your design?
> Designs should be discussed on the public mailing list and not done internally.
More information about the amd-gfx