<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
On 16.12.24 at 14:36, Lazar, Lijo wrote:<br>
<blockquote type="cite" cite="mid:6028b434-2be7-453a-9be8-bf2e85c0756f@amd.com"><span style="white-space: pre-wrap">
</span>
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">I had asked earlier about the utility of this one here. If this is just
to inform userspace that the driver has done a reset and recovered, it
would need some additional context as well. We have a mechanism in KFD
which sends the context in which a reset has to be done. Currently,
that's restricted to compute applications, but if this is along similar
lines, we would like to pass some additional info like job timeout, RAS
error etc.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
DRM_WEDGE_RECOVERY_NONE is to inform userspace that the driver has done a
reset and recovered, but additional data, like which job timed out, the
RAS error and such, belongs to devcoredump I guess, where all the data is
gathered and collected later.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I think somebody else also mentioned that the source of the issue,
e.g. the PID of the submitting process, would be helpful for
supervising daemons which need to restart processes when they
caused some issue.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
It was me :) We do have a use case where the daemon needs the PID, but
the daemon doesn't need to know what the RAS error is or the name of the
job that timed out. There's no immediate action to be taken with this
information, contrary to the PID, which we need to know.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Regarding devcoredump: it's not done every time. For example, RAS errors
have a different way to identify the source of the error, hence we don't
need a coredump in such cases.
The intention is only to let the user know the reason for the reset at a
high level, and probably add more things later, like the engines or
queues that have been reset etc.</pre>
</blockquote>
<br>
Well, what is the use case for that? That doesn't look valuable to
me.<br>
<br>
RAS errors should generally be reported to the application that
issued the submission.<br>
<br>
As a system-wide event, they are only useful for things like log files,
I think.<br>
<br>
Regards,<br>
Christian.<br>
<br>
<blockquote type="cite" cite="mid:6028b434-2be7-453a-9be8-bf2e85c0756f@amd.com">
<pre class="moz-quote-pre" wrap="">
Thanks,
Lijo
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">We just postponed adding that till later.
Regards,
Christian.
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Thanks,
Lijo
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Regards,
Christian.
</pre>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
<br>
</body>
</html>