<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
On 26.11.24 at 07:38, Raag Jadav wrote:<br>
<blockquote type="cite" cite="mid:Z0VtA5o2cW_snZbf@black.fi.intel.com">
<pre class="moz-quote-pre" wrap="">On Mon, Nov 25, 2024 at 10:32:42AM +0100, Christian König wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">On 22.11.24 at 17:02, Raag Jadav wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">On Fri, Nov 22, 2024 at 11:09:32AM +0100, Christian König wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">On 22.11.24 at 08:07, Raag Jadav wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">On Mon, Nov 18, 2024 at 08:26:37PM +0530, Aravind Iddamsetty wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">On 15/11/24 10:37, Raag Jadav wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Introduce device wedged event, which notifies userspace of 'wedged'
(hanged/unusable) state of the DRM device through a uevent. This is
useful especially in cases where the device is no longer operating as
expected and has become unrecoverable from driver context. Purpose of
this implementation is to provide drivers a generic way to recover with
the help of userspace intervention without taking any drastic measures
in the driver.
A 'wedged' device is basically a dead device that needs attention. The
uevent is the notification that is sent to userspace along with a hint
about what could possibly be attempted to recover the device and bring
it back to usable state. Different drivers may have different ideas of
a 'wedged' device depending on their hardware implementation, and hence
the vendor agnostic nature of the event. It is up to the drivers to
decide when they see the need for recovery and how they want to recover
from the available methods.
Prerequisites
-------------
The driver, before opting for recovery, needs to make sure that the
'wedged' device doesn't harm the system as a whole by taking care of the
prerequisites. Necessary actions must include disabling DMA to system
memory as well as any communication channels with other devices. Further,
the driver must ensure that all dma_fences are signalled and any device
state that the core kernel might depend on is cleaned up. Once the event
is sent, the device must be kept in 'wedged' state until the recovery is
performed. New accesses to the device (IOCTLs) should be blocked,
preferably with an error code that resembles the type of failure the
device has encountered. This will signify the reason for wedging, which
can be reported to the application if needed.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">should we even drop the mmaps we created?
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">Whatever is required for a clean recovery, yes.
Although how would this play out? Do we risk losing display?
Or any other possible side-effects?
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">Before sending a wedge event all DMA transfers of the device have to be
blocked.
So yes, all display, mmap() and file descriptor connections you had with the
device would need to be re-created.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">Does it mean we'd have to rely on userspace to unmap()?
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Yes and no :)
The handling should be similar to how hotplug is handled. E.g. the device
becomes inaccessible to normal applications and all mappings become invalid.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Isn't that just unbind (which is already part of recovery)?</pre>
</blockquote>
<br>
No, unbind just invalidates all mappings, but it doesn't catch any
page faults, which would validate them again.<br>
<br>
The driver or framework must make sure that page faults now get
redirected to a dummy page. See ttm_bo_vm_dummy_page() for how TTM
handles that for all drivers using it.<br>
<br>
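As a rough illustration of the behaviour described above, here is a
userspace toy model, not kernel code. Every name in it (ex_device,
ex_fault, ex_ioctl, EX_EIO) is hypothetical and is not part of DRM or
TTM; the error value is only a placeholder. It sketches the idea that,
once the device is wedged, faults resolve to a dummy page and new
IOCTLs fail with an error code:<br>
<pre>
```c
/* Userspace toy model only; every name here is hypothetical, not a DRM or TTM API. */
enum { EX_EIO = 5 };  /* placeholder error value, chosen for illustration */

enum ex_page { EX_REAL_PAGE, EX_DUMMY_PAGE };

struct ex_device {
    int wedged;  /* set once the wedged event has been sent */
};

/* Fault path: a wedged device never hands out real pages again. */
static enum ex_page ex_fault(const struct ex_device *dev)
{
    return dev->wedged ? EX_DUMMY_PAGE : EX_REAL_PAGE;
}

/* IOCTL path: new requests fail with an error instead of touching hardware. */
static int ex_ioctl(const struct ex_device *dev)
{
    return dev->wedged ? -EX_EIO : 0;
}
```
</pre>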
Not sure about i915; since it never deals with device memory, it can
potentially just keep the access to the allocated system memory
intact.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:Z0VtA5o2cW_snZbf@black.fi.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">But we don't send a SIGBUS or similar on access; instead all mappings are
redirected to a dummy page, which basically swallows all writes and gives
undefined data on reads.
On IOCTLs the applications should get an error code and eventually restart
or at least unmap all their mappings.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Thanks for the detailed explanation.
Thinking about this again, the criterion set for the prerequisites is to not do
anything that could possibly harm the system. So I think the important
question is,
with fences signalled and ioctls already blocked, is live mmap on a wedged
device capable of producing harmful behaviour or unintended side-effects
(at least until the application has the opportunity to unmap() as part of
recovery)?</pre>
</blockquote>
<br>
I think we are already rather good there.<br>
<br>
The potential options are to redirect everything to a dummy page or
to crash the application by sending a SIGBUS.<br>
<br>
Redirecting everything to the dummy page sounds like the more
defensive approach.<br>
<br>
Regards,<br>
Christian.<br>
<br>
<blockquote type="cite" cite="mid:Z0VtA5o2cW_snZbf@black.fi.intel.com">
<pre class="moz-quote-pre" wrap="">
Raag
</pre>
</blockquote>
<br>
</body>
</html>