<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
Executive Summary: We need to add CRIU support to DRM render nodes
in order to maintain CRIU support for ROCm applications once they
start relying on render nodes for more GPU memory management. In
this email I'm providing some background on why we are doing this, and
outlining some of the problems we need to solve to checkpoint and
restore render node state and shared memory (DMABuf) state. I have
some thoughts on the API design, leaning on what we did for KFD, but
would like to get feedback from the DRI community regarding that API
and to what extent there is interest in making that generic.<br>
<p>We are working on using DRM render nodes for virtual address
mappings in ROCm applications to implement the CUDA11-style VM API
and improve interoperability between graphics and compute. This
uses DMABufs for sharing buffer objects between KFD and multiple
render node devices, as well as between processes. In the long run
this also provides a path to moving all or most memory management
from the KFD ioctl API to libdrm.</p>
<p>Once ROCm user mode starts using render nodes for virtual address
management, that creates a problem for checkpointing and restoring
ROCm applications with CRIU. Currently there is no support for
checkpointing and restoring render node state, other than CPU
virtual address mappings. Support will be needed for checkpointing
GEM buffer objects and handles, their GPU virtual address mappings
and memory sharing relationships between devices and processes.</p>
<p>Eventually, if full CRIU support for graphics applications is
desired, more state would need to be captured, including scheduler
contexts and BO lists. Most of this state is driver-specific.</p>
<p>After some internal discussions we decided to take our design
process public as this potentially touches DRM GEM and DMABuf APIs
and may have implications for other drivers in the future.</p>
<p>One basic question before going into any API details: Is there a
desire to have CRIU support for other DRM drivers?</p>
<p>With that out of the way, some considerations for a possible DRM
CRIU API (either generic or AMDGPU driver-specific): The API goes
through several phases during checkpoint and restore:</p>
<p>Checkpoint:</p>
<ol>
<li>Process-info (enumerates objects and sizes so user mode can
allocate memory for the checkpoint, stops execution on the GPU)</li>
<li>Checkpoint (store object metadata for BOs, queues, etc.)</li>
<li>Unpause (resumes execution after the checkpoint is complete)</li>
</ol>
<p>Restore:</p>
<ol>
<li>Restore (restore objects, VMAs are not in the right place at
this time)</li>
<li>Resume (final fixups after the VMAs are sorted out, resume
execution)</li>
</ol>
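<p>To make the ordering constraints explicit, the phases above can be
modeled as a small state machine. This is only a sketch: the state
names, the op strings and the FRESH starting state for a restoring
process are hypothetical labels for illustration, not part of any
existing ioctl.</p>
<pre>
```python
from enum import Enum, auto


class Phase(Enum):
    RUNNING = auto()       # normal execution
    PAUSED = auto()        # after process-info: GPU execution stopped
    CHECKPOINTED = auto()  # after checkpoint: object metadata stored
    FRESH = auto()         # hypothetical: restoring process, nothing restored
    RESTORED = auto()      # after restore: objects exist, VMAs not final


# Hypothetical op names mirroring the phases listed above.
TRANSITIONS = {
    (Phase.RUNNING, "process_info"): Phase.PAUSED,
    (Phase.PAUSED, "checkpoint"): Phase.CHECKPOINTED,
    (Phase.CHECKPOINTED, "unpause"): Phase.RUNNING,
    (Phase.FRESH, "restore"): Phase.RESTORED,
    (Phase.RESTORED, "resume"): Phase.RUNNING,
}


class PhaseError(Exception):
    """Raised when an op is issued in a phase that does not allow it."""


def advance(phase, op):
    """Return the phase that follows `phase` after `op`, or raise."""
    try:
        return TRANSITIONS[(phase, op)]
    except KeyError:
        raise PhaseError(f"{op!r} is not valid in phase {phase.name}") from None


def run(start, ops):
    """Apply a sequence of ops and return the final phase."""
    phase = start
    for op in ops:
        phase = advance(phase, op)
    return phase
```
</pre>
<p>In this model both the checkpoint sequence and the restore sequence
end in RUNNING, and issuing an op out of order (for example a
checkpoint before process-info) is rejected.</p>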
<p>For some more background about our implementation in KFD, you can
refer to this whitepaper:
<a class="moz-txt-link-freetext" href="https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md" moz-do-not-send="true">
https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md</a><br>
</p>
<p>There are some potential objections to a KFD-style CRIU API in DRM
render nodes. I'll address each of them in more detail below:</p>
<ul>
<li>Opaque information in the checkpoint data that user mode can't
interpret or do anything with</li>
<li>A second API for creating objects (e.g. BOs) that is separate
from the regular BO creation API</li>
<li>Kernel mode would need to be involved in restoring BO sharing
relationships rather than replaying BO creation, export and
import from user mode<br>
</li>
</ul>
<p># Opaque information in the checkpoint<br>
</p>
<p>This comes out of ABI compatibility considerations. Adding any
new objects or attributes to the driver/HW state that needs to be
checkpointed could potentially break the ABI of the CRIU
checkpoint/restore ioctl if the plugin needs to parse that
information. Therefore, much of the information in our KFD CRIU
ioctl API is opaque. It is written by kernel mode in the
checkpoint, it is consumed by kernel mode when restoring the
checkpoint, but user mode doesn't care about the contents or
binary layout, so there is no user mode ABI to break. This is how
we were able to maintain CRIU support when we added the SVM API to
KFD without changing the CRIU plugin and without breaking our ABI.</p>
<p>Opaque information may also lend itself to API abstraction, if
this becomes a generic DRM API with driver-specific callbacks that
fill in HW-specific opaque data.<br>
</p>
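<p>As a rough illustration of what "opaque" means for user mode, the
sketch below stores and reloads per-object blobs without ever looking
inside them. The record layout (an object type and a payload size in
front of each blob) is a made-up example, not the format used by the
KFD plugin.</p>
<pre>
```python
import io
import struct

# Hypothetical record header: object type and payload size, network order.
HEADER = struct.Struct("!II")


def save_records(stream, records):
    """Write (obj_type, opaque_bytes) pairs without parsing the bytes."""
    for obj_type, blob in records:
        stream.write(HEADER.pack(obj_type, len(blob)))
        stream.write(blob)


def load_records(stream):
    """Read the pairs back so they can be handed to the kernel unchanged."""
    records = []
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return records
        if len(header) != HEADER.size:
            raise ValueError("truncated record header")
        obj_type, size = HEADER.unpack(header)
        blob = stream.read(size)
        if len(blob) != size:
            raise ValueError("truncated record payload")
        records.append((obj_type, blob))
```
</pre>
<p>Because only the size is interpreted, a kernel that writes a larger
or differently laid out blob for the same object type does not require
any change to this code.</p>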
<p># Second API for creating objects</p>
<p>Creating BOs and other objects when restoring a checkpoint needs
more information than the usual BO alloc and similar APIs provide.
For example, we need to restore BOs with the same GEM handles so
that user mode can continue using those handles after resuming
execution. If BOs are shared through DMABufs without dynamic
attachment, we need to restore pinned BOs as pinned. Validation of
virtual addresses and handling of MMU notifiers must be suspended
until the virtual address space is restored. For user mode queues
we need to save and restore a lot of queue execution state so that
execution can resume cleanly.<br>
</p>
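<p>The handle requirement alone shows why the regular creation path is
not enough. Below is a toy model of a per-file handle table, assuming a
regular path that hands out the lowest free handle and a separate
restore path where the caller dictates the handle. The class and method
names are hypothetical and this is not how the DRM core is structured.</p>
<pre>
```python
class HandleTable:
    """Toy per-file handle table (hypothetical, not the DRM implementation)."""

    def __init__(self):
        self._objects = {}

    def create(self, obj):
        """Regular path: hand out the lowest free handle, starting at 1."""
        handle = 1
        while handle in self._objects:
            handle += 1
        self._objects[handle] = obj
        return handle

    def restore(self, handle, obj):
        """Restore path: reuse the handle recorded in the checkpoint."""
        if handle in self._objects:
            raise ValueError(f"handle {handle} is already in use")
        self._objects[handle] = obj
        return handle

    def lookup(self, handle):
        """Return the object behind a handle."""
        return self._objects[handle]
```
</pre>
<p>If the checkpointed process held only handles 7 and 3, replaying
create() twice would yield 1 and 2 instead, and user mode would resume
with stale handles.</p>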
<p># Restoring buffer sharing relationships</p>
<p>Different GEM handles in different render nodes and processes can
refer to the same underlying shared memory, either by directly
pointing to the same GEM object, or by creating an import
attachment that may get its SG tables invalidated and updated
dynamically through dynamic attachment callbacks. In the latter
case it's obvious who is the exporter and who is the importer. In
the former case, either one could be the exporter, and it's not
clear who would need to create the BO and who would need to import
it when restoring the checkpoint. To further complicate things,
multiple processes in a checkpoint get restored concurrently. So
there is no guarantee that an exporter has restored a shared BO at
the time an importer is trying to restore its import.</p>
<p>A proposal to deal with these problems would be to treat
importers and exporters the same. Whoever restores first ends up
creating the BO and potentially attaching to it. The other
process(es) can find BOs that were already restored by another
process by looking them up with a unique ID that could be based on
the DMABuf inode number. An alternative would be a two-pass
approach that restores BOs in two passes:</p>
<ol>
<li>Restore exported BOs</li>
<li>Restore imports</li>
</ol>
<p>This would need some inter-process synchronization in CRIU itself
between these two passes, which may require changes in core CRIU,
outside our plugin. Both approaches depend on identifying BOs with
some unique ID that could be based on the DMABuf inode number in
the checkpoint. However, we would need to identify the processes
in the same restore session, possibly based on parent/child
process relationships, to create a scope where those IDs are valid
during restore.<br>
</p>
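<p>The first proposal (whoever restores first creates the BO) could be
sketched roughly as follows. Threads stand in for the concurrently
restoring processes, and the RestoreSession class and its restore_bo()
method are hypothetical names. How such a session would actually be
scoped and shared between processes is exactly the open question
above.</p>
<pre>
```python
import threading


class RestoreSession:
    """Hypothetical scope in which checkpointed BO IDs are valid."""

    def __init__(self):
        self._lock = threading.Lock()
        self._bos = {}

    def restore_bo(self, unique_id, create):
        """Return (bo, created); only the first caller per ID runs create()."""
        with self._lock:
            if unique_id in self._bos:
                return self._bos[unique_id], False
            bo = create()
            self._bos[unique_id] = bo
            return bo, True
```
</pre>
<p>Importers and exporters call the same function. The caller that
gets created == True is the one that has to allocate the memory, and
everyone else attaches to the existing object, regardless of restore
order.</p>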
<p>Finally, we would also need to checkpoint and restore DMABuf file
descriptors themselves. These are anonymous file descriptors. The
CRIU plugin could probably be taught to recreate them from the
original exported BO based on the inode number that could be
queried with fstat in the checkpoint. It would need help from the
render node CRIU API to find the right BO from the inode, which
may be from a different process in the same restore session.<br>
</p>
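<p>For the checkpoint side, here is a small sketch of how a plugin
could group file descriptors that refer to the same underlying file by
using fstat. Including st_dev in the key is my assumption to make the
key unambiguous for arbitrary files; for DMABufs the inode number alone
may be enough. The helper names are hypothetical.</p>
<pre>
```python
import os


def file_identity(fd):
    """Return (device, inode) of the file behind fd, as reported by fstat."""
    st = os.fstat(fd)
    return (st.st_dev, st.st_ino)


def group_fds_by_file(fds):
    """Map each file identity to the list of descriptors that refer to it."""
    groups = {}
    for fd in fds:
        groups.setdefault(file_identity(fd), []).append(fd)
    return groups
```
</pre>
<p>Descriptors that were duplicated or passed between processes end up
in the same group, which is the information needed to recreate them
from a single restored BO.</p>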
<p>Regards,<br>
Felix<br>
</p>
</body>
</html>