[RFC] CRIU support for ROCm

Adrian Reber adrian at lisas.de
Tue May 4 12:32:37 UTC 2021


On Mon, May 03, 2021 at 02:21:53PM -0400, Felix Kuehling wrote:
> On 2021-05-01 at 1:03 p.m., Adrian Reber wrote:
> > On Fri, Apr 30, 2021 at 09:57:45PM -0400, Felix Kuehling wrote:
> >> We have been working on a prototype supporting CRIU (Checkpoint/Restore
> >> In Userspace) for accelerated compute applications running on AMD GPUs
> >> using ROCm (Radeon Open Compute Platform). We're happy to finally share
> >> this work publicly to solicit feedback and advice. The end-goal is to
> >> get this work included upstream in Linux and CRIU. A short whitepaper
> >> describing our design and intention can be found on Github:
> >> https://github.com/RadeonOpenCompute/criu/tree/criu-dev/test/others/ext-kfd/README.md
> >>
> >> We have RFC patch series for the kernel (based on Alex Deucher's
> >> amd-staging-drm-next branch) and for CRIU including a new plugin and a
> >> few core CRIU changes. I will send those to the respective mailing lists
> >> separately in a minute. They can also be found on Github.
> >>
> >>     CRIU+plugin: https://github.com/RadeonOpenCompute/criu/commits/criu-dev
> >>     Kernel (KFD):
> >>     https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/commits/fxkamd/criu-wip
> >>
> >> At this point this is very much a work in progress and not ready for
> >> upstream inclusion. There are still several missing features, known
> >> issues, and open questions that we would like to start addressing with
> >> your feedback.
> >>
> >> What's working and tested at this point:
> >>
> >>   * Checkpoint and restore accelerated machine learning apps: PyTorch
> >>     running BERT on systems with 1 or 2 GPUs (MI50 or MI100), 100%
> >>     unmodified user mode stack
> >>   * Checkpoint on one system, restore on a different system
> >>   * Checkpoint on one GPU, restore on a different GPU
> > This is very impressive. As far as I know this is the first larger
> > plugin written for CRIU and publicly published. It is also the first
> > GPU supported, and people have been asking for this for many years. It
> > is in fact the first hardware device supported through a plugin.
> >
> >> Major Known issues:
> >>
> >>   * The KFD ioctl API is not final: Needs a complete redesign to allow
> >>     future extension without breaking the ABI
> >>   * Very slow: Need to implement DMA to dump VRAM contents
> >>
> >> Missing or incomplete features:
> >>
> >>   * Support for the new KFD SVM API
> >>   * Check device topology during restore
> >>   * Checkpoint and restore multiple processes
> >>   * Support for applications using Mesa for video decode/encode
> >>   * Testing with a wider range of GPUs and workloads
> >>
> >> Big Open questions:
> >>
> >>   * What's the preferred way to publish our CRIU plugin? In-tree or
> >>     out-of-tree?
> > I would do it in-tree.
> >
> >>   * What's the preferred way to distribute our CRIU plugin? Source?
> >>     Binary .so? Whole CRIU? Just in-box support?
> > As you are planning to publish the source I would make it part of the
> > CRIU repository and this way it will find its way to the packages in the
> > different distributions.
> 
> Thanks. These are the answers I was hoping for.
> 
> 
> >
> > Does the plugin require any additional dependencies? If there is no
> > additional dependency on a library, the plugin can easily be part of
> > the existing packages.
> 
> The DMA solution we're considering for saving VRAM contents would add a
> dependency on libdrm and libdrm-amdgpu.

For the CRIU packages I am maintaining I would probably put the plugin
in a sub-package so that not all users of the CRIU package have to
install the mentioned libraries.

> >>   * If our plugin can be upstreamed in the CRIU tree, what would be the
> >>     right directory?
> > I would just put it into criu/plugins/
> 
> Sounds good.
> 
> >
> > It would also be good to have your patchset submitted as a PR on github
> > to have our normal CI test coverage of the changes.
> 
> We'll probably have to recreate our repository to start as a fork of the
> upstream CRIU repository, so that we can easily send pull-requests.
> We're not going to be ready for upstreaming for a few more months,
> probably. Do you want to get occasional pull requests anyway, just to
> run CI on our work-in-progress code?

If you run it early through our CI it might make it easier for you to
see what it might break. Also, if your patches include fixes which are
not directly related to your plugin, it might make sense to submit those
patches earlier to reduce the size of the final patch. But this is up to
you.

		Adrian


More information about the amd-gfx mailing list