[RFC] CRIU support for ROCm

Felix Kuehling felix.kuehling at amd.com
Mon May 3 18:21:53 UTC 2021


On 2021-05-01 at 1:03 p.m., Adrian Reber wrote:
> On Fri, Apr 30, 2021 at 09:57:45PM -0400, Felix Kuehling wrote:
>> We have been working on a prototype supporting CRIU (Checkpoint/Restore
>> In Userspace) for accelerated compute applications running on AMD GPUs
>> using ROCm (Radeon Open Compute Platform). We're happy to finally share
>> this work publicly to solicit feedback and advice. The end goal is to
>> get this work included upstream in Linux and CRIU. A short whitepaper
>> describing our design and intention can be found on GitHub:
>> https://github.com/RadeonOpenCompute/criu/tree/criu-dev/test/others/ext-kfd/README.md
>>
>> We have RFC patch series for the kernel (based on Alex Deucher's
>> amd-staging-drm-next branch) and for CRIU including a new plugin and a
>> few core CRIU changes. I will send those to the respective mailing lists
>> separately in a minute. They can also be found on GitHub.
>>
>>     CRIU+plugin: https://github.com/RadeonOpenCompute/criu/commits/criu-dev
>>     Kernel (KFD):
>>     https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/commits/fxkamd/criu-wip
>>
>> At this point this is very much a work in progress and not ready for
>> upstream inclusion. There are still several missing features, known
>> issues, and open questions that we would like to start addressing with
>> your feedback.
>>
>> What's working and tested at this point:
>>
>>   * Checkpoint and restore accelerated machine learning apps: PyTorch
>>     running Bert on systems with 1 or 2 GPUs (MI50 or MI100), 100%
>>     unmodified user mode stack
>>   * Checkpoint on one system, restore on a different system
>>   * Checkpoint on one GPU, restore on a different GPU
> This is very impressive. As far as I know this is the first larger
> plugin written for CRIU and publicly published. It is also the first GPU
> supported and people have been asking for this for many years. It is in fact
> the first hardware device supported through a plugin.
>
>> Major Known issues:
>>
>>   * The KFD ioctl API is not final: Needs a complete redesign to allow
>>     future extension without breaking the ABI
>>   * Very slow: Need to implement DMA to dump VRAM contents
>>
>> Missing or incomplete features:
>>
>>   * Support for the new KFD SVM API
>>   * Check device topology during restore
>>   * Checkpoint and restore multiple processes
>>   * Support for applications using Mesa for video decode/encode
>>   * Testing with a wider range of GPUs and workloads
>>
>> Big Open questions:
>>
>>   * What's the preferred way to publish our CRIU plugin? In-tree or
>>     out-of-tree?
> I would do it in-tree.
>
>>   * What's the preferred way to distribute our CRIU plugin? Source?
>>     Binary .so? Whole CRIU? Just in-box support?
> As you are planning to publish the source I would make it part of the
> CRIU repository and this way it will find its way to the packages in the
> different distributions.

Thanks. These are the answers I was hoping for.


>
> Does the plugin require any additional dependencies? If there is no
> additional dependency on a library, the plugin can easily be part of
> the existing packages.

The DMA solution we're considering for saving VRAM contents would add a
dependency on libdrm and libdrm-amdgpu.


>
>>   * If our plugin can be upstreamed in the CRIU tree, what would be the
>>     right directory?
> I would just put it into criu/plugins/

Sounds good.


>
> It would also be good to have your patchset submitted as a PR on GitHub
> to have our normal CI test coverage of the changes.

We'll probably have to recreate our repository to start as a fork of the
upstream CRIU repository, so that we can easily send pull requests.
We're probably not going to be ready for upstreaming for a few more
months. Do you want to get occasional pull requests anyway, just to
run CI on our work-in-progress code?

Regards,
  Felix


>
> 		Adrian


More information about the amd-gfx mailing list