[Patch v5 00/24] CHECKPOINT RESTORE WITH ROCm

Felix Kuehling felix.kuehling at amd.com
Fri Feb 4 03:22:11 UTC 2022


The series is

Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>


On 2022-02-03 at 04:08, Rajneesh Bhardwaj wrote:
> V5: Proposed IOCTL APIs for CRIU with consolidated feedback
>
> CRIU is a user space tool which is very popular for container live
> migration in datacentres. It can checkpoint a running application, save
> its complete state, memory contents and all system resources to images
> on disk which can be migrated to another machine and restored later.
> More information on CRIU can be found at https://criu.org/Main_Page
>
> CRIU currently does not support Checkpoint / Restore with applications
> that have device files open, so it cannot perform checkpoint and restore
> on GPU devices, which are very complex and have their own VRAM managed
> privately. CRIU can, however, support external devices by using a plugin
> architecture. We feel that we are getting close to finalizing our IOCTL
> APIs which were again changed since V3 for an improved modular design.
>
> Our changes to CRIU user space can be obtained from here:
> https://github.com/RadeonOpenCompute/criu/tree/amdgpu_rfc-211222
>
> We have tested the following scenarios:
>   - Checkpoint / Restore of a PyTorch (BERT) workload
>   - kfdtests with queues and events
>   - Gfx9 and Gfx10 based multi GPU test systems
>   - On bare metal and inside a Docker container
>   - Restoring on a different system
>
> V1: Initial
> V2: Addressed review comments
> V3: Rebased on latest amd-staging-drm-next (5.15 based)
> V4: New API design and basic support for SVM. There is an outstanding
> issue with SVM restore which is currently under debug; hopefully it
> won't impact the ioctl APIs, as SVMs are treated as private data hidden
> from user space, like queues and events, with the new approach.
> V5: Fix the SVM related issues and finalize the APIs.
>
> David Yat Sin (9):
>    drm/amdkfd: CRIU Implement KFD unpause operation
>    drm/amdkfd: CRIU add queues support
>    drm/amdkfd: CRIU restore queue ids
>    drm/amdkfd: CRIU restore sdma id for queues
>    drm/amdkfd: CRIU restore queue doorbell id
>    drm/amdkfd: CRIU checkpoint and restore queue mqds
>    drm/amdkfd: CRIU checkpoint and restore queue control stack
>    drm/amdkfd: CRIU checkpoint and restore events
>    drm/amdkfd: CRIU implement gpu_id remapping
>
> Rajneesh Bhardwaj (15):
>    x86/configs: CRIU update debug rock defconfig
>    drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs
>    drm/amdkfd: CRIU Implement KFD process_info ioctl
>    drm/amdkfd: CRIU Implement KFD checkpoint ioctl
>    drm/amdkfd: CRIU Implement KFD restore ioctl
>    drm/amdkfd: CRIU Implement KFD resume ioctl
>    drm/amdkfd: CRIU export BOs as prime dmabuf objects
>    drm/amdkfd: CRIU checkpoint and restore xnack mode
>    drm/amdkfd: CRIU allow external mm for svm ranges
>    drm/amdkfd: use user_gpu_id for svm ranges
>    drm/amdkfd: CRIU Discover svm ranges
>    drm/amdkfd: CRIU Save Shared Virtual Memory ranges
>    drm/amdkfd: CRIU prepare for svm resume
>    drm/amdkfd: CRIU resume shared virtual memory ranges
>    drm/amdkfd: Bump up KFD API version for CRIU
>
>   arch/x86/configs/rock-dbg_defconfig           |   53 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    7 +-
>   .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   64 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   20 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |    2 +
>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 1471 ++++++++++++++---
>   drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c       |    2 +-
>   .../drm/amd/amdkfd/kfd_device_queue_manager.c |  185 ++-
>   .../drm/amd/amdkfd/kfd_device_queue_manager.h |   16 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_events.c       |  313 +++-
>   drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h  |   14 +
>   .../gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c  |   75 +
>   .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c  |   77 +
>   .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   |   92 ++
>   .../gpu/drm/amd/amdkfd/kfd_mqd_manager_vi.c   |   84 +
>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  160 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_process.c      |   72 +-
>   .../amd/amdkfd/kfd_process_queue_manager.c    |  372 ++++-
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c          |  331 +++-
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |   39 +
>   include/uapi/linux/kfd_ioctl.h                |   84 +-
>   21 files changed, 3193 insertions(+), 340 deletions(-)
>

