[Patch v4 00/24] CHECKPOINT RESTORE WITH ROCm
Rajneesh Bhardwaj
rajneesh.bhardwaj at amd.com
Thu Dec 23 00:36:47 UTC 2021
CRIU is a user space tool which is very popular for container live
migration in datacentres. It can checkpoint a running application, save
its complete state, memory contents and all system resources to images
on disk which can be migrated to another machine and restored later.
More information on CRIU can be found at https://criu.org/Main_Page
CRIU currently does not support Checkpoint / Restore with applications
that have device files open, so it cannot perform checkpoint and restore
on GPU devices, which are very complex and have their own VRAM managed
privately. CRIU can, however, support external devices by using a plugin
architecture. We feel that we are getting close to finalizing our IOCTL
APIs, which were changed again since V3 for an improved modular design.
Our changes to CRIU user space can be obtained from here:
https://github.com/RadeonOpenCompute/criu/tree/amdgpu_rfc-211222
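To make the intended flow a little more concrete, the following is a
minimal sketch of how a plugin might sequence the five operations named
in this series' patch titles (process_info, checkpoint, unpause, restore,
resume). The enum names, the ordering rules and both helper functions are
assumptions made for illustration only; they are not the definitions in
include/uapi/linux/kfd_ioctl.h.

```c
#include <stdbool.h>

/* Hypothetical operation names, derived from the patch titles only. */
enum sketch_criu_op {
	SKETCH_OP_NONE = 0,
	SKETCH_OP_PROCESS_INFO,
	SKETCH_OP_CHECKPOINT,
	SKETCH_OP_UNPAUSE,
	SKETCH_OP_RESTORE,
	SKETCH_OP_RESUME,
};

/*
 * Assumed ordering (not confirmed by the cover letter):
 *   dump side:    process_info -> checkpoint -> unpause
 *   restore side: restore -> resume
 */
static bool criu_op_allowed(enum sketch_criu_op prev, enum sketch_criu_op next)
{
	switch (next) {
	case SKETCH_OP_PROCESS_INFO:
	case SKETCH_OP_RESTORE:
		/* Either side starts from a clean state. */
		return prev == SKETCH_OP_NONE;
	case SKETCH_OP_CHECKPOINT:
		return prev == SKETCH_OP_PROCESS_INFO;
	case SKETCH_OP_UNPAUSE:
		/* Allow unpause after an aborted or a completed dump. */
		return prev == SKETCH_OP_PROCESS_INFO ||
		       prev == SKETCH_OP_CHECKPOINT;
	case SKETCH_OP_RESUME:
		return prev == SKETCH_OP_RESTORE;
	default:
		return false;
	}
}

/* Walk a whole sequence and reject it at the first out-of-order step. */
static bool criu_sequence_valid(const enum sketch_criu_op *ops, int n)
{
	enum sketch_criu_op prev = SKETCH_OP_NONE;

	for (int i = 0; i < n; i++) {
		if (!criu_op_allowed(prev, ops[i]))
			return false;
		prev = ops[i];
	}
	return true;
}
```

Under these assumptions a dump is the sequence process_info, checkpoint,
unpause, and a restore is restore followed by resume; anything else is
rejected.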
We have tested the following scenarios:
- Checkpoint / Restore of a PyTorch (BERT) workload
- kfdtests with queues and events
- Gfx9 and Gfx10 based multi GPU test systems
- On bare metal and inside a Docker container
- Restoring on a different system
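Restoring on a different system implies that the GPU ids recorded at
checkpoint time may not match the ids on the restoring host, which is
what the gpu_id remapping patch in this series addresses. A minimal
sketch of the idea follows. The struct layout, the field name
actual_gpu_id and the helper remap_gpu_id() are hypothetical and chosen
for illustration; only the term user_gpu_id appears in the patch titles.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry; not the driver's actual data structure. */
struct gpu_id_map_entry {
	uint32_t user_gpu_id;   /* id the checkpointed process knows */
	uint32_t actual_gpu_id; /* id of the device on the restoring host */
};

/*
 * Look up the device that stands in for user_gpu_id on this host.
 * Returns 0 and stores the result in *actual on success, -1 if the
 * id has no mapping.
 */
static int remap_gpu_id(const struct gpu_id_map_entry *map, size_t n,
			uint32_t user_gpu_id, uint32_t *actual)
{
	for (size_t i = 0; i < n; i++) {
		if (map[i].user_gpu_id == user_gpu_id) {
			*actual = map[i].actual_gpu_id;
			return 0;
		}
	}
	return -1;
}
```

The restored process keeps using the ids it saw before the checkpoint,
and the translation to real devices happens behind that table.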
V1: Initial
V2: Addressed review comments
V3: Rebased on latest amd-staging-drm-next (5.15 based)
V4: New API design and basic support for SVM. There is an outstanding
issue with SVM restore which is currently under debug; hopefully it
won't impact the ioctl APIs, as SVMs are treated as private data hidden
from user space, like queues and events, with the new approach.
David Yat Sin (9):
drm/amdkfd: CRIU Implement KFD unpause operation
drm/amdkfd: CRIU add queues support
drm/amdkfd: CRIU restore queue ids
drm/amdkfd: CRIU restore sdma id for queues
drm/amdkfd: CRIU restore queue doorbell id
drm/amdkfd: CRIU checkpoint and restore queue mqds
drm/amdkfd: CRIU checkpoint and restore queue control stack
drm/amdkfd: CRIU checkpoint and restore events
drm/amdkfd: CRIU implement gpu_id remapping
Rajneesh Bhardwaj (15):
x86/configs: CRIU update debug rock defconfig
x86/configs: Add rock-rel_defconfig for amd-feature-criu branch
drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs
drm/amdkfd: CRIU Implement KFD process_info ioctl
drm/amdkfd: CRIU Implement KFD checkpoint ioctl
drm/amdkfd: CRIU Implement KFD restore ioctl
drm/amdkfd: CRIU Implement KFD resume ioctl
drm/amdkfd: CRIU export BOs as prime dmabuf objects
drm/amdkfd: CRIU checkpoint and restore xnack mode
drm/amdkfd: CRIU allow external mm for svm ranges
drm/amdkfd: use user_gpu_id for svm ranges
drm/amdkfd: CRIU Discover svm ranges
drm/amdkfd: CRIU Save Shared Virtual Memory ranges
drm/amdkfd: CRIU prepare for svm resume
drm/amdkfd: CRIU resume shared virtual memory ranges
arch/x86/configs/rock-dbg_defconfig | 53 +-
arch/x86/configs/rock-rel_defconfig | 4927 +++++++++++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 6 +-
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 51 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 20 +
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 +
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 1453 ++++-
drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c | 2 +-
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 185 +-
.../drm/amd/amdkfd/kfd_device_queue_manager.h | 18 +-
drivers/gpu/drm/amd/amdkfd/kfd_events.c | 313 +-
drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h | 14 +
.../gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c | 72 +
.../gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c | 74 +
.../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c | 89 +
.../gpu/drm/amd/amdkfd/kfd_mqd_manager_vi.c | 81 +
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 166 +-
drivers/gpu/drm/amd/amdkfd/kfd_process.c | 86 +-
.../amd/amdkfd/kfd_process_queue_manager.c | 377 +-
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 326 +-
drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 39 +
include/uapi/linux/kfd_ioctl.h | 79 +-
22 files changed, 8099 insertions(+), 334 deletions(-)
create mode 100644 arch/x86/configs/rock-rel_defconfig
--
2.17.1