[PATCH v2 00/18] CHECKPOINT RESTORE WITH ROCm
David Yat Sin
david.yatsin at amd.com
Tue Aug 24 19:06:21 UTC 2021
CRIU is a user space tool which is very popular for container live migration in datacentres. It can checkpoint a running application, saving its complete state, memory contents and all system resources to images on disk, which can then be migrated to another machine and restored later. More information on CRIU can be found at https://criu.org/Main_Page
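As a rough illustration of the checkpoint / restore cycle described above, the sketch below builds the two CRIU command lines in Python. The PID 1234, the image directory /tmp/ckpt and the helper names are placeholders chosen for this example. --tree, --images-dir and --shell-job are standard CRIU options, and --shell-job is only needed when the target was started from a terminal. Nothing here is specific to the amdgpu plugin.

```python
import shlex


def criu_dump_cmd(pid, images_dir):
    """Build the argv that checkpoints the process tree rooted at pid."""
    # Hypothetical helper: it only assembles the command, it does not run it.
    return ["criu", "dump", "--tree", str(pid),
            "--images-dir", images_dir, "--shell-job"]


def criu_restore_cmd(images_dir):
    """Build the argv that restores a process tree from saved images."""
    return ["criu", "restore", "--images-dir", images_dir, "--shell-job"]


if __name__ == "__main__":
    # Placeholder PID and directory, for illustration only.
    print(shlex.join(criu_dump_cmd(1234, "/tmp/ckpt")))
    print(shlex.join(criu_restore_cmd("/tmp/ckpt")))
```

To migrate, the image directory would be copied to the target machine between the two steps. Both commands normally need root privileges.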
CRIU currently does not support Checkpoint / Restore for applications that have device files open, so it cannot checkpoint and restore GPU devices, which are very complex and have their own privately managed VRAM. CRIU can, however, support external devices through its plugin architecture. This patch series adds initial support for ROCm applications while we continue to work on the remaining features. We welcome feedback, especially regarding the APIs, before involving a larger audience.
Our plugin code can be found at https://github.com/RadeonOpenCompute/criu/tree/criu-dev/plugins/amdgpu
We have tested the following scenarios:
- Checkpoint / Restore of a PyTorch (BERT) workload
- kfdtests with queues and events
- Gfx9 and Gfx10 based multi GPU test systems
- On bare metal and inside a docker container
- Restoring on a different system
V2: Addressed review comments
David Yat Sin (9):
drm/amdkfd: CRIU Implement KFD pause ioctl
drm/amdkfd: CRIU add queues support
drm/amdkfd: CRIU restore queue ids
drm/amdkfd: CRIU restore sdma id for queues
drm/amdkfd: CRIU restore queue doorbell id
drm/amdkfd: CRIU dump and restore queue mqds
drm/amdkfd: CRIU dump/restore queue control stack
drm/amdkfd: CRIU dump and restore events
drm/amdkfd: CRIU implement gpu_id remapping
Rajneesh Bhardwaj (9):
x86/configs: CRIU update release defconfig
x86/configs: CRIU update debug rock defconfig
drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs
drm/amdkfd: CRIU Implement KFD process_info ioctl
drm/amdkfd: CRIU Implement KFD dumper ioctl
drm/amdkfd: CRIU Implement KFD restore ioctl
drm/amdkfd: CRIU Implement KFD resume ioctl
Revert "drm/amdgpu: Remove verify_access shortcut for KFD BOs"
drm/amdkfd: CRIU export kfd bos as prime dmabuf objects
arch/x86/configs/rock-dbg_defconfig | 53 +-
arch/x86/configs/rock-rel_defconfig | 13 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 5 +-
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 51 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 27 +
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 +
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 1170 ++++++++++++++---
drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c | 2 +-
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 187 ++-
.../drm/amd/amdkfd/kfd_device_queue_manager.h | 14 +-
drivers/gpu/drm/amd/amdkfd/kfd_events.c | 323 ++++-
drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h | 11 +
.../gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c | 76 ++
.../gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c | 78 ++
.../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c | 86 ++
.../gpu/drm/amd/amdkfd/kfd_mqd_manager_vi.c | 77 ++
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 138 +-
drivers/gpu/drm/amd/amdkfd/kfd_process.c | 69 +-
.../amd/amdkfd/kfd_process_queue_manager.c | 475 ++++++-
include/uapi/linux/kfd_ioctl.h | 221 +++-
20 files changed, 2815 insertions(+), 263 deletions(-)
--
2.17.1