[Patch v5 00/24] CHECKPOINT RESTORE WITH ROCm
Bhardwaj, Rajneesh
Rajneesh.Bhardwaj at amd.com
Fri Feb 4 03:23:05 UTC 2022
[AMD Official Use Only]
Thank you Felix for the review and your guidance.
-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling at amd.com>
Sent: Thursday, February 3, 2022 10:22 PM
To: Bhardwaj, Rajneesh <Rajneesh.Bhardwaj at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Yat Sin, David <David.YatSin at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; dri-devel at lists.freedesktop.org
Subject: Re: [Patch v5 00/24] CHECKPOINT RESTORE WITH ROCm
The series is
Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
On 2022-02-03 at 04:08, Rajneesh Bhardwaj wrote:
> V5: Proposed IOCTL APIs for CRIU with consolidated feedback
>
> CRIU is a user space tool which is very popular for container live
> migration in datacentres. It can checkpoint a running application,
> save its complete state, memory contents and all system resources to
> images on disk which can be migrated to another machine and restored later.
> More information on CRIU can be found at https://criu.org/Main_Page
>
> CRIU currently does not support Checkpoint / Restore for applications
> that have device files open, so it cannot perform checkpoint and
> restore on GPU devices, which are very complex and have their own VRAM
> managed privately. CRIU can, however, support external devices by using
> a plugin architecture. We feel that we are getting close to finalizing
> our IOCTL APIs, which were changed again since V3 for an improved modular design.
>
> Our changes to CRIU user space can be obtained from here:
> https://github.com/RadeonOpenCompute/criu/tree/amdgpu_rfc-211222
>
> We have tested the following scenarios:
> - Checkpoint / Restore of a PyTorch (BERT) workload
> - kfdtests with queues and events
> - Gfx9 and Gfx10 based multi GPU test systems
> - On bare metal and inside a Docker container
> - Restoring on a different system
>
> V1: Initial
> V2: Addressed review comments
> V3: Rebased on latest amd-staging-drm-next (5.15 based)
> V4: New API design and basic support for SVM. There is an
> outstanding issue with SVM restore which is currently under debug;
> hopefully it won't impact the ioctl APIs, as SVM ranges are treated as
> private data hidden from user space, like queues and events, with the
> new approach.
> V5: Fix the SVM related issues and finalize the APIs.
>
> David Yat Sin (9):
> drm/amdkfd: CRIU Implement KFD unpause operation
> drm/amdkfd: CRIU add queues support
> drm/amdkfd: CRIU restore queue ids
> drm/amdkfd: CRIU restore sdma id for queues
> drm/amdkfd: CRIU restore queue doorbell id
> drm/amdkfd: CRIU checkpoint and restore queue mqds
> drm/amdkfd: CRIU checkpoint and restore queue control stack
> drm/amdkfd: CRIU checkpoint and restore events
> drm/amdkfd: CRIU implement gpu_id remapping
>
> Rajneesh Bhardwaj (15):
> x86/configs: CRIU update debug rock defconfig
> drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs
> drm/amdkfd: CRIU Implement KFD process_info ioctl
> drm/amdkfd: CRIU Implement KFD checkpoint ioctl
> drm/amdkfd: CRIU Implement KFD restore ioctl
> drm/amdkfd: CRIU Implement KFD resume ioctl
> drm/amdkfd: CRIU export BOs as prime dmabuf objects
> drm/amdkfd: CRIU checkpoint and restore xnack mode
> drm/amdkfd: CRIU allow external mm for svm ranges
> drm/amdkfd: use user_gpu_id for svm ranges
> drm/amdkfd: CRIU Discover svm ranges
> drm/amdkfd: CRIU Save Shared Virtual Memory ranges
> drm/amdkfd: CRIU prepare for svm resume
> drm/amdkfd: CRIU resume shared virtual memory ranges
> drm/amdkfd: Bump up KFD API version for CRIU
>
> arch/x86/configs/rock-dbg_defconfig | 53 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +-
> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 64 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 20 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 +
> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 1471 ++++++++++++++---
> drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c | 2 +-
> .../drm/amd/amdkfd/kfd_device_queue_manager.c | 185 ++-
> .../drm/amd/amdkfd/kfd_device_queue_manager.h | 16 +-
> drivers/gpu/drm/amd/amdkfd/kfd_events.c | 313 +++-
> drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h | 14 +
> .../gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c | 75 +
> .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c | 77 +
> .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c | 92 ++
> .../gpu/drm/amd/amdkfd/kfd_mqd_manager_vi.c | 84 +
> drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 160 +-
> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 72 +-
> .../amd/amdkfd/kfd_process_queue_manager.c | 372 ++++-
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 331 +++-
> drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 39 +
> include/uapi/linux/kfd_ioctl.h | 84 +-
> 21 files changed, 3193 insertions(+), 340 deletions(-)
>