[Patch v5 00/24] CHECKPOINT RESTORE WITH ROCm

Bhardwaj, Rajneesh Rajneesh.Bhardwaj at amd.com
Fri Feb 4 03:23:05 UTC 2022


[AMD Official Use Only]

Thank you Felix for the review and your guidance.

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling at amd.com> 
Sent: Thursday, February 3, 2022 10:22 PM
To: Bhardwaj, Rajneesh <Rajneesh.Bhardwaj at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Yat Sin, David <David.YatSin at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; dri-devel at lists.freedesktop.org
Subject: Re: [Patch v5 00/24] CHECKPOINT RESTORE WITH ROCm

The series is

Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>


On 2022-02-03 at 04:08, Rajneesh Bhardwaj wrote:
> V5: Proposed IOCTL APIs for CRIU with consolidated feedback
>
> CRIU is a user space tool which is very popular for container live 
> migration in datacentres. It can checkpoint a running application, 
> save its complete state, memory contents and all system resources to 
> images on disk which can be migrated to another machine and restored later.
> More information on CRIU can be found at https://criu.org/Main_Page
>
> CRIU currently does not support Checkpoint / Restore with applications 
> that have device files open, so it cannot perform checkpoint and 
> restore on GPU devices, which are very complex and have their own VRAM 
> managed privately. CRIU can, however, support external devices by using 
> a plugin architecture. We feel that we are getting close to finalizing 
> our IOCTL APIs, which were changed again since V3 for an improved modular design.
>
> Our changes to CRIU user space can be obtained from here:
> https://github.com/RadeonOpenCompute/criu/tree/amdgpu_rfc-211222
>
> We have tested the following scenarios:
>   - Checkpoint / Restore of a PyTorch (BERT) workload
>   - kfdtests with queues and events
>   - Gfx9 and Gfx10 based multi GPU test systems
>   - On baremetal and inside a docker container
>   - Restoring on a different system
>
> V1: Initial
> V2: Addressed review comments
> V3: Rebased on latest amd-staging-drm-next (5.15 based)
> V4: New API design and basic support for SVM. There is an 
> outstanding issue with SVM restore which is currently under debug; 
> hopefully it won't impact the ioctl APIs, as SVMs are treated as 
> private data hidden from user space, like queues and events, with the 
> new approach.
> V5: Fix the SVM related issues and finalize the APIs.
>
> David Yat Sin (9):
>    drm/amdkfd: CRIU Implement KFD unpause operation
>    drm/amdkfd: CRIU add queues support
>    drm/amdkfd: CRIU restore queue ids
>    drm/amdkfd: CRIU restore sdma id for queues
>    drm/amdkfd: CRIU restore queue doorbell id
>    drm/amdkfd: CRIU checkpoint and restore queue mqds
>    drm/amdkfd: CRIU checkpoint and restore queue control stack
>    drm/amdkfd: CRIU checkpoint and restore events
>    drm/amdkfd: CRIU implement gpu_id remapping
>
> Rajneesh Bhardwaj (15):
>    x86/configs: CRIU update debug rock defconfig
>    drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs
>    drm/amdkfd: CRIU Implement KFD process_info ioctl
>    drm/amdkfd: CRIU Implement KFD checkpoint ioctl
>    drm/amdkfd: CRIU Implement KFD restore ioctl
>    drm/amdkfd: CRIU Implement KFD resume ioctl
>    drm/amdkfd: CRIU export BOs as prime dmabuf objects
>    drm/amdkfd: CRIU checkpoint and restore xnack mode
>    drm/amdkfd: CRIU allow external mm for svm ranges
>    drm/amdkfd: use user_gpu_id for svm ranges
>    drm/amdkfd: CRIU Discover svm ranges
>    drm/amdkfd: CRIU Save Shared Virtual Memory ranges
>    drm/amdkfd: CRIU prepare for svm resume
>    drm/amdkfd: CRIU resume shared virtual memory ranges
>    drm/amdkfd: Bump up KFD API version for CRIU
>
>   arch/x86/configs/rock-dbg_defconfig           |   53 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    7 +-
>   .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   64 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   20 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |    2 +
>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 1471 ++++++++++++++---
>   drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c       |    2 +-
>   .../drm/amd/amdkfd/kfd_device_queue_manager.c |  185 ++-
>   .../drm/amd/amdkfd/kfd_device_queue_manager.h |   16 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_events.c       |  313 +++-
>   drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h  |   14 +
>   .../gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c  |   75 +
>   .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c  |   77 +
>   .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   |   92 ++
>   .../gpu/drm/amd/amdkfd/kfd_mqd_manager_vi.c   |   84 +
>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  160 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_process.c      |   72 +-
>   .../amd/amdkfd/kfd_process_queue_manager.c    |  372 ++++-
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c          |  331 +++-
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |   39 +
>   include/uapi/linux/kfd_ioctl.h                |   84 +-
>   21 files changed, 3193 insertions(+), 340 deletions(-)
>

