[Patch v4 18/24] drm/amdkfd: CRIU checkpoint and restore xnack mode
Felix Kuehling
felix.kuehling at amd.com
Tue Jan 11 00:10:00 UTC 2022
On 2022-01-05 10:22 a.m., philip yang wrote:
>
>
> On 2021-12-22 7:37 p.m., Rajneesh Bhardwaj wrote:
>> The xnack mode setting of a kfd process controls whether device page
>> faults are recoverable. For CR, we don't consider negative values,
>> which are typically used for querying the current xnack mode without
>> modifying it.
>>
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 15 +++++++++++++++
>> drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 +
>> 2 files changed, 16 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> index 178b0ccfb286..446eb9310915 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> @@ -1845,6 +1845,11 @@ static int criu_checkpoint_process(struct kfd_process *p,
>> memset(&process_priv, 0, sizeof(process_priv));
>>
>> process_priv.version = KFD_CRIU_PRIV_VERSION;
>> + /* For CR, we don't consider negative xnack mode which is used for
>> + * querying without changing it, here 0 simply means disabled and 1
>> + * means enabled so retry for finding a valid PTE.
>> + */
> The negative value for querying the xnack mode belongs to the
> kfd_ioctl_set_xnack_mode user-space ioctl interface, which CRIU does
> not use. I think this comment is misleading.
>> + process_priv.xnack_mode = p->xnack_enabled ? 1 : 0;
> change to process_priv.xnack_enabled
>>
>> ret = copy_to_user(user_priv_data + *priv_offset,
>> &process_priv, sizeof(process_priv));
>> @@ -2231,6 +2236,16 @@ static int criu_restore_process(struct kfd_process *p,
>> return -EINVAL;
>> }
>>
>> + pr_debug("Setting XNACK mode\n");
>> + if (process_priv.xnack_mode && !kfd_process_xnack_mode(p, true)) {
>> + pr_err("xnack mode cannot be set\n");
>> + ret = -EPERM;
>> + goto exit;
>> + } else {
>
> On GFXv9 GPUs other than Aldebaran, this means a process checkpointed
> with xnack off can be restored and resumed on a GPU with xnack on. The
> shader will continue running, but the driver is not guaranteed to keep
> svm ranges mapped on the GPU at all times, so if a retry fault happens,
> the shader will not recover. Maybe change to:
>
> if (KFD_GC_VERSION(dev) != IP_VERSION(9, 4, 2)) {
>
The code here was correct. The xnack mode applies to the whole process,
not just one GPU. The logic for checking the capabilities of all GPUs is
already in kfd_process_xnack_mode. If any GPU cannot support XNACK,
restoring a non-0 XNACK mode will fail.

Any GPU can run in XNACK-disabled mode, so we don't need any limitations
for process_priv.xnack_enabled == 0.
Regards,
Felix
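
For illustration only, here is a minimal user-space sketch of the restore
rule described in the reply above. It is not the driver code: the mock_*
types and functions and the per-GPU supports_xnack flag are hypothetical
stand-ins, and mock_xnack_supported only models the role the reply assigns
to kfd_process_xnack_mode(p, true), namely a process-wide capability check
across all GPUs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the driver structures; not the real kfd types. */
struct mock_gpu {
	bool supports_xnack;	/* assumption: one capability flag per GPU */
};

struct mock_process {
	const struct mock_gpu *gpus;
	size_t n_gpus;
	bool xnack_enabled;	/* process-wide setting, not per GPU */
};

/*
 * Models the process-wide check: XNACK can be enabled only if every GPU
 * used by the process supports it.
 */
static bool mock_xnack_supported(const struct mock_process *p)
{
	assert(p != NULL);
	for (size_t i = 0; i < p->n_gpus; i++)
		if (!p->gpus[i].supports_xnack)
			return false;
	return true;
}

/*
 * Restore path: only a non-zero saved mode needs the capability check,
 * because any GPU can run with XNACK disabled.
 */
static int mock_restore_xnack(struct mock_process *p, unsigned int saved_mode)
{
	assert(p != NULL);
	if (saved_mode && !mock_xnack_supported(p)) {
		fprintf(stderr, "xnack mode cannot be set\n");
		return -EPERM;
	}
	p->xnack_enabled = saved_mode != 0;
	return 0;
}
```

With one XNACK-capable GPU and one that is not, mock_restore_xnack(&p, 1)
returns -EPERM and leaves p.xnack_enabled unchanged, while
mock_restore_xnack(&p, 0) succeeds on any mix of GPUs.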
>     if (process_priv.xnack_enabled != kfd_process_xnack_mode(p, true)) {
>         pr_err("xnack mode cannot be set\n");
>         ret = -EPERM;
>         goto exit;
>     }
> }
>
> pr_debug("set xnack mode: %d\n", process_priv.xnack_enabled);
> p->xnack_enabled = process_priv.xnack_enabled;
>
>
>> + pr_debug("set xnack mode: %d\n", process_priv.xnack_mode);
>> + p->xnack_enabled = process_priv.xnack_mode;
>> + }
>> +
>> exit:
>> return ret;
>> }
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> index 855c162b85ea..d72dda84c18c 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> @@ -1057,6 +1057,7 @@ void kfd_process_set_trap_handler(struct qcm_process_device *qpd,
>>
>> struct kfd_criu_process_priv_data {
>> uint32_t version;
>> + uint32_t xnack_mode;
>
> bool xnack_enabled;
>
> Regards,
>
> Philip
>
>> };
>>
>> struct kfd_criu_device_priv_data {