[Patch v4 18/24] drm/amdkfd: CRIU checkpoint and restore xnack mode

Felix Kuehling felix.kuehling at amd.com
Tue Jan 11 00:10:00 UTC 2022


On 2022-01-05 10:22 a.m., philip yang wrote:
>
>
> On 2021-12-22 7:37 p.m., Rajneesh Bhardwaj wrote:
>> The xnack mode setting of a kfd process controls whether device page
>> faults are recoverable. For CR, we don't consider the negative values
>> that are typically used to query the current xnack mode without
>> modifying it.
>>
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 15 +++++++++++++++
>>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h    |  1 +
>>   2 files changed, 16 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> index 178b0ccfb286..446eb9310915 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> @@ -1845,6 +1845,11 @@ static int criu_checkpoint_process(struct kfd_process *p,
>>   	memset(&process_priv, 0, sizeof(process_priv));
>>   
>>   	process_priv.version = KFD_CRIU_PRIV_VERSION;
>> +	/* For CR, we don't consider negative xnack mode which is used for
>> +	 * querying without changing it, here 0 simply means disabled and 1
>> +	 * means enabled so retry for finding a valid PTE.
>> +	 */
> The negative value that queries the xnack mode belongs to the 
> kfd_ioctl_set_xnack_mode user-space ioctl interface, which CRIU does 
> not use. I think this comment is misleading.
>> +	process_priv.xnack_mode = p->xnack_enabled ? 1 : 0;
> change to process_priv.xnack_enabled
>>   
>>   	ret = copy_to_user(user_priv_data + *priv_offset,
>>   				&process_priv, sizeof(process_priv));
>> @@ -2231,6 +2236,16 @@ static int criu_restore_process(struct kfd_process *p,
>>   		return -EINVAL;
>>   	}
>>   
>> +	pr_debug("Setting XNACK mode\n");
>> +	if (process_priv.xnack_mode && !kfd_process_xnack_mode(p, true)) {
>> +		pr_err("xnack mode cannot be set\n");
>> +		ret = -EPERM;
>> +		goto exit;
>> +	} else {
>
> On GFXv9 GPUs except Aldebaran, this means a process checkpointed with 
> xnack off can restore and resume on a GPU with xnack on. The shader 
> will continue running, but the driver is not guaranteed to keep svm 
> ranges mapped on the GPU all the time, so if a retry fault happens the 
> shader will not recover. Maybe change to:
>
> if (KFD_GC_VERSION(dev) != IP_VERSION(9, 4, 2)) {
>
The code here was correct. The xnack mode applies to the whole process, 
not just one GPU. The logic for checking the capabilities of all GPUs is 
already in kfd_process_xnack_mode. If XNACK cannot be supported by all 
GPUs, restoring a non-0 XNACK mode will fail.

Any GPU can run in XNACK-disabled mode. So we don't need any limitations 
for process_priv.xnack_enabled == 0.
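
To illustrate the rule above, here is a minimal user-space sketch. It is 
not the driver code: every sketch_* name is hypothetical, and the model 
only captures the "all GPUs must support XNACK" check described above, 
not the full logic in kfd_process_xnack_mode.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver structures, illustration only. */
struct sketch_gpu {
	bool supports_xnack;	/* can this GPU run with XNACK enabled? */
};

struct sketch_process {
	const struct sketch_gpu *gpus;	/* all GPUs used by the process */
	size_t n_gpus;
	bool xnack_enabled;		/* process-wide mode, not per GPU */
};

/*
 * Simplified model of the capability check: XNACK can be enabled only
 * if every GPU of the process supports it.
 */
static bool sketch_xnack_supported(const struct sketch_process *p)
{
	size_t i;

	for (i = 0; i < p->n_gpus; i++)
		if (!p->gpus[i].supports_xnack)
			return false;
	return true;
}

/* Checkpoint side: the saved value is only ever 0 or 1. */
static uint32_t sketch_checkpoint_xnack(const struct sketch_process *p)
{
	return p->xnack_enabled ? 1 : 0;
}

/*
 * Restore side: a non-zero saved mode needs support on all GPUs,
 * a saved mode of 0 is always accepted.
 */
static int sketch_restore_xnack(struct sketch_process *p, uint32_t saved)
{
	if (saved && !sketch_xnack_supported(p))
		return -EPERM;

	p->xnack_enabled = saved != 0;
	return 0;
}
```

With one GPU that lacks XNACK support in the process, restoring a saved 
mode of 1 fails with -EPERM in this model, while restoring 0 succeeds 
on any mix of GPUs.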

Regards,
   Felix


>     if (process_priv.xnack_enabled != kfd_process_xnack_mode(p, true)) {
>         pr_err("xnack mode cannot be set\n");
>         ret = -EPERM;
>         goto exit;
>     }
> }
>
> pr_debug("set xnack mode: %d\n", process_priv.xnack_enabled);
> p->xnack_enabled = process_priv.xnack_enabled;
>
>
>> +		pr_debug("set xnack mode: %d\n", process_priv.xnack_mode);
>> +		p->xnack_enabled = process_priv.xnack_mode;
>> +	}
>> +
>>   exit:
>>   	return ret;
>>   }
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> index 855c162b85ea..d72dda84c18c 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> @@ -1057,6 +1057,7 @@ void kfd_process_set_trap_handler(struct qcm_process_device *qpd,
>>   
>>   struct kfd_criu_process_priv_data {
>>   	uint32_t version;
>> +	uint32_t xnack_mode;
>
> bool xnack_enabled;
>
> Regards,
>
> Philip
>
>>   };
>>   
>>   struct kfd_criu_device_priv_data {

