[RESEND] amdkfd: check ret code for ioctls
Felix Kuehling
felix.kuehling at amd.com
Tue Nov 12 23:46:25 UTC 2024
On 2024-11-12 2:35, Zhu Lingshan wrote:
> On 11/12/2024 6:21 AM, Felix Kuehling wrote:
>>
>> On 2024-11-11 03:08, Zhu Lingshan wrote:
>>> On 11/5/2024 4:50 AM, Felix Kuehling wrote:
>>>> On 2024-10-31 22:35, Zhu Lingshan wrote:
>>>>> On 10/31/2024 11:30 PM, Felix Kuehling wrote:
>>>>>> On 2024-10-31 6:50, Zhu Lingshan wrote:
>>>>>>> The ioctl functions may fail, causing the args unreliable.
>>>>>>> Therefore, the args should not be copied to user space.
>>>>>>>
>>>>>>> The return code provides enough information for
>>>>>>> error handling in user space.
>>>>>>>
>>>>>>> This commit checks the return code of the ioctl functions
>>>>>>> and handles errors appropriately when they fail.
>>>>>> I have reviewed and rejected this patch before. My opinion has not changed. The existing code copies the ioctl arg structure back to user mode even in error cases because user mode needs additional information from that structure for some ioctls.
>>>>> how can the user space program distinguish the "good informational parameters" from the "bad default legacy parameters"? There can be other user space programs other than thunk.
>>>>>
>>>>> what if the user space program doing pulling mode, it can pull the args changes because ioctl is usually slower, our code should be robust.
>>>>>
>>>>> usually the return code provides enough information for the user space programs.
>>>> I don't understand your concern. Even without your patch, the failing ioctl still returns the error code to user mode. User mode can safely ignore additional information returned in the argument structure. You are raising concerns about performance or robustness. I don't see that either of those are negatively impacted by copying additional information in the argument struct to user mode.
>>> Still the questions:
>>> 1) how can the user space program distinguish the "good informational parameters" from the "bad default legacy parameters"? 2) what if the user space program doing pulling mode, pull the args before error code returned. Memory changes are usually faster than error code.
>>
>> There are no "bad default legacy parameters". Ioctls that were defined to return additional information in the parameter structure on errors have always done so. This should be documented in the kfd_ioctl.h header, though some ioctls have better documentation than others. For a good example, see kfd_ioctl_dbg_trap_get_queue_snapshot_args and kfd_ioctl_dbg_trap_get_device_snapshot_args, which do return the number of queues or devices in the parameter structure if user mode didn't allocate enough space.
>>
>> Another example is kfd_ioctl_map_memory_to_gpu_args and kfd_ioctl_unmap_memory_from_gpu_args, which returns the number of successful mappings if the ioctl fails. This is necessary to restart the operation after -ERESTARTSYS and skip mappings that were already completed.
> I believe these are bugs that should be fixed. If request N but only allocate M where M < N, the kernel space should return a proper error code and then user space should reduce the requesting number, like N/2.
That's not how these ioctls work. The ioctl needs to return data for all queues or all devices. User mode may not know that number ahead of time and may not have allocated enough memory. So the ioctl needs to be able to return the required allocation size in the failure case. This is what it's designed to do.
>
> We must keep do defensive programming to reduce potential risks & bugs, keep our system robust, we should not assume the other side work perfectly as expected.
Exactly. More information makes this easier. Removing information that has always been available breaks the ABI and does not improve robustness.
>>
>>
>>>> You mention that there can be other user mode clients other than Thunk. That's true. E.g. rocm-gdb calls KFD ioctls directly. And it depends on some of the additional information about errors. If you know of other user mode clients that are broken by the current behaviour, please point them out.
>>>>
>>>> Before anything else, we do not break existing user mode. Your patch breaks that rule. There is really no room for discussion here. I'm not seeing any reasonable argument to even consider your proposal.
>>> If a user space program needs to read arguments to do error recovery, then it is a buggy user space program that should be fixed.
>>> Usually the error code provides enough information for error handling. Why our KFD user space are exceptive?
>>
>> See my examples above. User mode is not buggy if it uses documented API behaviour, like what I showed above. In the case of -ERESTARTSYS, saving information in the argument structure is also necessary for the kernel mode driver itself, not just user mode.
> so we need a list to document what parameters are guaranteed to be safe to read in user space?
I don't think it's a safety concern. If you make it one, you need to define what you consider unsafe. From the kernel's point of view, user mode can do whatever they want with the data returned to user mode. If user mode ends up crashing because of it, that's a user mode bug unless the kernel violated some API contract.
That said, I did point you to API documentation that explains exactly what ioctl args contain useful information after certain ioctls return an error.
And Kernel must always make sure the parameters are reasonable even when ioctls fail in the first place?> I am not sure this is a good practice, we don't see many drivers rely on this kind of design.
I disagree. You can see the same behaviour in drm_ioctl.c:
retcode = drm_ioctl_kernel(filp, func, kdata, ioctl->flags);
if (copy_to_user((void __user *)arg, kdata, out_size) != 0)
retcode = -EFAULT;
...
This is the ioctl handler used by many GPU drivers in the kernel, including amdgpu.
>
> Here is an example explains why this is messy:
>
> kfd_ioctl returns -EINVAL if the requesting ioctl number > AMDKFD_CORE_IOCTL_COUNT. Here no kfd ioctl functions are invoked, therefore the arguments are not touched, they are legacy values or random values, even all 0.>
> Other kfd ioctl functions, like kfd_ioctl_create_queue and kfd_ioctl_destroy_queue, they also may return -EINVAL, but the arguments may be modified. How can the user space tell whether the arguments are modified by kernel, containing useful information, or not touched at all?
They know by reading the API documentation.
Regards,
Felix
>
> And what if the user space pull memory changes of the arguments?>
> Thanks
> Lingshan
>>
>> Regards,
>> Felix
>>
>>
>>>
>>> Thanks
>>> Lingshan
>>>> Regards,
>>>> Felix
>>>>
>>>>> Thanks
>>>>> Lingshan
>>>>>> Regards,
>>>>>> Felix
>>>>>>
>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu at amd.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 3 +++
>>>>>>> 1 file changed, 3 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>>>>> index 3e6b4736a7fe..a184ca0023b5 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>>>>> @@ -3327,6 +3327,8 @@ static long kfd_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
>>>>>>> }
>>>>>>> retcode = func(filep, process, kdata);
>>>>>>> + if (retcode)
>>>>>>> + goto err_retcode;
>>>>>>> if (cmd & IOC_OUT)
>>>>>>> if (copy_to_user((void __user *)arg, kdata, usize) != 0)
>>>>>>> @@ -3340,6 +3342,7 @@ static long kfd_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
>>>>>>> if (kdata != stack_kdata)
>>>>>>> kfree(kdata);
>>>>>>> +err_retcode:
>>>>>>> if (retcode)
>>>>>>> dev_dbg(kfd_device, "ioctl cmd (#0x%x), arg 0x%lx, ret = %d\n",
>>>>>>> nr, arg, retcode);
>
More information about the amd-gfx
mailing list