Fw: [Intel-gfx] [PATCH 1/5] dri: cleanup debugfs error handling

Mon Oct 11 15:13:49 UTC 2021

On 10/11/2021 4:19 PM, Christian König wrote:
> Am 08.10.21 um 17:11 schrieb Greg KH:
>> On Fri, Oct 08, 2021 at 04:22:06PM +0200, Christian König wrote:
>>> Hi guys,
>>>
>>> thanks Nirmoy for forwarding this, there is seriously something 
>>> wrong with
>>> the AMD mail servers.
>>>
>>>> On 10/8/2021 1:07 PM, Greg KH wrote:
>>>>> On Fri, Oct 08, 2021 at 12:40:47PM +0300, Jani Nikula wrote:
>>>>>> On Fri, 08 Oct 2021, Nirmoy Das <nirmoy.das at amd.com> wrote:
>>>>>>> Debugfs API returns encoded error instead of NULL.
>>>>>>> This patch cleanups drm debugfs error handling to
>>>>>>> properly set dri and its minor's root dentry to NULL.
>>>>>>>
>>>>>>> Also do not error out if dri/minor debugfs directory
>>>>>>> creation fails as a debugfs error is not a fatal error.
>>>>>> Cc: Greg
>>>>>>
>>>>>> I thought this is the opposite of what Greg's been telling 
>>>>>> everyone to
>>>>>> do with debugfs.
>>>>> Yes, that is not good.
>>>>>
>>>>> You should never care about the result of a debugfs_create* call.  
>>>>> Just
>>>>> take the result, and if it is a directory, save it off to use it for
>>>>> creating a file, no need to check anything.
>>> While I totally agree to the intention behind that I'm pretty sure 
>>> there are
>>> some consequences which are a rather bad idea.
>>>
>>> First of all the debugfs functions return a normal struct dentry 
>>> pointer and
>>> keeping that around unchecked means that we now have pointers in 
>>> structure
>>> members which can suddenly be an ERR_PTR() without any documentation 
>>> that
>>> they are not real pointers.
>>>
>>> What we could do instead is something like returning a typedef or a
>>> structure with the dentry pointer embedded and then document that 
>>> those are
>>> special, can be ERR_PTR() and should never be touched except for the 
>>> debugfs
>>> functions itself.
>> I agree, and am slowly working toward that, but am not there yet.  If
>> you look, I have removed the return values for almost all
>> debugfs_create_* calls, only a few remain.
>>
>> For now, just treat it like a "void" pointer that you have no context
>> for and all will be fine.
>
> Ok in that case we should just document on the saved dentry that this 
> pointer is not necessary valid.
>
> Nirmoy, can you take care of that?

Sure, I will send patches to document and clean our drm core debugfs code.

Thanks,

Nirmoy

>
>>> The other issue is that adding the same file twice is unfortunately 
>>> a rather
>>> common coding error which we don't catch or complain about any more 
>>> if we
>>> don't look at the return value at all.
>> How can you add the same debugfs file twice?  You have different
>> directories.
>
> We had multiple occasions triggering that:
>
> 1. Code accidentally initializing a component twice.
>
> Except for the debugfs entry and a bit memory leak we had no negative 
> effect otherwise and wouldn't had detected that without the error 
> message from debugfs.
>
> 2. Driver not cleaning up properly on hotplug. E.g. old subdirectory 
> not properly removed and re-plug.
>
> 3. Driver trying to use the same debugfs file for different devices.
>
>> And really, who cares, it's debugging code!  :)
>
> Well especially cause 3 once took me a day to figure out that I'm 
> looking at the wrong hardware state because the driver was handling 
> two devices, but only one showed up under debugfs.
>
>>>>> And then throw it away, later, when you want to remove the directory,
>>>>> look it up with a call to debugfs_lookup() and pass that to
>>>>> debugfs_remove() (which does so recursively).
>>>>>
>>>>> There should never be a need to save, or check, the result of any
>>>>> debugfs call.  If so, odds are it is being used incorrectly.
>>> Yeah, exactly that's the problem I see here.
>>>
>>> We save the return value because the DRM subsystem is creating a 
>>> debugfs
>>> directory for the drivers to use.
>> That's fine for now, not a big deal.  And even if there is an error,
>> again, you can always feed that error back into the debugfs subsystem on
>> another call and it will handle it correctly.
>
> Problem is it isn't, we have a crash because the member isn't a 
> pointer but an ERR_PTR instead.
>
> And adding IS_ERR checks all around is even worse than adding NULL 
> checks.
>
> Regards,
> Christian.
>
>>
>> thanks,
>>
>> greg k-h
>