[PATCH] devcoredump: increase the device delete timeout to 10 mins

Abhinav Kumar quic_abhinavk at quicinc.com
Tue Feb 8 21:04:43 UTC 2022


Hi Johannes

Thanks for the response.

On 2/8/2022 12:35 PM, Johannes Berg wrote:
> On Tue, 2022-02-08 at 11:44 -0800, Abhinav Kumar wrote:
>> There are cases where depending on the size of the devcoredump and the speed
>> at which the usermode reads the dump, it can take longer than the current 5 mins
>> timeout.
>>
>> This can lead to incomplete dumps as the device is deleted once the timeout expires.
>>
>> One example is below where it took 6 mins for the devcoredump to be completely read.
>>
>> 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
>> 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
>>
>> Increase the timeout to 10 mins to accommodate system delays and large coredump
>> sizes.
>>
> 
> No real objection, I guess, but can the data actually disappear *while*
> the sysfs file is open?!
> 
> Or did it take 5 minutes to open the file?
> 
> If the former, maybe we should fix that too (or instead)?
> 
> johannes

It opened the file right away but could not finish reading.

The device gets deleted, so the corresponding data node disappears too
(the data node lives under devcd*/data).

static void devcd_del(struct work_struct *wk)
{
	struct devcd_entry *devcd;

	devcd = container_of(wk, struct devcd_entry, del_wk.work);

	device_del(&devcd->devcd_dev);
	put_device(&devcd->devcd_dev);
}

Are you suggesting we implement logic like this:

a) If usermode has started reading the data but has not finished yet
(we can detect the former with something like devcd->data_read_ongoing
= 1, and we know it has finished when it acks, at which point we can
clear this flag), then in the del_wk timeout handler we delay the delete
timer by another TIMEOUT to give usermode time to finish reading the data?

b) If usermode acks, we clear the flag and delete the device as usual.

But there is a corner case here:

c) If usermode starts the read but then crashes for some reason, the
timer will expire and try to delete the device, but it will see that
usermode is still marked as reading and will keep the device. How do we
detect this case?
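
To make (a) through (c) concrete, here is a rough userspace model of that
logic. It is only a sketch, not kernel code: struct devcd_sim and the
devcd_sim_* helpers are hypothetical names made up for illustration, and
data_read_ongoing is the assumed flag from (a). The timer is modelled as a
plain function call rather than a delayed work item.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a devcoredump entry; not the kernel struct. */
struct devcd_sim {
	bool data_read_ongoing;	/* assumed flag: set on read start, cleared on ack */
	bool deleted;		/* true once the device would have been removed */
};

/* Usermode starts reading the data node. */
static void devcd_sim_read_start(struct devcd_sim *d)
{
	if (!d->deleted)
		d->data_read_ongoing = true;
}

/* (b) Usermode acks: clear the flag and delete the device as usual. */
static void devcd_sim_ack(struct devcd_sim *d)
{
	d->data_read_ongoing = false;
	d->deleted = true;
}

/*
 * (a) Timer expiry. Returns true if the timer would be re-armed for
 * another TIMEOUT, false if the device is (or already was) deleted.
 */
static bool devcd_sim_timeout(struct devcd_sim *d)
{
	if (d->deleted)
		return false;
	if (d->data_read_ongoing)
		return true;	/* reader still active: keep the device */
	d->deleted = true;
	return false;
}

/* Fire the timer up to max_expiries times; return how often it re-armed. */
static int devcd_sim_run_timer(struct devcd_sim *d, int max_expiries)
{
	int rearms = 0;

	assert(max_expiries >= 0);
	for (int i = 0; i < max_expiries; i++) {
		if (!devcd_sim_timeout(d))
			break;
		rearms++;
	}
	return rearms;
}
```

In this model, calling devcd_sim_read_start() and then never calling
devcd_sim_ack() makes devcd_sim_run_timer() re-arm on every expiry, which is
exactly the stuck state described in (c).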

That's why I thought the easier way right now might be to increase
the timeout.

