[PATCH] drm/xe: Fix fault on fd close when wedged

Tue Dec 17 22:20:51 UTC 2024

On Thu, Dec 12, 2024 at 08:30:03AM -0800, Matthew Brost wrote:
>On Wed, Dec 11, 2024 at 10:26:15PM -0600, Lucas De Marchi wrote:
>> On Wed, Dec 11, 2024 at 07:51:57PM -0800, Matthew Brost wrote:
>> > On Wed, Dec 11, 2024 at 02:53:32PM -0800, Lucas De Marchi wrote:
>> > > If device is wedged, the final run ticks update for the client should be
>> > > skipped as it's already unmapped. Fix this pagefault when forcing a
>> >
>> > Where does exec queue get unmapped on wedging a device?
>>
>> it's the lrc we are trying to read: lrc->bo, with offset == timestamp.
>>
>> I thought it was part of the xe_gt_declare_wedged(), but I'm not
>> following what triggers it - the only thing that should trigger that
>> would be the xe_lrc_put() and the refcount reaching 0. Something is not
>> adding up here, I will have to trace the destroy to see what's going on.
>>
>
>Quickly reached the same conclusion - accessing the LRC BO should be safe
>until the final put, hence my question. It does appear something else
>weird is going on here.

finally had some time to analyze this again. So the issue is actually a
race between unbind and close, because the test in question here does:

	fd = drm_open_driver(DRIVER_XE);
	...
	fd = xe_sysfs_driver_do(fd, pci_slot, XE_SYSFS_DRIVER_REBIND);
	drm_close_driver(fd);

note that the application is "leaking" the first fd when it opens the device
again. When the fd is closed on termination, the device already went
through unbind (i.e. xe_pci_remove()).
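
To illustrate the fd pattern outside of the test, here is a minimal sketch
in plain POSIX C. It is only an assumption-laden stand-in: reopen_device()
is a hypothetical helper playing the role of the rebind call above, and any
openable path (e.g. /dev/null) stands in for the DRM node. It shows that
overwriting the variable leaves the first fd open until the process exits.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the rebind helper: it opens the device again
 * and returns the new fd. Note it does NOT close old_fd.
 */
static int reopen_device(int old_fd, const char *path)
{
	(void)old_fd;
	return open(path, O_RDONLY);
}

/* Returns 1 if fd still refers to an open file in this process. */
static int fd_is_open(int fd)
{
	return fcntl(fd, F_GETFD) != -1;
}

/*
 * Reproduces the pattern: open, overwrite the variable with a second fd,
 * close only the second one. Stores the first fd in *leaked and returns
 * 1 if it is still open afterwards.
 */
static int leaky_sequence(const char *path, int *leaked)
{
	int fd = open(path, O_RDONLY);

	assert(fd >= 0);
	*leaked = fd;

	fd = reopen_device(fd, path);	/* first fd is now unreachable via 'fd' */
	assert(fd >= 0);
	close(fd);			/* closes only the second fd */

	return fd_is_open(*leaked);
}
```

Calling leaky_sequence("/dev/null", &leaked) leaves the first fd open, so
its release only happens at process termination, which in the test is after
the unbind.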

I will send a new patch to fix that.

Lucas De Marchi

>
>Matt
>
>> Lucas De Marchi