[RFC PATCH 0/3] gpu: nova-core: add basic timer subdevice implementation

Jason Gunthorpe jgg at nvidia.com
Fri Mar 7 14:55:57 UTC 2025


On Fri, Mar 07, 2025 at 02:09:12PM +0100, Simona Vetter wrote:

> > A driver can do a health check immediately in remove() and make a
> > decision if the device is alive or not to speed up removal in the
> > hostile hot unplug case.
> 
> Hm ... I guess when you get an all -1 read you check with a specific
> register to make sure it's not a false positive? Since for some registers
> that's a valid value.

Yes. mlx5 has HW designed to support this, but I imagine on most
devices you could find an ID register or something that will never
read as -1.
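
Something along these lines (a minimal sketch; read32() and
BOOT_ID_OFFSET are made-up stand-ins, not the nova-core or mlx5 API):

    // A surprise-removed PCI device returns all-ones for every MMIO
    // read. Since 0xffff_ffff is a legitimate value for some registers,
    // confirm against an ID register that can never read as all-ones.
    const BOOT_ID_OFFSET: usize = 0x0; // hypothetical ID register

    fn device_is_gone(read32: impl Fn(usize) -> u32) -> bool {
        read32(BOOT_ID_OFFSET) == !0u32
    }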

> - The "at least we don't blow up with memory safety issues" bare minimum
>   that the rust abstractions should guarantee. So revocable and friends.

I still really dislike revocable because it imposes a cost that is
unnecessary.
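
To be concrete about the cost: every access to a Revocable resource in
the Rust abstractions has to go through try_access(), which checks
whether the resource was revoked and holds a guard for the duration of
the access. Sketch below; Registers and counter() are hypothetical:

    use kernel::revocable::Revocable;

    fn read_counter(regs: &Revocable<Registers>) -> Option<u32> {
        // try_access() returns None once the resource is revoked; this
        // check is paid on every single access, even in a driver that
        // always tears down correctly in remove().
        let guard = regs.try_access()?;
        Some(guard.counter())
    }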

> And I think the latter safety fallback does not prevent you from doing the
> full fancy design, e.g. for revocable resources that only happens after
> your explicitly-coded ->remove() callback has finished. Which means you
> still have full access to the hw like anywhere else.

Yes, if you use Rust bindings with something like RDMA, then I would
expect that by the time remove() is done everything has been cleaned up
and all the revocable machinery was useless and never used.

This is why I dislike revoke so much. It adds a bunch of garbage
all over the place that is *never used* if the driver is working
correctly.

I believe it is much better to runtime check that the driver is
correct and not burden the API design with this.

Giving people these features will only encourage them to write wrong
drivers.
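
By runtime check I mean something like the sketch below: the core
verifies after remove() that the driver actually let go of everything,
and complains loudly if not. mmio_map_count() and name() are
hypothetical helpers, not an existing API:

    fn post_remove_check(dev: &Device) {
        // A correct driver reaches zero here; a wrong one gets a splat
        // instead of silently relying on revocable wrappers.
        if dev.mmio_map_count() != 0 {
            pr_warn!("{}: leaked MMIO mappings past remove()\n", dev.name());
        }
    }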

This is not even a new idea: devm introduced automatic lifetime
management into the kernel, and I've sat in presentations about how
devm has produced all sorts of bug classes because of misuse. :\

> Does this sounds like a possible conclusion of this thread, or do we need
> to keep digging?

IDK, I think this should be socialized more. It is important as it
affects all drivers from here on out, and it is radically different
from how the kernel works today.

> Also now that I look at this problem as a two-level issue, I think drm is
> actually a lot better than what I explained. If you clean up driver state
> properly in ->remove (or as stack automatic cleanup functions that run
> before all the mmio/irq/whatever stuff disappears), then we are largely
> there already with being able to fully quiesce driver state enough to
> make sure no new requests can sneak in.

That is the typical subsystem design!
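
i.e. roughly this ordering on the teardown path (illustrative names
only, expressed as a Drop impl):

    impl Drop for DriverData {
        fn drop(&mut self) {
            // 1. Unregister from the subsystem: no new requests arrive.
            self.unregister();
            // 2. Drain whatever was already in flight.
            self.flush_pending();
            // 3. Only now may the MMIO/IRQ resources disappear.
        }
    }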

Thanks,
Jason

