BUG / design challenge: vmwgfx + PRIME handle free == clobbering errno

Fri Jan 12 09:30:52 UTC 2024

On Fri, Dec 22, 2023 at 08:27:14PM -0500, Zack Rusin wrote:
> On Tue, Dec 19, 2023 at 11:15 AM Stefan Hoffmeister
> <stefan.hoffmeister at econos.de> wrote:
> >
> >
> > Hi,
> >
> > vmwgfx implements drmPrimeFDToHandle in terms of the TTM resource manager.
> >
> > At the same time, the driver advertises
> >
> >         .driver_features =
> >         DRIVER_MODESET | DRIVER_RENDER | DRIVER_ATOMIC | DRIVER_GEM,
> >
> > Given that, a userland application will call drmCloseBufferHandle to
> > correctly release any previously acquired handle.
> >
> > In kernel language, this translates to ioctl
> > DRM_IOCTL_PRIME_FD_TO_HANDLE and DRM_IOCTL_GEM_CLOSE.
> >
> > Problems:
> >
> > a) because the application uses drmCloseBufferHandle, and that goes
> > through GEM (the driver claims "I am GEM"!), vmwgfx will always return
> > EINVAL, because the driver is _not_ GEM, but TTM on that aspect.
> >
> > This will always clobber errno for userspace, which is not at all
> > helpful for meaningful error handling.
> >
> > Additionally, the driver offers no formal API contract which would
> > allow applications to detect that the driver does not like be talked
> > to like a GEM driver, although it is _claims_ to be GEM driver.

Yeah this is really badly breaking expectations of generic userspace :-/

> > b) there is no (documented) means to release the handle acquired from
> > a call to drmPrimeFDToHandle; this implies that in principle any call
> > to drmPrimeFDToHandle will leak the handle.
> >
> > Questions:
> >
> > a) how can errno clobbering by vmwgfx with EINVAL be avoided?
> >
> > b) what is the correct way to release the TTM resources allocated
> > through drmPrimeFDToHandle?
> >
> >
> > For context: the KDE Plasma Desktop kwin DRM integration layer makes
> > extensive use of drmPrimeFDToHandle. With the way the VMware vmwgfx
> > driver behaves, I see these options:
> >
> > a) correctly check the return code of drmCloseBufferHandle and blindly
> > log all the EINVAL errors triggered by the VMware driver, putting
> > blame on the application for mis-managing handles
> > b) drop any return code from drmCloseBufferHandle onto the floor,
> > completely disregarding application integrity checking
> > c) hard-code a check for "vmwgfx" as the driver in use and spam "this
> > driver owned by VMware / Broadcom is broken, ignore the following
> > EINVAL and secondary effects" for integrity checks
> >
> > I do not like any of these options, to be honest.
> >
> > Many thanks for any input
> 
> Thanks for the report! This is mainly an artifact of vmwgfx predating
> GEM and really basically all of drm, so the ioctl interface on top of
> which vmwgfx operates is a bit bonkers. I know there are a number of
> examples where drm graphics drivers were breaking the ioctl interface,
> but we've tried avoiding that in vmwgfx and so supporting our old
> userspace and new userspace, which are incompatible, is incredibly
> complex and fragile. Our approach was to, in general, deal with those
> as they come. Our userspace, also in general, deals with surfaces,
> which aren't GEM's, as you've noticed, but instead are backed by a GEM
> object, so expects the handle to prime to not be GEM's. As you can
> imagine it's a bit hard to make sure ioctl's for which userspace has
> the opposite expectations are working fine (we've done it for some
> already though, the "dumb" ones being an example).
> 
> If you'd have an IGT test that tests for this without requiring CRC's
> then we'd get it fixed quickly. Otherwise making it work is difficult,
> not because this particular bug is difficult to fix, but because it's
> difficult to fix it while keeping the xorg drivers from 2012 working.
> We'll likely just make sure the prime ioctl's return proper GEM
> objects for clients that advertise certain cap's. Is KWin advertising
> DRM_CLIENT_CAP_CURSOR_PLANE_HOTSPOT?  That'd be a trivial, albeit
> unexpected, way of making sure the old userspace gets the semantics it
> expects, while making sure the new clients get proper behavior.

Instead of trying to detect generic userspace, which I think is futile
because there's really not anything beyoned "generic userspace expects
gem handles, so give it to them", would it be possible to detect the
legacy driver somehow? Because generic kms clients are really not required
to use any specific feature at all that kms provides, so I don't think
there's a way to reliably detect those that works across everything.

Like some legacy ioctl/parameter set that only the legacy xorg-modesetting
driver uses should fit the bill here, or maybe a combination of them. Some
ideas:

- if you get an ADDFB call with a legacy buffer (not a gem buffer), mark
  the client as legacy and then return legacy buffer handles, not gem ones
  from fd2handle.

- maybe similar for "you get a legacy buffer handle for SETCURSOR ioctl"

- maybe there's a unique setup sequence that only the xorg driver uses?
  Like a sequence of ioctl/ioctl parameters that's unique to the old xorg
  driver.

It's still brittle, but since no one's working on the legacy driver
anymore you only need to chase this once and then be done. And it /should/
allow you to implement standard gem behaviour by default for all these
standard ioctls that generic userspace expects to work, and only use the
old behaviour when you detect the legacy userspace and have set the
fpriv->uses_legacy_buffer_handles flag.

A really bonkers fix would be to merge the buffer handle xarrays into one
and use one of the tag bits to denote whether it's a legacy or gem one,
and then allow GEM_CLOSE to close them all. But that's really ugly and
probably way too much surgery.

Cheers, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch