[PATCH] drm/amdgpu: fix drm leases being broken on radv

Thu Apr 18 06:46:33 UTC 2019

On Wed, Apr 17, 2019 at 7:30 PM Koenig, Christian
<Christian.Koenig at amd.com> wrote:
>
> Am 17.04.19 um 17:51 schrieb Emil Velikov:
> > Hi guys,
> >
> > On 2019/04/17, Daniel Vetter wrote:
> >> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote:
> >>> Am 17.04.19 um 14:35 schrieb Daniel Vetter:
> >>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote:
> >>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter:
> >>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
> >>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter:
> >>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
> >>>>>>>> <Christian.Koenig at amd.com> wrote:
> >>>>>>>> [SNIP]
> >>>>>>> Well, what you guys did here is a serious no-go.
> >>>>>> Not really understanding your reasons for an unconditional nak on all this
> >>>>>> still.
> >>>>> Well, let me summarize: You worked around a problem with libva by
> >>>>> breaking another driver and now proposing a rather ugly and only halve
> >>>>> backed workaround for that driver.
> >>>>>
> >>>>> This is a serious no-go. I've already send out a i915 specific change to
> >>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert
> >>>>> the offending patch.
> >>>> Oh I'm totally fine with the revert if that's what we go with. But that
> >>>> still leaves the issue that it would make sense to drop DRM_AUTH from
> >>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And
> >>>> there's fundamentally 2 options to do that:
> >>>>
> >>>> - This one here (or anything implementing the same idea), to keep radv
> >>>>     working.
> >>>>
> >>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How
> >>>>     exactly that's doen doesn't matter, but it leaves amdgpu out in the
> >>>>     cold. It also doesn't matter whether we get there with revert + lots of
> >>>>     patches, or a special hack for amdgpu, in the end amdgpu would be
> >>>>     different. Also note that your patch isn't enough, since it doesn't fix
> >>>>     the core ioctls.
> >>>>
> >>>> I think from an overall platform pov, the first is the better option.
> >>>> After all we try really hard to avoid driver special cases for these kinds
> >>>> of things.
> >>> Well, I initially thought that radv is doing something unusual or broken,
> >>> but after looking deeper into this that is actually not the case.
> >>>
> >>> What radv does is calling an IOCTL and expecting an error message when it is
> >>> not authenticated.
> >>>
> >>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to
> >>> figure out if it is authenticated or not, but I clearly remember that we
> >>> haven't done this from the beginning.
> >> Maybe in the radoen code, or some earlier internal amdgpu libdrm code?
> >> First public commit has all the bits: getauth, GetVersion, then the accel
> >> query.
> >>
> >>> Thinking more about this I don't think that this problem is actually amdgpu
> >>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all,
> >>> but only from those who have the specific issue with libva.
> >> libva was the initial motivation, the goal of Emil's patch wasn't to just
> >> duct-tape over libva. That would be easier to fix in libva. The point was
> >> that we should be able to allow this in general.
> >>
> >> And we can, except for the radv issue uncovered here.
> >>
> >> So please don't get 100% hung up on the libva thing, that wasn't really
> >> the goal, just the initial motivation. And I still thinks it makes for all
> >> drivers. So again we have:
> >>
> >> - radv hack
> >> - make amdgpu behave different from everyone else
> >> - keep tilting windmills about "pls use rendernodes, thx"
> >>
> >> I neither like tilting windmills nor making drivers behave different for
> >> no reaason at all.
> > Allow me to jump-in some emails down the line.
> >
> > First and foremost, sincere apologies for upsetting you Christian. If
> > it's of any consolidation - let me assure you the goal is _not_ to break
> > amdgpu or any other driver.
> >
> > Secondly, I would like to ask you for a list of projects so we can look and
> > investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into
> > it.
>
> That is rather hard to come by, because you would also need to look at
> all previously existing driver stacks.
>
> E.g. with the classic R300 driver for example.
>
> > On the topic of "working around libva" - sadly libva is _not_ the only
> > offender. We also had bugs in mesa and kmscube.
> >
> > Considering those are taken as a prime example of "the right way", it's very
> > likely for the mistakes to be repeated in many other projects.
> >
> > Where the common "fix" seems to be "run as root"...
> >
> >
> > As Daniel pointed out, we could be fighting the windmills or we could have a
> > small, admittedly ugly, workaround for amdgpu.
> >
> > If you don't like that workaround in the driver we could move it to core DRM.
> >
> > In either case, I would like to focus on a pragramic solution which works with
> > both old and new userspace.
>
> Well, I seriously think the original committed solution could cause a
> lot of problems and the issue with radv is only the tip of the iceberg.
>
> I mean we just have to ask our self why have we created render nodes in
> the first place? The obvious alternative was to just removed the
> authentication restrictions on the primary node which would have been
> way less code and maintenance burden.

The render nodes exist so you can have different unix permissions for
rendering as for displaying stuff. E.g. run the compositor as a
distinct user from all its clients. The other reason was to have a
clean interface without any of the historical baggage (to make stuff
like container/vm pass-through easier, since we'd know that
render-node userspace won't ever touch any of the old crap ioctls). I
don't think there's ever been a concern about breaking backwards
compatibility.

The issue here is that radv checks for "can I render", when it wants
to check for "can I display". So anything that only renders we can
ignore. That leaves the various ddx around, which historically have
run as root, or if they run as non-root, logind or similar is giving
them the master kms node already in master mode (afaiui at least).
That leaves the non-main compositor userspace, of which there's only
vkdisplay. I don't see much additional possibilities for regressions
to creep in.

Emil probably has a better grasp on this, since he's got a bunch of
patches to roll out drmIsMaster all over.

> I need to dig up the mailing list archive, but I strongly think that one
> of the main arguments for this approach was to NOT break existing userspace.
>
> Even taking into account that we now don't need to deal with UMS and
> really really old userspace drivers any more it still feels like a to
> high risk going down that route.

We change the undefined corners of the uapi all the time to improve
the overall ecosystem, and occasionally we end up with
DRIVER_FOO_IS_TERRIBLY to hack around it. It happens, and I still
think the choice here is how exactly amdgpu.ko copes with radv:
- the hack proposed here
- a DRIVER_AUTH_MEANS_AUTH flag to give amdgpu the old behaviour, and
let you folks deal with a (likely, not guaranteed) influx "but this
works on $any_other_driver"

I guess up to you which patch Emil should include when he resubmits
the patch for 5.3. We've done both in the past, but generally we're
trying to limit the impact of these hacks as much as possible. Hence
why I'm still in favour of the first one.

"keep tilting windmills" is imo not an option, and I'm happy to handle
any other fallout Emil's patch uncovers, if there is anything else - I
did sketch this hack here quickly on irc right when we got the report.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch