[PATCH] drm/amdgpu: fix drm leases being broken on radv

Koenig, Christian Christian.Koenig at amd.com
Thu Apr 18 08:56:49 UTC 2019


Am 18.04.19 um 10:26 schrieb Daniel Vetter:
> On Thu, Apr 18, 2019 at 08:06:03AM +0000, Koenig, Christian wrote:
>> Am 18.04.19 um 08:46 schrieb Daniel Vetter:
>>> On Wed, Apr 17, 2019 at 7:30 PM Koenig, Christian
>>> <Christian.Koenig at amd.com> wrote:
>>>> Am 17.04.19 um 17:51 schrieb Emil Velikov:
>>>>> Hi guys,
>>>>>
>>>>> On 2019/04/17, Daniel Vetter wrote:
>>>>>> On Wed, Apr 17, 2019 at 02:46:06PM +0200, Christian König wrote:
>>>>>>> Am 17.04.19 um 14:35 schrieb Daniel Vetter:
>>>>>>>> On Wed, Apr 17, 2019 at 12:06:32PM +0000, Koenig, Christian wrote:
>>>>>>>>> Am 17.04.19 um 14:00 schrieb Daniel Vetter:
>>>>>>>>>> On Wed, Apr 17, 2019 at 11:18:35AM +0000, Koenig, Christian wrote:
>>>>>>>>>>> Am 17.04.19 um 13:06 schrieb Daniel Vetter:
>>>>>>>>>>>> On Wed, Apr 17, 2019 at 12:29 PM Koenig, Christian
>>>>>>>>>>>> <Christian.Koenig at amd.com> wrote:
>>>>>>>>>>>> [SNIP]
>>>>>>>>>>> Well, what you guys did here is a serious no-go.
>>>>>>>>>> Not really understanding your reasons for an unconditional nak on all this
>>>>>>>>>> still.
>>>>>>>>> Well, let me summarize: You worked around a problem with libva by
>>>>>>>>> breaking another driver and now proposing a rather ugly and only halve
>>>>>>>>> backed workaround for that driver.
>>>>>>>>>
>>>>>>>>> This is a serious no-go. I've already send out a i915 specific change to
>>>>>>>>> remove DRM_AUTH from IOCTLs which also have DRM_RENDER_ALLOW and revert
>>>>>>>>> the offending patch.
>>>>>>>> Oh I'm totally fine with the revert if that's what we go with. But that
>>>>>>>> still leaves the issue that it would make sense to drop DRM_AUTH from
>>>>>>>> DRIVER_RENDER capable drivers (not just for libva, but in general). And
>>>>>>>> there's fundamentally 2 options to do that:
>>>>>>>>
>>>>>>>> - This one here (or anything implementing the same idea), to keep radv
>>>>>>>>       working.
>>>>>>>>
>>>>>>>> - We drop DRM_AUTH from all drivers with DRIVER_RENDER except amdgpu. How
>>>>>>>>       exactly that's doen doesn't matter, but it leaves amdgpu out in the
>>>>>>>>       cold. It also doesn't matter whether we get there with revert + lots of
>>>>>>>>       patches, or a special hack for amdgpu, in the end amdgpu would be
>>>>>>>>       different. Also note that your patch isn't enough, since it doesn't fix
>>>>>>>>       the core ioctls.
>>>>>>>>
>>>>>>>> I think from an overall platform pov, the first is the better option.
>>>>>>>> After all we try really hard to avoid driver special cases for these kinds
>>>>>>>> of things.
>>>>>>> Well, I initially thought that radv is doing something unusual or broken,
>>>>>>> but after looking deeper into this that is actually not the case.
>>>>>>>
>>>>>>> What radv does is calling an IOCTL and expecting an error message when it is
>>>>>>> not authenticated.
>>>>>>>
>>>>>>> It looks like libdrm_amdgpu has switched to using DRM_IOCTL_GET_CLIENT to
>>>>>>> figure out if it is authenticated or not, but I clearly remember that we
>>>>>>> haven't done this from the beginning.
>>>>>> Maybe in the radoen code, or some earlier internal amdgpu libdrm code?
>>>>>> First public commit has all the bits: getauth, GetVersion, then the accel
>>>>>> query.
>>>>>>
>>>>>>> Thinking more about this I don't think that this problem is actually amdgpu
>>>>>>> specific. So I wouldn't drop DRM_AUTH from all render node drivers at all,
>>>>>>> but only from those who have the specific issue with libva.
>>>>>> libva was the initial motivation, the goal of Emil's patch wasn't to just
>>>>>> duct-tape over libva. That would be easier to fix in libva. The point was
>>>>>> that we should be able to allow this in general.
>>>>>>
>>>>>> And we can, except for the radv issue uncovered here.
>>>>>>
>>>>>> So please don't get 100% hung up on the libva thing, that wasn't really
>>>>>> the goal, just the initial motivation. And I still thinks it makes for all
>>>>>> drivers. So again we have:
>>>>>>
>>>>>> - radv hack
>>>>>> - make amdgpu behave different from everyone else
>>>>>> - keep tilting windmills about "pls use rendernodes, thx"
>>>>>>
>>>>>> I neither like tilting windmills nor making drivers behave different for
>>>>>> no reaason at all.
>>>>> Allow me to jump-in some emails down the line.
>>>>>
>>>>> First and foremost, sincere apologies for upsetting you Christian. If
>>>>> it's of any consolidation - let me assure you the goal is _not_ to break
>>>>> amdgpu or any other driver.
>>>>>
>>>>> Secondly, I would like to ask you for a list of projects so we can look and
>>>>> investigate properly. So far you've mentioned libdrm_amdgpu - I'll look into
>>>>> it.
>>>> That is rather hard to come by, because you would also need to look at
>>>> all previously existing driver stacks.
>>>>
>>>> E.g. with the classic R300 driver for example.
>>>>
>>>>> On the topic of "working around libva" - sadly libva is _not_ the only
>>>>> offender. We also had bugs in mesa and kmscube.
>>>>>
>>>>> Considering those are taken as a prime example of "the right way", it's very
>>>>> likely for the mistakes to be repeated in many other projects.
>>>>>
>>>>> Where the common "fix" seems to be "run as root"...
>>>>>
>>>>>
>>>>> As Daniel pointed out, we could be fighting the windmills or we could have a
>>>>> small, admittedly ugly, workaround for amdgpu.
>>>>>
>>>>> If you don't like that workaround in the driver we could move it to core DRM.
>>>>>
>>>>> In either case, I would like to focus on a pragramic solution which works with
>>>>> both old and new userspace.
>>>> Well, I seriously think the original committed solution could cause a
>>>> lot of problems and the issue with radv is only the tip of the iceberg.
>>>>
>>>> I mean we just have to ask our self why have we created render nodes in
>>>> the first place? The obvious alternative was to just removed the
>>>> authentication restrictions on the primary node which would have been
>>>> way less code and maintenance burden.
>>> The render nodes exist so you can have different unix permissions for
>>> rendering as for displaying stuff. E.g. run the compositor as a
>>> distinct user from all its clients. The other reason was to have a
>>> clean interface without any of the historical baggage (to make stuff
>>> like container/vm pass-through easier, since we'd know that
>>> render-node userspace won't ever touch any of the old crap ioctls). I
>>> don't think there's ever been a concern about breaking backwards
>>> compatibility.
>>>
>>> The issue here is that radv checks for "can I render", when it wants
>>> to check for "can I display". So anything that only renders we can
>>> ignore. That leaves the various ddx around, which historically have
>>> run as root, or if they run as non-root, logind or similar is giving
>>> them the master kms node already in master mode (afaiui at least).
>>> That leaves the non-main compositor userspace, of which there's only
>>> vkdisplay. I don't see much additional possibilities for regressions
>>> to creep in.
>>>
>>> Emil probably has a better grasp on this, since he's got a bunch of
>>> patches to roll out drmIsMaster all over.
>>>
>>>> I need to dig up the mailing list archive, but I strongly think that one
>>>> of the main arguments for this approach was to NOT break existing userspace.
>>>>
>>>> Even taking into account that we now don't need to deal with UMS and
>>>> really really old userspace drivers any more it still feels like a to
>>>> high risk going down that route.
>>> We change the undefined corners of the uapi all the time to improve
>>> the overall ecosystem, and occasionally we end up with
>>> DRIVER_FOO_IS_TERRIBLY to hack around it. It happens, and I still
>>> think the choice here is how exactly amdgpu.ko copes with radv:
>>> - the hack proposed here
>>> - a DRIVER_AUTH_MEANS_AUTH flag to give amdgpu the old behaviour, and
>>> let you folks deal with a (likely, not guaranteed) influx "but this
>>> works on $any_other_driver"
>>>
>>> I guess up to you which patch Emil should include when he resubmits
>>> the patch for 5.3. We've done both in the past, but generally we're
>>> trying to limit the impact of these hacks as much as possible. Hence
>>> why I'm still in favour of the first one.
>> Well and I will still NAK both approaches.
>>
>> You can of course go ahead and remove the DRM_AUTH flags from the i915
>> IOCTLs to make libva happy, but please not a general approach which
>> touches all DRM drivers.
>>
>> And from Dave's response I think he is following my argumentation here,
>> so I don't see a general approach to be added for 5.3 either.
>>
>>> "keep tilting windmills" is imo not an option, and I'm happy to handle
>>> any other fallout Emil's patch uncovers, if there is anything else - I
>>> did sketch this hack here quickly on irc right when we got the report.
>> Well this is not about tilting windmills, but rather good maintenance of
>> backward compatibility and rejecting changing with potential unforeseen
>> consequences.
> We (well I) have been doing stuff like this since forever. It's just that
> thus far it hasn't been amd that stuck out with the need for a hack, but
> other drivers. It's a tradeoff in the end, and a tradeoff we're can do due
> to our open source userspace requirement - in case something does go
> sideways, we have the full history of anything affected. So you're
> proposed we revert all these other changes and cleanups too, because they
> could be risky (many turned out to actually be risky, like this one here
> too)?

Well in general yes, a lot of those changes looked unnecessary risky to me.

We of course can't revert those changes now because they are now UAPI as 
well and it would be risky to modify them. But I think we should have a 
much stricter approach to modifying exiting UAPI.

Additional to that I actually think that the requirement of open source 
userspace stacks is a mistake.

This requirement came just from exactly that approach to try to modify 
exiting UAPI instead of accepting the fact that UAPI backward 
compatibility is something you should not mess with.

Nobody else in the kernel is having that open source user space 
requirement and it actually seems that a lot of people see that as just 
a justification to keep Mesa alive.

What you can do is to require unit tests so that everybody is able to 
test for regressions independent of installed userspace software.

> Ok correction: amd has stuck out in the past too, there was some vblank vs
> pageflip stuff where we needed to do some pretty clever tricks to both
> have the new stricter semantics for everyone, while keeping the amd ddx
> happy still. This aint new at all, we fixed up the regressions and moved
> on.

I'm wasn't much involved into this, so I can't say anything about it.

> tldr; You're categorical rejection doesn't make sense to me.
> Slightly over the top, but this feels like gut reaction to radv being a
> thorn for amd.

Well radv is pointing out exactly what's going wrong with AMDs internal 
development process, so I'm really in favor of that.

And I'm certainly in favor or open source projects in general, but as 
kernel developer I must not favor a closed or open userspace stack 
running on top of my driver.

What I care about is good UAPI design, maintainable code and not 
re-inventing the wheel to often, but what stack runs on top is a 
political decision and not a technical one.

Regards,
Christian.

> -Daniel



More information about the dri-devel mailing list