[PATCH 01/13] drm/amdgpu: introduce and honour DRM_FORCE_AUTH workaround

Wed May 29 13:14:52 UTC 2019

Am 29.05.19 um 15:03 schrieb Emil Velikov:
> On 2019/05/29, Dave Airlie wrote:
>> On Wed, 29 May 2019 at 02:47, Emil Velikov <emil.l.velikov at gmail.com> wrote:
>>> On 2019/05/28, Koenig, Christian wrote:
>>>> Am 28.05.19 um 18:10 schrieb Emil Velikov:
>>>>> On 2019/05/28, Daniel Vetter wrote:
>>>>>> On Tue, May 28, 2019 at 10:03 AM Koenig, Christian
>>>>>> <Christian.Koenig at amd.com> wrote:
>>>>>>> Am 28.05.19 um 09:38 schrieb Daniel Vetter:
>>>>>>>> [SNIP]
>>>>>>>>> Might be a good idea looking into reverting it partially, so that at
>>>>>>>>> least command submission and buffer allocation is still blocked.
>>>>>>>> I thought the issue is a lot more than vainfo, it's pretty much every
>>>>>>>> hacked up compositor under the sun getting this wrong one way or
>>>>>>>> another. Thinking about this some more, I also have no idea how you'd
>>>>>>>> want to deprecate rendering on primary nodes in general. Apparently
>>>>>>>> that breaks -modesetting already, and probably lots more compositors.
>>>>>>>> And it looks like we're finally achieve the goal kms set out to 10
>>>>>>>> years ago, and new compositors are sprouting up all the time. I guess
>>>>>>>> we could just break them all (on new hardware) and tell them to all
>>>>>>>> suck it up. But I don't think that's a great option. And just
>>>>>>>> deprecating this on amdgpu is going to be even harder, since then
>>>>>>>> everywhere else it'll keep working, and it's just amdgpu.ko that looks
>>>>>>>> broken.
>>>>>>>>
>>>>>>>> Aside: I'm not supporting Emil's idea here because it fixes any issues
>>>>>>>> Intel has - Intel doesn't care. I support it because reality sucks,
>>>>>>>> people get this render vs. primary vs. multi-gpu prime wrong all the
>>>>>>>> time (that's also why we have hardcoded display+gpu pairs in mesa for
>>>>>>>> the various soc combinations out there), and this looks like a
>>>>>>>> pragmatic solution. It'd be nice if every compositor and everything
>>>>>>>> else would perfectly support multi gpu and only use render nodes for
>>>>>>>> rendering, and only primary nodes for display. But reality is that
>>>>>>>> people hack on stuff until gears on screen and then move on to more
>>>>>>>> interesting things (to them). So I don't think we'll ever win this :-/
>>>>>>> Yeah, but this is a classic case of working around user space issues by
>>>>>>> making kernel changes instead of fixing user space.
>>>>>>>
>>>>>>> Having privileged (output control) and unprivileged (rendering control)
>>>>>>> functionality behind the same node is a mistake we have made a long time
>>>>>>> ago and render nodes finally seemed to be a way to fix that.
>>>>>>>
>>>>>>> I mean why are compositors using the primary node in the first place?
>>>>>>> Because they want to have access to privileged resources I think and in
>>>>>>> this case it is perfectly ok to do so.
>>>>>>>
>>>>>>> Now extending unprivileged access to the primary node actually sounds
>>>>>>> like a step into the wrong direction to me.
>>>>>>>
>>>>>>> I rather think that we should go down the route of completely dropping
>>>>>>> command submission and buffer allocation through the primary node for
>>>>>>> non master clients. And then as next step at some point drop support for
>>>>>>> authentication/flink.
>>>>>>>
>>>>>>> I mean we have done this with UMS as well and I don't see much other way
>>>>>>> to move forward and get rid of those ancient interface in the long term.
>>>>>> Well kms had some really good benefits that drove quick adoption, like
>>>>>> "suspend/resume actually has a chance of working" or "comes with
>>>>>> buffer management so you can run multiple gears".
>>>>>>
>>>>>> The render node thing is a lot more niche use case (prime, better priv
>>>>>> separation), plus "it's cleaner design". And the "cleaner design" part
>>>>>> is something that empirically doesn't seem to matter :-/ Just two
>>>>>> examples:
>>>>>> - KHR_display/leases just iterated display resources on the fd needed
>>>>>> for rendering (and iirc there was even a patch to expose that for
>>>>>> render nodes too so it works with DRI3), because implementing
>>>>>> protocols is too hard. Barely managed to stop that one before it
>>>>>> happened.
>>>>>> - Various video players use the vblank ioctl on directly to schedule
>>>>>> frames, without telling the compositor. I discovered that when I
>>>>>> wanted to limite the vblank ioctl to master clients only. Again,
>>>>>> apparently too hard to use the existing extensions, or fix the bugs in
>>>>>> there, or whatever. One userspace got fixed last year, but it'll
>>>>>> probably get copypasted around forever :-/
>>>>>>
>>>>>> So I don't think we'll ever manage to roll a clean split out, and best
>>>>>> we can do is give in and just hand userspace what it wants. As much as
>>>>>> that's misguided and unclean and all that. Maybe it'll result in a
>>>>>> least fewer stuff getting run as root to hack around this, because
>>>>>> fixing properly seems not to be on the table.
>>>>>>
>>>>>> The beauty of kms is that we've achieved the mission, everyone's
>>>>>> writing their own thing. Which is also terrible, and I don't think
>>>>>> it'll get better.
>>>>> With the risk of coming rude I will repeat my earlier comment:
>>>>>
>>>>> The problem is _neither_ Intel nor libva specific.
>>>>>
>>>>>
>>>>>
>>>>> That said, let's step back for a moment and consider:
>>>>>
>>>>>    - the "block everything but KMS via the primary node" idea is great but
>>>>> orthogonal
>>>>>
>>>>>    - the series does address issues that are vendor-agnostic
>>>>>
>>>>>    - by default this series does _not_ cause any regression be that for
>>>>> new or old userspace
>>>>>
>>>>>    - there are two trivial solutions, if the AMD team has concerns about
>>>>> closed-source/private stack depending on the old behaviour
>>>>> If they want I can even write the patches ;-)
>>>>>
>>>>>
>>>>> That said, the notable comments received so far are:
>>>>>    - rework patch 13/13 to remove the DRM_AUTH from prime fd to/from
>>>>> handle. I'm OK but this will change the return code - from EACCES to
>>>>> ENOSYS
>>>>>
>>>>>    - vmwgfx will need a check on the reference ioctl(s) - IIRC Thomas is
>>>>> planning to drop nearly all DRM_AUTH instances in their driver.
>>>>>
>>>>>
>>>>> Christian, as mentioned before - this series does _not_ add
>>>>> functionality to render nodes. It effectively paves a way towards
>>>>> removing DRM_AUTH.
>>>> But it adds functionality to the primary node.
>>>>
>>> Behaviour is adjusted - functionality was there since day 1.
>>>
>>>>> I understand the series may feel a bit dirty. Yet I would gladly address
>>>>> any technical concerns you have.
>>>> Well putting compatibility issues aside my concern is that this is
>>>> simply a bad design decision which we can't revert later on.
>>>>
>>> As sad above - any concerns (theoretical or actual regressions) can be
>>> trivially fixed _without_ reverting any of this.
>>>
>>> I am more than happy to step up and address any regressions in timely
>>> manner.
>>>
>>>
>>> As a reminder without this series, some of your customers are forced to
>>> run their applications as root.
>> I'm torn here on whether this is worth it. Have we got more use cases
>> to justify it?
>>
> Should have mentioned: three DRM drivers (not counting i915) have
> dropped DRM_AUTH, assumingly for the same reasons I'm bringing here.
>
> Apart from the libva, kmscube + gst and mesa, I'm expecting other
> projects to make the same mistake. Since the former three define the
> norm of using DRM.
>
> The "fix" for all of these being "run as root" :-\
>
>> I'm wary of opening this up just because we can.
>>
> What can I do to alleviate that worry? I have spent over a week auditing
> code and designed so that we can reinstate the authentication only where
> needed.

Well I don't think the worry here is about regressions, but rather about 
a design decision we will never be able to revert.

So the question we have to ask is rather if it's a good design decision 
to resurrect the primary node with all its related compability burdens 
to work around an issue which is essentially an userspace coding error.

And from my point of view the answer is relatively clear that this is 
not going to worth it.

Regards,
Christian.

>
> Thanks
> Emil