[PATCH 01/13] drm/amdgpu: introduce and honour DRM_FORCE_AUTH workaround

Fri May 31 12:20:30 UTC 2019

On 29.05.19 at 18:29, Emil Velikov wrote:
> On 2019/05/29, Koenig, Christian wrote:
>> On 29.05.19 at 15:03, Emil Velikov wrote:
>>> On 2019/05/29, Dave Airlie wrote:
>>>> On Wed, 29 May 2019 at 02:47, Emil Velikov <emil.l.velikov at gmail.com> wrote:
>>>>> On 2019/05/28, Koenig, Christian wrote:
>>>>>> On 28.05.19 at 18:10, Emil Velikov wrote:
>>>>>>> On 2019/05/28, Daniel Vetter wrote:
>>>>>>>> On Tue, May 28, 2019 at 10:03 AM Koenig, Christian
>>>>>>>> <Christian.Koenig at amd.com> wrote:
>>>>>>>>> On 28.05.19 at 09:38, Daniel Vetter wrote:
>>>>>>>>>> [SNIP]
>>>>>>>>>>> Might be a good idea looking into reverting it partially, so that at
>>>>>>>>>>> least command submission and buffer allocation is still blocked.
>>>>>>>>>> I thought the issue is a lot more than vainfo, it's pretty much every
>>>>>>>>>> hacked up compositor under the sun getting this wrong one way or
>>>>>>>>>> another. Thinking about this some more, I also have no idea how you'd
>>>>>>>>>> want to deprecate rendering on primary nodes in general. Apparently
>>>>>>>>>> that breaks -modesetting already, and probably lots more compositors.
>>>>>>>>>> And it looks like we've finally achieved the goal kms set out for 10
>>>>>>>>>> years ago, and new compositors are sprouting up all the time. I guess
>>>>>>>>>> we could just break them all (on new hardware) and tell them to all
>>>>>>>>>> suck it up. But I don't think that's a great option. And just
>>>>>>>>>> deprecating this on amdgpu is going to be even harder, since then
>>>>>>>>>> everywhere else it'll keep working, and it's just amdgpu.ko that looks
>>>>>>>>>> broken.
>>>>>>>>>>
>>>>>>>>>> Aside: I'm not supporting Emil's idea here because it fixes any issues
>>>>>>>>>> Intel has - Intel doesn't care. I support it because reality sucks,
>>>>>>>>>> people get this render vs. primary vs. multi-gpu prime wrong all the
>>>>>>>>>> time (that's also why we have hardcoded display+gpu pairs in mesa for
>>>>>>>>>> the various soc combinations out there), and this looks like a
>>>>>>>>>> pragmatic solution. It'd be nice if every compositor and everything
>>>>>>>>>> else would perfectly support multi gpu and only use render nodes for
>>>>>>>>>> rendering, and only primary nodes for display. But reality is that
>>>>>>>>>> people hack on stuff until gears on screen and then move on to more
>>>>>>>>>> interesting things (to them). So I don't think we'll ever win this :-/
>>>>>>>>> Yeah, but this is a classic case of working around user space issues by
>>>>>>>>> making kernel changes instead of fixing user space.
>>>>>>>>>
>>>>>>>>> Having privileged (output control) and unprivileged (rendering control)
>>>>>>>>> functionality behind the same node is a mistake we have made a long time
>>>>>>>>> ago and render nodes finally seemed to be a way to fix that.
>>>>>>>>>
>>>>>>>>> I mean why are compositors using the primary node in the first place?
>>>>>>>>> Because they want to have access to privileged resources I think and in
>>>>>>>>> this case it is perfectly ok to do so.
>>>>>>>>>
>>>>>>>>> Now extending unprivileged access to the primary node actually sounds
>>>>>>>>> like a step in the wrong direction to me.
>>>>>>>>>
>>>>>>>>> I rather think that we should go down the route of completely dropping
>>>>>>>>> command submission and buffer allocation through the primary node for
>>>>>>>>> non-master clients. And then as a next step at some point drop support for
>>>>>>>>> authentication/flink.
>>>>>>>>>
>>>>>>>>> I mean we have done this with UMS as well and I don't see much other way
>>>>>>>>> to move forward and get rid of those ancient interfaces in the long term.
>>>>>>>> Well kms had some really good benefits that drove quick adoption, like
>>>>>>>> "suspend/resume actually has a chance of working" or "comes with
>>>>>>>> buffer management so you can run multiple gears".
>>>>>>>>
>>>>>>>> The render node thing is a lot more niche use case (prime, better priv
>>>>>>>> separation), plus "it's cleaner design". And the "cleaner design" part
>>>>>>>> is something that empirically doesn't seem to matter :-/ Just two
>>>>>>>> examples:
>>>>>>>> - KHR_display/leases just iterated display resources on the fd needed
>>>>>>>> for rendering (and iirc there was even a patch to expose that for
>>>>>>>> render nodes too so it works with DRI3), because implementing
>>>>>>>> protocols is too hard. Barely managed to stop that one before it
>>>>>>>> happened.
>>>>>>>> - Various video players use the vblank ioctl directly to schedule
>>>>>>>> frames, without telling the compositor. I discovered that when I
>>>>>>>> wanted to limit the vblank ioctl to master clients only. Again,
>>>>>>>> apparently too hard to use the existing extensions, or fix the bugs in
>>>>>>>> there, or whatever. One userspace got fixed last year, but it'll
>>>>>>>> probably get copypasted around forever :-/
>>>>>>>>
>>>>>>>> So I don't think we'll ever manage to roll a clean split out, and best
>>>>>>>> we can do is give in and just hand userspace what it wants. As much as
>>>>>>>> that's misguided and unclean and all that. Maybe it'll result in at
>>>>>>>> least less stuff getting run as root to hack around this, because
>>>>>>>> fixing properly seems not to be on the table.
>>>>>>>>
>>>>>>>> The beauty of kms is that we've achieved the mission, everyone's
>>>>>>>> writing their own thing. Which is also terrible, and I don't think
>>>>>>>> it'll get better.
>>>>>>> At the risk of coming across as rude, I will repeat my earlier comment:
>>>>>>>
>>>>>>> The problem is _neither_ Intel nor libva specific.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> That said, let's step back for a moment and consider:
>>>>>>>
>>>>>>>     - the "block everything but KMS via the primary node" idea is great but
>>>>>>> orthogonal
>>>>>>>
>>>>>>>     - the series does address issues that are vendor-agnostic
>>>>>>>
>>>>>>>     - by default this series does _not_ cause any regression be that for
>>>>>>> new or old userspace
>>>>>>>
>>>>>>>     - there are two trivial solutions, if the AMD team has concerns about
>>>>>>> closed-source/private stack depending on the old behaviour
>>>>>>> If they want I can even write the patches ;-)
>>>>>>>
>>>>>>>
>>>>>>> That said, the notable comments received so far are:
>>>>>>>     - rework patch 13/13 to remove the DRM_AUTH from prime fd to/from
>>>>>>> handle. I'm OK but this will change the return code - from EACCES to
>>>>>>> ENOSYS
>>>>>>>
>>>>>>>     - vmwgfx will need a check on the reference ioctl(s) - IIRC Thomas is
>>>>>>> planning to drop nearly all DRM_AUTH instances in their driver.
>>>>>>>
>>>>>>>
>>>>>>> Christian, as mentioned before - this series does _not_ add
>>>>>>> functionality to render nodes. It effectively paves a way towards
>>>>>>> removing DRM_AUTH.
>>>>>> But it adds functionality to the primary node.
>>>>>>
>>>>> Behaviour is adjusted - functionality was there since day 1.
>>>>>
>>>>>>> I understand the series may feel a bit dirty. Yet I would gladly address
>>>>>>> any technical concerns you have.
>>>>>> Well putting compatibility issues aside my concern is that this is
>>>>>> simply a bad design decision which we can't revert later on.
>>>>>>
>>>>> As said above - any concerns (theoretical or actual regressions) can be
>>>>> trivially fixed _without_ reverting any of this.
>>>>>
>>>>> I am more than happy to step up and address any regressions in a timely
>>>>> manner.
>>>>>
>>>>>
>>>>> As a reminder, without this series, some of your customers are forced to
>>>>> run their applications as root.
>>>> I'm torn here on whether this is worth it. Have we got more use cases
>>>> to justify it?
>>>>
>>> Should have mentioned: three DRM drivers (not counting i915) have
>>> dropped DRM_AUTH, presumably for the same reasons I'm bringing up here.
>>>
>>> Apart from libva, kmscube + gst and mesa, I'm expecting other
>>> projects to make the same mistake, since these three define the
>>> norm for using DRM.
>>>
>>> The "fix" for all of these being "run as root" :-\
>>>
>>>> I'm wary of opening this up just because we can.
>>>>
>>> What can I do to alleviate that worry? I have spent over a week auditing
>>> code and designed this so that we can reinstate the authentication only where
>>> needed.
>> Well I don't think the worry here is about regressions,
> Glad to hear.
>
>> but rather about
>> a design decision we will never be able to revert.
>>
> Can you think of any reason/issue why we would want to revert this? I
> will gladly spend some time exploring how to address it.

Well, to finally get rid of the primary node for non-display hardware.

And in general to have a clean separation between display and rendering.

>> So the question we have to ask is rather if it's a good design decision
>> to resurrect the primary node with all its related compatibility burdens
>> to work around an issue which is essentially a userspace coding error.
>>
> Can see you're not happy on the topic - I'm not too excited either. The
> truth of the matter is - DRM drivers have dropped DRM_AUTH regardless of
> my work.

Then we should probably consider stopping this and enforcing that
the primary node is not used that widely any more.

Regards,
Christian.

>
> It's very unfortunate if AMDGPU stands out. Perhaps after some time and
> unhappy users you'll reconsider.
>
> I believe that Linus has pointed out a number of times that kernel
> developers should care about our users. Even when it's a userspace
> error.
>
>
> HTH
> Emil