[PATCH] drm/amdgpu: fix drm leases being broken on radv

Thu Apr 18 10:35:34 UTC 2019

On Thu, Apr 18, 2019 at 12:04 PM Michel Dänzer <michel at daenzer.net> wrote:
>
> On 2019-04-18 11:46 a.m., Daniel Vetter wrote:
> > On Thu, Apr 18, 2019 at 11:22 AM Michel Dänzer <michel at daenzer.net> wrote:
> >> On 2019-04-18 11:11 a.m., Daniel Vetter wrote:
> >>> On Thu, Apr 18, 2019 at 10:52 AM Michel Dänzer <michel at daenzer.net> wrote:
> >>>> On 2019-04-18 10:26 a.m., Daniel Vetter wrote:
> >>>>>
> >>>>> Ok correction: amd has stuck out in the past too, there was some vblank vs
> >>>>> pageflip stuff where we needed to do some pretty clever tricks to both
> >>>>> have the new stricter semantics for everyone, while keeping the amd ddx
> >>>>> happy still. This aint new at all, we fixed up the regressions and moved
> >>>>> on.
> >>>>
> >>>> That sounds like something I would have been involved in, but I'm not
> >>>> sure what you mean...
> >>>>
> >>>> I fixed the amdgpu/radeon kernel drivers to obey the general KMS rules
> >>>> for vblank vs page flips, and added DRM_MODE_PAGE_FLIP_TARGET to allow
> >>>> userspace to take advantage of our hardware's capabilities to avoid
> >>>> delaying a page flip by one refresh cycle in some cases. That's all.
> >>>
> >>> That's the one I meant. At least I remember quite a pile of
> >>> discussions around why i915 gets to define how this stuff works, and
> >>> amdgpu/radeon having to follow.
> >>>
> >>> [...]
> >>>
> >>> - the vblank vs flip clarifiation that required new uapi and upgraded
> >>> amd ddx (really not pretty if we have to somewhat break a performance
> >>> feature for some users)
> >>
> >> You're totally mis-representing that IMO.
> >>
> >> The amdgpu/radeon kernel drivers were simply broken WRT established UAPI
> >> rules, in a way which affected our DDX drivers as well. The kernel
> >> driver fixes didn't require any corresponding userspace changes.
> >
> > It wasn't ever defined when we merged the page-flip ioctl, and the
> > vblank ioctl was retrofitted/abused to also work with kms. i915
> > happened to guarantee that a flip immediately after a vblank wait will
> > hit the next vblank, radeon (this stuff is really old) had more wiggle
> > room. And for unthrottled vblank, this mattered since it would allow
> > you to update frames still, even when the rendering was really late.
> >
> > The a lot of people started writing kms compositors, and due to random
> > reasons they mostly developed those on i915. Relying on the
> > i915-specific guarantee between vblank waits and page-flips.
> >
> > Then the entire atomic saga happened, and since I spent way too much
> > time reading other kms drivers already I wanted to standardize a bunch
> > of the uapi grey areas. This was starting with 2014/2015. Stuff I
> > still remember:
> > [...]
> > - and later on vblank vs timestamp, when the nonblocking helpers landed
> >
> > As a result, even more people started relying on this stuff.
> > Eventually someone noticed that amdgpu works differently. There was a
> > pretty heated irc debate between you and me about why exactly amdgpu
> > needs to follow the i915 standard, and why intel gets to set all these
> > standards and a few others things. This was around 2016 or so, per git
> > blame of relevant at least.
>
> Sounds like I did some knee-jerk ranting on IRC at the time, sorry.
>
>
> > We ended up adjusting amd drivers to follow i915 semantics, which
> > fixed scheduled flips on lots of compositors,
>
> It also fixed them with our DDX drivers, as otherwise there was no way
> for userspace to prevent page flips from completing one refresh cycle
> too early.
>
> > and broke unthrottled flipping (very slightly, but it did) on amd.
> > page_flip_target remedied that.
>
> That's just a minor optimization.
>
>
> >> New UAPI was added to allow taking advantage of hardware capabilities in
> >> some cases without breaking other cases.
> >>
> >> It's a pretty text-book example of how to manage UAPI.
> >
> > Well I guess we remember different aspects of that story, but to me it
> > was a textbook case of how retrofitting uapi has challenges.
>
> I don't agree "retrofitting" is accurate in this case. This was always
> the intention for the UAPI, the amdgpu/radeon kernel drivers just
> accidentally didn't implement it correctly.
>
> > One of the options you proposed was that we'd have a flag to select the i915
> > behaviour, and leave the default undefined.
>
> So I made a bad proposal at the time. :) I honestly didn't even remember
> that. I really don't see that episode having much to do with the current
> discussion.

Yeah it's definitely a lot less serious uapi retrofitting than what
we're discussing here now. But I do see a lot of parallels:

- We've implemented rendernodes as a clean/new uapi years ago. git
says we've made them the default in 2014.

- Lots of work happened to make render nodes the main thing. Despite
all that effort (well over 5 years, and a few more, initial render
nodes merge was 2013), adoption is fairly lack-luster and defacto uapi
is "use primary nodes, cause it works if you're root".

We can keep telling everyone they're wrong and that they should use
render nodes, or we can give in and make the uapi fit more with what
people out there seem to actually use. What I don't want to do is some
gradual thing where we slowly change things one driver at a time (like
Christian's patch proposes, starting with i915), cause that leads to
lots of confusion and eventually a lone driver standing in the rain
:-/

Now maybe the right approach is to keep telling everyone they should
use render nodes, but I'm not sure another 5 years is really going to
move this needle. But it is a tradeoff between doing the right thing
from how uapi should evolve, and the pragmatic thing of shrugging and
moving on and implementing what seems to be actually in use. And we've
done quite a bit of the pragmatic kind of uapi
clarifictation/adjusting to actual reality and mostly gotten away with
it. This here doesn't seem like a much different case.

Anyway, to unstuck this discussion a bit I chatted with Emil on irc,
and came up with a new idea for an rfc patch series:
- include both options discussed here for handling amdgpu/radv
- include the original patch again
- include patches for all DRIVER_RENDER drivers to drop DRM_AUTH from
them. That should give us a lot more eyes and reviewers to figure out
whether there's other place this might break userspace.

If that results in lots of negative feedback, we can confidently drop
this idea on the floor as a bad one, and will have to keep telling
everyone to use rendernodes more. If not, then we should know a lot
better whether it will blow up somewhere else than for radv only. And
Emil seems happy to type the patches.

Cheers, Daniel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch