[RFC] Using DC in amdgpu for upcoming GPU

Thu Dec 8 20:07:26 UTC 2016

> I can't dig into details of DC, so this is not a 100% assessment, but if
> you call a function called "validate" in atomic_commit, you're very, very
> likely breaking atomic. _All_ validation must happen in ->atomic_check,
> if that's not the case TEST_ONLY mode is broken. And atomic userspace is
> relying on that working.
>
> The only thing that you're allowed to return from ->atomic_commit is
> out-of-memory, hw-on-fire and similar unforeseen and catastrophic issues.
> Kerneldoc expklains this.
>
> Now the reason I bring this up (and we've discussed it at length in
> private) is that DC still suffers from a massive abstraction midlayer. A
> lot of the back-end stuff (dp aux, i2c, abstractions for allocation,
> timers, irq, ...) have been cleaned up, but the midlayer is still there.
> And I understand why you have it, and why it's there - without some OS
> abstraction your grand plan of a unified driver across everything doesn't
> work out so well.
>
> But in a way the backend stuff isn't such a big deal. It's annoying since
> lots of code, and bugfixes have to be duplicated and all that, but it's
> fairly easy to fix case-by-case, and as long as AMD folks stick around
> (which I fully expect) not a maintainance issue. It makes it harder for
> others to contribute, but then since it's mostly the leaf it's generally
> easy to just improve the part you want to change (as an outsider). And if
> you want to improve shared code the only downside is that you can't also
> improve amd, but that's not so much a problem for non-amd folks ;-)
>
> The problem otoh with the abstraction layer between drm core and the amd
> driver is that you can't ignore if you want to refactor shared code. And
> because it's an entire world of its own, it's much harder to understand
> what the driver is doing (without reading it all). Some examples of what I
> mean:
>
> - All other drm drivers subclass drm objects (by embedding them) into the
>   corresponding hw part that most closely matches the drm object's
>   semantics. That means even when you have 0 clue about how a given piece
>   of hw works, you have a reasonable chance of understanding code. If it's
>   all your own stuff you always have to keep in minde the special amd
>   naming conventions. That gets old real fast if you trying to figure out
>   what 20+ (or are we at 30 already?) drivers are doing.
>
> - This is even more true for atomic. Atomic has a pretty complicated
>   check/commmit transactional model for updating display state. It's a
>   standardized interface, and it's extensible, and we want generic
>   userspace to be able to run on any driver. Fairly often we realize that
>   semantics of existing or newly proposed properties and state isn't
>   well-defined enough, and then we need to go&read all the drivers and
>   figure out how to fix up the mess. DC has it's entirely separate state
>   structures which again don't subclass the atomic core structures (afaik
>   at least). Again the same problems apply that you can't find things, and
>   that figuring out the exact semantics and spotting differences in
>   behaviour is almost impossible.
>
> - The trouble isn't just in reading code and understanding it correctly,
>   it's also in finding it. If you have your own completely different world
>   then just finding the right code is hard - cscope and grep fail to work.
>
> - Another issue is that very often we unify semantics in drivers by adding
>   some new helpers that at least dtrt for most of the drivers. If you have
>   your own world then the impendance mismatch will make sure that amd
>   drivers will have slightly different semantics, and I think that's not
>   good for the ecosystem and kms - people want to run a lot more than just
>   a boot splash with generic kms userspace, stuff like xf86-video-$vendor
>   is going out of favour heavily.
>
> Note that all this isn't about amd walking away and leaving an
> unmaintainable mess behind. Like I've said I don't think this is a big
> risk. The trouble is that having your own world makes it harder for
> everyone else to understand the amd driver, and understanding all drivers
> is very often step 1 in some big refactoring or feature addition effort.
> Because starting to refactor without understanding the problem generally
> doesn't work ;_) And you can't make this step 1 easier for others by
> promising to always maintain DC and update it to all the core changes,
> because that's only step 2.
>
> In all the DC discussions we've had thus far I haven't seen anyone address
> this issue. And this isn't just an issue in drm, it's pretty much
> established across all linux subsystems with the "no midlayer or OS
> abstraction layers in drivers" rule. There's some real solid reasons why
> such a HAl is extremely unpopular with upstream. And I haven't yet seen
> any good reason why amd needs to be different, thus far it looks like a
> textbook case, and there's been lots of vendors in lots of subsystems who
> tried to push their HAL.

Daniel has said this all very nicely, I'm going to try and be a bit more direct,
because apparently I've possibly been too subtle up until now.

No HALs. We don't do HALs in the kernel. We might do midlayers sometimes
we try not to do midlayers. In the DRM we don't do either unless the maintainers
are asleep. They might be worth the effort for AMD, however for the Linux kernel
they don't provide a benefit and make maintaining the code a lot harder. I've
maintained this code base for over 10 years now and I'd like to think
I've only merged
 something for semi-political reasons once (initial exynos was still
more Linuxy than DC),
and that thing took a lot of time to cleanup, I really don't feel like
saying yes again.

Given the choice between maintaining Linus' trust that I won't merge
100,000 lines
of abstracted HAL code and merging 100,000 lines of abstracted HAL code
I'll give you one guess where my loyalties lie. The reason the
toplevel maintainer (me)
doesn't work for Intel or AMD or any vendors, is that I can say NO
when your maintainers
can't or won't say it.

I've only got one true power as a maintainer, and that is to say No.
The other option
is I personally sit down and rewrite all the code in an acceptable
manner, and merge that
instead. But I've discovered I probably don't scale to that level, so
again it leaves me
with just the one actual power.

AMD can't threaten not to support new GPUs in upstream kernels without
merging this,
that is totally something you can do, and here's the thing Linux will
survive, we'll piss off
a bunch of people, but the Linux kernel will just keep on rolling
forward, maybe at some
point someone will get pissed about lacking upstream support for your
HW and go write
support and submit it, maybe they won't. The kernel is bigger than any
of us and has
standards about what is acceptable. Read up on the whole mac80211
problems we had
years ago, where every wireless vendor wrote their own 80211 layer
inside their driver,
there was a lot of time spent creating a central 80211 before any of
those drivers were
suitable for merge, well we've spent our time creating a central
modesetting infrastructure,
bypassing it is taking a driver in totally the wrong direction.

I've also wondered if the DC code is ready for being part of the
kernel anyways, what
happens if I merge this, and some external contributor rewrites 50% of
it and removes a
bunch of stuff that the kernel doesn't need. By any kernel standards
I'll merge that sort
of change over your heads if Alex doesn't, it might mean you have to
rewrite a chunk
of your internal validation code, or some other interactions, but
those won't be reasons
to block the changes from my POV. I'd like some serious introspection
on your team's
part on how you got into this situation and how even if I was feeling
like merging this
(which I'm not) how you'd actually deal with being part of the Linux
kernel and not hiding
in nicely framed orgchart silo behind a HAL. I honestly don't think
the code is Linux worthy
code, and I also really dislike having to spend my Friday morning
being negative about it,
but hey at least I can have a shower now.

No.

Dave.