[RFC] Using DC in amdgpu for upcoming GPU

Fri Dec 9 19:59:33 UTC 2016

I guess things went a bit sideways by me and Dave only talking about
the midlayer, so let me first state that the DC stuff has massively
improved through replacing all the backend services that reimplemented
Linux helper libraries with their native equivalent. That's some
serious work, and it shows that AMD is committed to doing the right
thing.

I absolutely didn't want to belittle all that effort by only raising
what I see is the one holdover left.

On Fri, Dec 9, 2016 at 6:26 PM, Cheng, Tony <Tony.Cheng at amd.com> wrote:
>> Merging this code as well as maintaining a trust relationship with
>> Linus, also maintains a trust relationship with the Linux graphics
>> community and other drm contributors. There have been countless
>> requests from various companies and contributors to merge unsavoury
>> things over the years and we've denied them. They've all had the same
>> reasons behind why they couldn't do what we want and why we were
>> wrong, but lots of people have shown up who do get what we are at and
>> have joined the community and contributed drivers that conform to the standards.
>> Turning around now and saying well AMD ignored our directions, so
>> we'll give them a free pass even though we've denied you all the same
>> thing over time.
>
> I'd like to say that I acknowledge the good and hard work maintainers are doing.  You nor the community is wrong to say no. I understand where the no comes from.  If somebody wants to throw 100k lines into DAL I would say no as well.
>
>> If I'd given in and merged every vendor coded driver as-is we'd never
>> have progressed to having atomic modesetting, there would have been
>> too many vendor HALs and abstractions that would have blocked forward
>> progression. Merging one HAL or abstraction is going to cause  pain,
>> but setting a precedent to merge more would be just downright stupid
>> maintainership.
>
>> Here's the thing, we want AMD to join the graphics community not hang
>> out inside the company in silos. We need to enable FreeSync on Linux,
>> go ask the community how would be best to do it, don't shove it inside
>> the driver hidden in a special ioctl. Got some new HDMI features that
>> are secret, talk to other ppl in the same position and work out a plan
>> for moving forward. At the moment there is no engaging with the Linux
>> stack because you aren't really using it, as long as you hide behind
>> the abstraction there won't be much engagement, and neither side
>> benefits, so why should we merge the code if nobody benefits?
>
>
>> The platform problem/Windows mindset is scary and makes a lot of
>> decisions for you, open source doesn't have those restrictions, and I
>> don't accept drivers that try and push those development model
>> problems into our codebase.
>
> I would like to share how platform problem/Windows mindset look from our side.  We are dealing with ever more complex hardware with the push to reduce power while driving more pixels through.  It is the power reduction that is causing us driver developers most of the pain.  Display is a high bandwidth real time memory fetch sub system which is always on, even when the system is idle.  When the system is idle, pretty much all of power consumption comes from display.  Can we use existing DRM infrastructure?  Definitely yes, if we talk about modes up to 300Mpix/s and leaving a lot of voltage and clock margin on the table.  How hard is it to set up a timing while bypass most of the pixel processing pipeline to light up a display?  How about adding all the power optimization such as burst read to fill display cache and keep DRAM in self-refresh as much as possible?  How about powering off some of the cache or pixel processing pipeline if we are not using them?  We need to manage and maximize valuable resources like cache (cache == silicon area == $$) and clock (== power) and optimize memory request patterns at different memory clock speeds, while DPM is going, in real time on the system.  This is why there is so much code to program registers, track our states, and manages resources, and it's getting more complex as HW would prefer SW program the same value into 5 different registers in different sub blocks to save a few cross tile wires on silicon and do complex calculations to find the magical optimal settings (the hated bandwidth_cals.c).  There are a lot of registers need to be programmed to correct values in the right situation if we enable all these power/performance optimizations.
>
> It's really not a problem of windows mindset, rather is what is the bring up platform when silicon is in the lab with HW designer support.  Today no surprise we do that almost exclusively on windows.  Display team is working hard to change that to have linux in the mix while we have the attention from HW designers.  We have a recent effort to try to enable all power features on Stoney (current gen low power APU) to match idle power on windows after Stoney shipped.  Linux driver guys working hard on it for 4+ month and still having hard time getting over the hurdle without support from HW designers because designers are tied up with the next generation silicon currently in the lab and the rest of them already moved onto next next generation.  To me I would rather have everything built on top of DC, including HW diagnostic test suites.  Even if I have to build DC on top of DRM mode setting I would prefer that over trying to do another bring up without HW support.  After all as driver developer refactoring and changing code is more fun than digging through documents/email and experimenting with different combination of settings in register and countless of reboots to try get pass some random hang.
>
> FYI, just dce_mem_input.c programs over 50 distinct register fields, and DC for current generation ASIC doesn't yet support all features and power optimizations.  This doesn't even include more complex programming model in future generation with HW IP getting more modular.  We are already making progress with bring up with shared DC code for next gen ASIC in the lab. DC HW programming / resource management / power optimization will be fully validated on all platforms including Linux and that will benefit the Linux driver running on AMD HW, especially in battery life.
>
> Just in case you are wondering Polaris windows driver isn't using DC and was on a "windows architecture" code base.  We understand that from community point of view you are not getting much feature / power benefit yet because CI/VI/CZ/Polaris Linux driver with DC is only used in Linux and we don’t have the man power to make it fully optimized yet.  Next gen will be performance and power optimized at launch.  I acknowledge that we don't have full feature on Linux yet and we still need to work with community to amend DRM to enable FreeSync, HDR, next gen resolution and other display feature just made available in Crimson ReLive.  However it's not realistic to engage with community early on in these efforts, as up to 1 month prior to release we were still experimenting with different solutions to make the feature better and we wouldn't have known what we end up building half year ago.  And of course marketing wouldn't let us leak these features before Crimson launch.

This is something you need to fix, or it'll stay completely painful
forever. It's hard work and takes years, but here at Intel we pulled
it off. We can upstream everything from a _very_ early stage (can't
tell you how early). And we have full marketing approval for that. If
you watch the i915 commit stream you can see how our code is chasing
updates from the hw engineers debugging things.

> I would like to work with the community and I think we have shown that we welcome, appreciate and take feedback seriously.  There is plenty of work done in DC addressing some of the easier to fix problems while we have next gen ASIC in the lab as top priority.  We are already down to 66k lines of code from 93k through refactoring and remove numerous abstractions.  We can't just tear apart the "mid layer" or "HAL" over night.  Plenty of work need to be done to understand if/how we can fit resource optimization complexity into existing DRM framework.  If you look at DC structure closely, we created them to plug into DRM structures (ie.  dc_surface == FB/plane, dc_stream ~= CRTC, dc_link+dc_sink = encoder + connector), but we need a resource layer to decide how to realize the given "state" with our HW.  The problem is not getting simpler as on top of multi-plane combine, shared encoders and clock resources,  compression is starting to get into display domain.  By the way, existing DRM structure do fit nicely for HW of 4 generations ago, and with current windows driver we do have concept of crtc, encoders, connector. However over the years complexity has grown and resource management is becoming a problem, which led us to design of putting in a resource management layer.  We might not be supporting full range of what atomic can do and our semantics may be different at this stage of development, but saying dc_validate breaks atomic only tells me you haven't take a close look at our DC code.  For us all validation runs same topology/resource algorithm in check and commit.  It's not optimal yet as we will end up doing this algorithm twice today on a commit but we do intend to fix it over time.  I welcome any concrete suggestions on using existing framework to solve the resource/topology management issue.  It's not too late to change DC now but after couple year after more OS and ASICs are built on top of DC it will be very difficult to change.

I guess I assumed too much that midlayer is a known thing. No one's
asking AMD to throw all the platform DC code away. No ones asking you
to rewrite the bandwidth calculations, clock tuning and all the code
that requires tons of work at power on to get right. Asking for that
would be beyond silly.

The disagreement is purely about how all that code interfaces with the
DRM subsystem, and how exactly it implements the userspace ABI. As
you've noticed the DRM objects don't really fit well for todays
hardware any more, but because it's Linux userspace ABI we can't ever
change those (well at least not easily) and will be stuck for another
few years or maybe even decades with them. Which means _every_ driver
has to deal with an impendence mismatch. The question now is how you
deal with that impendence mismatch.

The industry practice has been to insert an abstraction layer to
isolate your own code as much as possible. The linux best practice is
essentially an inversion of control where you write a bit of
linux-specific glue which drives the bits and pieces that are part of
your backend much more directly. And the reason why the abstraction
layer isn't popular in linux is that it makes cross-vendor
collaboration much more painful, and unecessarily so. Me misreading
your atomic code is a pretty good example - of course you understand
it and can see that I missed things, it's your codebase. But as
someone who reads drm drivers all day long stumbling over a driver
where things work completely differently means I'm just lost, and it's
much harder to understand things. And upstream does optimize for
cross-vendor collaboration.

But none of that means you need to throw away your entire backend. It
only means that the interface should try to be understandle to people
who don't look at dal/dc all day long. So if you say above that e.g.
dc_surface ~ drm_plane, then the expectation is that dc_surface embeds
(subclassing in OO speak) drm_plane. Yes there will be some mismatches
and there's code patterns and support in atomic to handle them, but it
makes it much easier to understand vendor code for outsides.

And this also doesn't mean that your backend code needs to deal with
drm_planes all the time, it will still deal with dc_surface. The only
thing that kinda changes is that if you want to keep cross-vendor
support you might need some abstraction so that your shared code
doesn't heavily depend upon the drm_plane layout.

Similar for resource optimization and state handling in atomic:
There's a very clear pattern that all DRM drivers follow, which is
massively extensible (you're not the only ones support shared
resources, and not the only vendor where the simple
plane->crtc->encoder pipeline has about 10 different components and IP
blocks in reality). And by following that pattern (and again you can
store whatever you want in your own private dc_surface_state) it makes
it really easy for others to quickly check a few things in your
driver, and I wouldn't have made the mistake of not realizing that you
do validate the state in atomic_check.

And personally I don't believe that designing things this way round
will result in more unshared code between different platforms. The
intel i915 is being reused (not on windows, but on a bunch of other
more fringe OS), and the people doing that don't seem to terribly
struggle with it.

>> Now the reason I bring this up (and we've discussed it at length in
>> private) is that DC still suffers from a massive abstraction midlayer.
>> A lot of the back-end stuff (dp aux, i2c, abstractions for allocation,
>> timers, irq, ...) have been cleaned up, but the midlayer is still there.
>> And I understand why you have it, and why it's there - without some OS
>> abstraction your grand plan of a unified driver across everything
>> doesn't work out so well.
>>
>> But in a way the backend stuff isn't such a big deal. It's annoying
>> since lots of code, and bugfixes have to be duplicated and all that,
>> but it's fairly easy to fix case-by-case, and as long as AMD folks
>> stick around (which I fully expect) not a maintainance issue. It makes
>> it harder for others to contribute, but then since it's mostly the
>> leaf it's generally easy to just improve the part you want to change
>> (as an outsider). And if you want to improve shared code the only
>> downside is that you can't also improve amd, but that's not so much a
>> problem for non-amd folks ;-)
>
> Unfortunately duplicating bug fixes is not trivial and if code base diverge some of the fixes will be different.  Surprisingly if you track where we spend our time, < 20% is writing code.  Probably 50% is trying to figure out which register need a different value programmed in those situations. The other 30% is trying to make sure the change doesn’t break other stuff in different scenarios.  If power and performance optimizations remains off in Linux then I would agree with your assessment.
>
>> I've only got one true power as a maintainer, and that is to say No.
>
> We AMD driver developer only got 2 true power over community, and that is having access to internal documentation and HW designers.  Not pulling Linux into the mix while silicon is still in the lab means we lose half of our power (HW designer support).
>
>> I've also wondered if the DC code is ready for being part of the kernel
>> anyways, what happens if I merge this, and some external
>> contributor rewrites 50% of it and removes a bunch of stuff that the
>> kernel doesn't need. By any kernel standards I'll merge that sort of
>> change over your heads if Alex doesn't, it might mean you have to
>> rewrite a chunk of your internal validation code, or some other
>> interactions, but those won't be reasons to block the changes from
>> my POV. I'd like some serious introspection on your team's part on
>> how you got into this situation and how even if I was feeling like
>> merging this (which I'm not) how you'd actually deal with being part
>> of the Linux kernel and not hiding in nicely framed orgchart silo
>> behind a HAL.
>
> We have come a long way compare to how we used to be windows centric, and I am sure there is plenty of work remaining for us to be ready to be part of the kernel.  If community has clever and clean solution that doesn’t break our ASICs we’ll take it internally with open arms.  We merged Dave and Jerome’s clean up on removing abstractions and we had lots of patches following Dave and Jerome’s lead in different area.
>
> Again this is not about orgchart.  It’s about what’s validated when samples are in the lab.
>
> God I miss the day when everything is plugged into the wall and dual link DVI was cutting edge.  At least most of our problem can be solved by diffing register dump between good and bad case.

Yeah, nothing different to what we suffer/experience here at Intel ;-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch