[Intel-gfx] Maintainer-review fluff (was: Re: [PATCH 01/12] drm/i915: plumb VM into object operations)

Daniel Vetter daniel at ffwll.ch
Tue Aug 6 00:19:33 CEST 2013


On Mon, Aug 5, 2013 at 11:33 PM, Jesse Barnes <jbarnes at virtuousgeek.org> wrote:
> On Sun, 4 Aug 2013 22:17:47 +0200
> Daniel Vetter <daniel at ffwll.ch> wrote:
>> Imo the "unpredictable upstream" vs. "high quality kernel support in
>> upstream" is a false dichotomy. Afaics the "unpredictability" is _because_
>> I am not willing to compromise on decent quality. I still claim that
>> upstreaming is a fairly predictable thing (within some bounds of how well
>> some tasks can be estimated up-front without doing some research or
>> prototyping), and the blocker here is our mediocre project tracking.
>
> Well, I definitely disagree here.  With our current (and recent past)
> processes, we've generally ended up with lots of hw support landing
> well after parts start shipping, and the quality hasn't been high (in
> terms of user reported bugs) despite all the delay.  So while our code
> might look pretty, the fact is that it's late, and has hard-to-debug
> low-level bugs (RC6, semaphores, etc.).
>
> <rant>
> It's fairly easy to add support for hardware well after it ships, and
> in a substandard way (e.g. hard power features disabled because we
> can't figure them out because the hw debug folks have moved on).  If we
> want to keep doing that, fine, but I'd really like us to do better and
> catch the hard bugs *before* hw ships, and make sure it's solid and
> complete *before* users get it.  But maybe that's just me.  Maybe
> treating our driver like any other RE or "best effort" Linux driver is
> the right way to go.  If so, fine, let's just not change anything.
> </rant>

The only thing I read here, both in the paragraph above and in the
rant, is that we suck. I agree. My opinion is that this is because we
started late, had too few resources and didn't seriously estimate how
much work is actually involved to enable something for real.

The only reason I could distill from the above two paragraphs, amid
the ranting, for why we are so late is "So while our code might look
pretty, it's late and buggy". That's imo a fairly shallow stab at
perceived bikesheds, but not a useful angle for improving the process.

Now I agree that I uphold a fairly high quality standard for
upstreaming, but not an unreasonable one:
- drm/i915 transformed from the undisputed shittiest driver in the
kernel to one that mostly just works, while picking up development
pace. So I don't think I'm fully wrong on insisting on this level of
quality.
- we do ship the driver essentially continuously, which means we can
implement features only by small refactoring steps. That clearly
involves more work than just stitching something together for a
product.

I welcome discussing whether I impose too high a standard, but that
discussion needs to be backed by examples and solid reasons. "It's
just too hard" without more context isn't one, since yes, the work we
pull off here actually is hard.

Also note that Chris and I still bear the brunt of fixing the random
fallout all over the place (it's getting better). So if any proposed
change involves me blowing through even more time to track down
issues, I'm strongly against it. The same holds for Chris' often-heard
comment that a patch needs an improved commit message or a comment
somewhere. Yes, it's annoying that you need to resend a patch (this
often bugs me myself) just to paint the bikeshed a bit differently.
But imo Chris is pretty much spot-on with his requests throughout, and
a high-quality git history has, in my experience at least, been
extremely valuable for tracking down the really ugly issues and the
legalese around all the established precedent.

>> My approach here has been to be a royal jerk about test coverage for new
>> features and blocking stuff if a regression isn't tackled in time. People
> scream all around, but it seems to work and we're imo getting to a "fairly
>> decent regression handling" point. I also try to push for enabling
>> features across platforms (if the hw should work the same way) in the name
>> of increased test coverage. That one seems to be less effective (e.g. fbc
>> for hsw only ...).
>
> But code that isn't upstream *WON'T BE TESTED* reasonably.  So if
> you're waiting for all tests to be written before going upstream, all
> you're doing is delaying the bug reports that will inevitably come in,
> both from new test programs and from general usage. On top of that, if
> someone is trying to refactor at the same time, things just become a
> mess with all sorts of regressions introduced that weren't an issue
> with the original patchset...

QA on my trees and the igt test coverage I demand for new features
are there to catch regressions once something is merged. We've
managed to break code less than a day after it was merged on multiple
occasions, so this is very real and simply part of the quality
standard I impose.

Furthermore I don't want a new feature to regress the overall
stability of our driver. And since that quality is improving rather
decently, I ask for more testcases to exercise corner cases and make
sure they're all covered. This is very much orthogonal to doing
review and just one more puzzle piece to ensure we don't go back to
the good old days of shipping half-baked crap.
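
To make "more testcases for corner cases" a bit more concrete, here's
a rough sketch of the shape such a testcase takes, using the subtest
infrastructure from intel-gpu-tools (igt_main, igt_fixture,
igt_subtest, igt_assert, drm_open_any -- names from memory, so treat
this as an illustration rather than a copy-paste recipe). The
"corner-case" body is a hypothetical placeholder, not a real kernel
interface exercise:

  #include <unistd.h>
  #include "drmtest.h"

  static int fd;

  igt_main
  {
          igt_fixture
                  fd = drm_open_any();

          /* smoke test: the basic path of the new feature */
          igt_subtest("basic")
                  igt_assert(fd >= 0);

          /* corner case: hypothetical placeholder for the tricky
           * interaction the new feature introduces */
          igt_subtest("corner-case")
                  igt_assert(fd >= 0);

          igt_fixture
                  close(fd);
  }

The point is that each tricky interaction gets its own subtest, so QA
can bisect a regression down to one named case instead of one big
"the feature broke" report.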

Note that nowadays QA is catching a lot of the regressions even
before the patches land in Dave's tree (there's the occasional brown
paper bag event, though, but in each such case I analyse the failure
mode and work to prevent it in the future). And imo that's squarely
due to much improved test coverage and the rigid test coverage
requirements for new features I impose. And of course the overall
improved QA process flow, with much quicker regression turnaround
times, also greatly helps here.

Now I agree (and I think I've mentioned this a bunch of times in this
thread already) that this is a pain for developers. I see two main
issues, both of which are (slowly) improving:
- Testing patch series for regressions before merging. QA just set up
the developer patch test system, and despite it still being rather
limited, Ben seems to be fairly happy with where it's going. So I
think we're on track to improve this and avoid the need for
developers to have a private lab like the one Chris and I essentially
have.
- Rebase hell due to other ongoing work. Thus far I've only tried to
help here by rechecking/delaying refactoring patches while big
features are pending. I think we need to try new approaches here, and
imo better planning should help. E.g. the initial modeset refactor
was way too big, a monolithic chunk that I just wrestled in by
extorting r-b tags from you. In contrast the pipe config rework was
about equally big, but at any given time only about 30-50 patches
were outstanding (in extreme cases), and multiple people contributed
different parts of the overall beast. Of course that means that
occasionally, for really big stuff, we need to plan to write a first
proof of concept as a landmark for where we need to go, which will
pretty much be thrown away completely.

One meta-comment on top of the actual discussion: I really appreciate
critique and I've grown a good maintainer skin to deal with even
really harsh critique. But I prefer less ranting and more concrete
examples of where I've botched the job (there are plenty to pick from
imo) and concrete suggestions for how to improve our overall process.
I think these process woes are painful for everyone, and due to our
fast growth we're constantly pushing into new levels of ugly, but imo
the way forward is small (sometimes positively tiny) but continuous
adjustments and improvements.

I think we both agree on where we'd like to be, but at least for me,
in the day-to-day fight in the trenches, the rosy picture 200 miles
away doesn't really help. Maybe I'm too delusional and sarcastic that
way ;-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


