[Mesa-dev] [RFC] Mesa 17.3.x release problems and process improvements

Wed Mar 14 05:06:36 UTC 2018

On 14/03/18 07:36, Mark Janes wrote:
> Daniel Vetter <daniel at ffwll.ch> writes:
> 
>> On Tue, Mar 13, 2018 at 4:46 PM, Mark Janes <mark.a.janes at intel.com> wrote:
>>> Daniel Vetter <daniel at ffwll.ch> writes:
>>>
>>>> On Mon, Mar 12, 2018 at 11:54:45PM -0700, Kenneth Graunke wrote:
>>>>> On Friday, March 9, 2018 12:12:28 PM PDT Mark Janes wrote:
>>>>> [snip]
>>>>>> I've been doing this for Intel.  Developers are on the hook to fix their
>>>>>> bugs, but you can't make them do it.  They have many pressures on them,
>>>>>> and a maintainer can't make the call as to whether a rendering bug is
>>>>>> more important than day-1 vulkan conformance, for example.
>>>>>>
>>>>>> We could heighten the transparency of what is blocking the build by
>>>>>> publicizing the authors of bisected blocking bugs to Phoronix, which
>>>>>> might get things moving.
>>>>>
>>>>> I hope you're being sarcastic here, or else I'm misunderstanding your
>>>>> proposal.  Public shaming of developers who create bugs has absolutely
>>>>> no place in the Mesa community, IMHO.  It would foster the kind of toxic
>>>>> community that none of us want to be a part of.
>>>>>
>>>>> Sometimes, people who create bugs are the very people that work the
>>>>> hardest, who the project may not even exist without.  Would you want
>>>>> to chew out someone for creating a bug in a Vulkan driver when...if it
>>>>> weren't for that person, you wouldn't have a Vulkan driver at all?  Or,
>>>>> maybe they caused a couple bad bugs...but also fixed hundreds of them.
>>>>>
>>>>> Other times, they're new contributors or volunteers who do this, not as
>>>>> their day job.  Frankly, those people are under no obligation to help us
>>>>> at all, so we need to thank them and appreciate the time and effort they
>>>>> spend - and give them a hand fixing things when they're too busy, or
>>>>> don't have the relevant hardware or skill to track down a regression.
>>>>>
>>>>> It's easy to be pissed off when there are bugs, and things seem to not
>>>>> be making progress, but let's try and keep things positive and work
>>>>> together to make Mesa the best we can.
>>>>
>>>> I'd like to second this with my experience from the kernel community. The
>>>> public shaming game for when you create a regression is very strong there,
>>>> led by Linus Torvalds. In my experience this directly causes:
>>>>
>>>> - Maintainers to hide bug reports and regressions reports at all costs,
>>>>    because having Linus destroy you just ain't never worth it. The meta game
>>>>    becomes "avoid getting railed" instead of "deliver quality code", and
>>>>    there's lots of ways to easily achieve the former that seriously hurt the
>>>>    latter.
>>>>
>>>> - Best practice (in my experience) is to not mention the dreaded
>>>>    "REGRESSION" tag when you need another maintainer's help to fix a
>>>>    regression, because it's too likely they'll just panic. That means they
>>>>    start screaming at you to go away, or their brain locks up and they can't
>>>>    effectively help you track down the bug (seen both cases).
>>>>
>>>> - Creates a culture where talking about process/tooling improvements to
>>>>    prevent regressions and/or handle them quicker becomes too dangerous,
>>>>    because it all turns into a personal shaming game of who maintains the
>>>>    worst subsystem.
>>>>
>>>> Long term you end up with a culture fucked up for good :-/
>>>>
>>>> Imo the only way to make this better is to try analyzing why a regression
>>>> happened, and fix the tooling to prevent that in the future. Maybe better
>>>> test coverage (and long term efforts to fix known gaps), maybe better
>>>> presentation of automated checks (stuff like github pull requests that
>>>> automatically run CI and report full results, blocking the merge if
>>>> anything is amiss).
>>>
>>> You have to have a very strong CI to use it to block commits.  i965 Mesa
>>> has a big CI which identifies many regressions, but I wouldn't want to
>>> checkpoint commits in an automated way.  A large pool of obsolete
>>> CI hardware will have lower reliability than the mesa master branch --
>>> which generates noise for developers and impedes progress.
>>
>> This was all in general about blaming regressions on people, not
>> specifically for the stable-backporting-from-master issue here.
>>
>> And if parts of your CI can't autogate then you can make it more
>> informal - there's definitely stuff you want to autogate, like "does
>> it compile everywhere in all configs", and probably you don't want to
>> autogate on gen2 dying :-)
> 
> It's a bit different for us, because multiple companies and volunteers
> can push.  We have a buildtest which prevents intel engineers and any CI
> user from breaking radeon for example.  However, radeon still breaks
> when AMD devs push LLVM-version-dependent patches.  We can't stop that,
> and there are a set of similar situations where builds break.  Reverts
> and quick fixes are fine for this IMO.
> 
>> My point was if you don't want regressions, make it as easy as
>> possible for people to never push a regression (whether master or
>> stable trees) instead of a pillory or other blaming exercises. Little
>> things (like whether your CI results are in some mail somewhere, maybe
>> for an outdated version of your patches on a different baseline, or
>> right next to the "do you really want to merge" button) matter.
> 
> Agreed.  Anyone can painlessly test in our CI, and the majority of
> developers verifying patches in our CI are external.  We offer it to
> them after a regression is detected.  Usually, they make use of the CI,
> because they care about the product, and they want their patches to be
> great.
> 
> There have been a few situations where developers have skipped CI for
> what they thought was a trivial patch, and they caused regressions for
> everyone.  Lazy behavior can be quite disruptive, and can inflict cost
> on the community that you want to participate in.

I'd just like to point out that as an outside user of Intel's CI I have
missed regressions on a couple of occasions. However, this was not due to
"lazy behaviour". Having a CI system is fantastic and I'm very grateful
to have access to it, but it's not uncommon to run into issues and
have no idea what is going on with the system.

Some examples: getting no emails back from the system after pushing;
results that look like a successful run even though things have failed,
or no results email at all; results with tonnes and tonnes of fails
which are clearly unrelated to my latest branch; and, on occasion, the
system seemingly crashing and being unresponsive for a whole weekend
(which means it is down for me on a Monday in Australia).

If the change is for i965 I either wait or try to bug someone at Intel (if
anyone happens to be around) to find out what is going on, but for core
mesa changes, having run piglit on radeonsi locally, I tend to push my
changes. I get that regressions are frustrating at times, but using the
CI system as an outside user can also be frustrating when you have no
idea what's going on in the black box after pushing a branch, especially
when you need to wait an hour or so between runs to try again.

At first I gave feedback on ways the system could be better, or on errors I
seemed to hit, but was told there wouldn't be any improvements made for
the foreseeable future. So I stopped giving feedback and instead switched
to relying on my own local testing when the CI system seemed to have
lost its mind.

Anyway, this is not meant to be a criticism; I just wanted to share my
experience as an outside user.