probabilistic approach to tinderboxing

Bjoern Michaelsen bjoern.michaelsen at canonical.com
Tue Jun 12 07:45:06 PDT 2012


Hi all,

this is a proposal to get as much focused tinderboxing as possible with limited
resources and unreliable build clients. This also assumes patches under
consideration to be independant (that is: without conflicts) and not dependant
on a specific order. gerrit allow us to have patches independant of order as
they are not yet merged/picked to the branch. We can also easily exclude any
patch on gerrit that conflicts against the existing patches under
consideration: To do so we cherry-pick the new patch on the tip of the branch
and then cherry-pick all patches under consideration on top of that. If it
conflicts, we can report that back.(*)

== Testcase == 

Tinderboxes will be grouped in "testcases": All tinderboxes in belonging to one
"testcase" are assumed to provide the same results for the same input. The
simplest testcase is: "completing a build on platform foo" while more complex
testcases include "completed a build on platform foo and run checks".  A
testcase is assumed to have a rather stable long term empiric value that one
patch does not breaks it: p-innocent(testcase). Whenever a new patch is
commited for consideration we assume its inital "innocence" to be average:

 p-innocent(patch) = p-innocent(testcase)

== Giving a tinderbox a subset to test ==

There are two possibly useful results out of a tinderbox run:
- If the tinderbox run is successful, we can mark all patches as good and
  remove them from the patches under consideration
- If the tinderbox run is a failure and only contains one patch, we can mark
  that patch as bad and remove it from the patches under consideration
- All other cases have no directly deductable action

=== optimistic tinderbox run ===

A completed tinderbox run provides us with the most information, if we assume
it to fail half of the time (biggest insecurity). So, this is the first thing
to give a tinderbox to test:
 - Pick the most innocent looking patches (highest p-innocent(patch))
   - for patches with the same innocence prefer the oldest ones first
 - Add patches as long as the innocence of the patchset:
   p-innocent(patchset) = product(p-innocent(patchn) ...)
   is above 0.5
After sending off the tinderbox with this patchset, do:
   p-innocent(patch) *= p-innocent(testcase)
reducing the innocence of the patches without waiting for results. So we assume
the patches of the set to be partially guilty until proven innocent.

If the tinderbox succeeds, mark all patches as good and remove them from the
patches under consideration.

=== pessimistic tinderbox run ===

A failed tinderbox run provides us with most information, if it contains only
one patch. So if the innocence of one or more patches drops below 0.5:

 p-innocent(dirtypatch)<= 0.5

or if there is currently only one patch under consideration instead of doing
the optimistic approach above, do a pessimistic tinderbox run with just one
patch. Select the patch with the lowest p-innocent(patch) and, if there are
multiple patches with equal p-innocent(patch), select the oldest candidate.
After sending off the tinderbox with just this one patch, do:

 p-innocent(patch)/=p-innocent(testcase)

for all patches in each tinderbox run with this patch (once for each run)
without waiting for the tinderbox to return with a result. So we assume most of
those patches to be more innocent now (hoping the blame is on the dirtypatch).
The suspicous patch should now be back to the initial value:

 p-innocent(dirtypatch)=p-innocent(testcase)

The tinderbox will hopefully mark that patch as good or bad soon anyway and
remove it from the patches under consideration.

== advantages ==
This approach:
- Does not rely on the tinderboxes to ever come back with a result
  -- it should be robust against lost results.
- does not need to test each patch individually, if the average patch quality
  is not too bad
- requires only little bookkeeping on the side of the dispatcher
- should allow us to confidently mark all patches as 'doesnt break master for
  this testcase'

Opinions/Additions/Corrections? Objections? If not, we only need somebody to script
this ;)

Best,

Bjoern

P.S.: Dont show my careless use of 'propabilities' and such to the physics
department at the university where I got my degree. It will just result in
snorty laughs and cruel teasing.

(*) Conflicts might be rare though, so we might even get away with the
optimistic approach: just hoping that considered patches never conflict.Hi all,

this is a proposal to get as much focused tinderboxing as possible with limited
resources and unreliable build clients. This also assumes patches under
consideration to be independant (that is: without conflicts) and not dependant
on a specific order. gerrit allow us to have patches independant of order as
they are not yet merged/picked to the branch. We can also easily exclude any
patch on gerrit that conflicts against the existing patches under
consideration: To do so we cherry-pick the new patch on the tip of the branch
and then cherry-pick all patches under consideration on top of that. If it
conflicts, we can report that back.(*)

== Testcase == 

Tinderboxes will be grouped in "testcases": All tinderboxes in belonging to one
"testcase" are assumed to provide the same results for the same input. The
simplest testcase is: "completing a build on platform foo" while more complex
testcases include "completed a build on platform foo and run checks".  A
testcase is assumed to have a rather stable long term empiric value that one
patch does not breaks it: p-innocent(testcase). Whenever a new patch is
commited for consideration we assume its inital "innocence" to be average:

 p-innocent(patch) = p-innocent(testcase)

== Giving a tinderbox a subset to test ==

There are two possibly useful results out of a tinderbox run:
- If the tinderbox run is successful, we can mark all patches as good and
  remove them from the patches under consideration
- If the tinderbox run is a failure and only contains one patch, we can mark
  that patch as bad and remove it from the patches under consideration
- All other cases have no directly deductable action

=== optimistic tinderbox run ===

A completed tinderbox run provides us with the most information, if we assume
it to fail half of the time (biggest insecurity). So, this is the first thing
to give a tinderbox to test:
 - Pick the most innocent looking patches (highest p-innocent(patch))
   - for patches with the same innocence prefer the oldest ones first
 - Add patches as long as the innocence of the patchset:
   p-innocent(patchset) = product(p-innocent(patchn) ...)
   is above 0.5
After sending of the tinderbox with this patchset, do:
   p-innocent(patch) *= p-innocent(testcase)
reducing the innocence of the patches without waiting for results. So we assume
the patches of the set to be partially guilty until proven innocent.

If the tinderbox succeeds, mark all patches as good and remove them from the
patches under consideration.

=== pessimistic tinderbox run ===

A failed tinderbox run provides us with most information, if it contains only
one patch. So if the innocence of one or more patches drops below 0.5:

 p-innocent(dirtypatch)<= 0.5

instead of doing the optimistic approach above, do a pessimistic tinderbox run
with just one patch. Select the patch with the lowest p-innocent(patch) and, if
there are multiple patches with equal p-innocent(patch), select the oldest
candidate.
After sending of the tinderbox with just this one patch, do:

 p-innocent(patch)/=p-innocent(testcase)

for all patches in each tinderbox run with this patch (once for each run)
without waiting for the tinderbox to return with a result. So we assume most of
those patches to be more innocent now (hoping the blame is on the dirtypatch).
The suspicous patch should now be back to the initial value:

 p-innocent(dirtypatch)=p-innocent(testcase)

The tinderbox will hopefully mark that patch as good or bad soon anyway and
remove it from the patches under consideration.

== advantages ==
This approach:
- Does not rely on the tinderboxes to ever come back with a result
  -- it should be robust against lost results.
- does not need to test each patch individually, if the average patch quality
  is not too bad
- requires only little bookkeeping on the side of the dispatcher
- should allow us to confidently mark all patches as 'doesnt break master for
  this testcase'

Opinions/Additions/Corrections? Objections? If not, we only need somebody to script
this ;)

Best,

Bjoern

P.S.: Dont show my careless use of 'propabilities' and such to the physics
department at the university where I got my degree. It will just result in
snorty laughs and cruel teasing.

(*) Conflicts might be rare though, so we might even get away with the
optimistic approach: just hoping that considered patches never conflict.


More information about the LibreOffice mailing list