Acceptable outcomes of SwarmSolverTest::testUnconstrained

Stephan Bergmann sbergman at
Wed Feb 28 08:41:44 UTC 2018

On 28.02.2018 02:36, Tomaž Vajngerl wrote:
> On Tue, Feb 27, 2018 at 4:53 PM, Stephan Bergmann <sbergman at> wrote:
>> SwarmSolverTest::testUnconstrained in sccomp/qa/unit/SwarmSolverTest.cxx has
>> already been weakened in the past,
>> <>
>> "Ridiculously large delta for SwarmSolverTest::testUnconstrained for now"
>> and
>> <>
>> "Weaken SwarmSolverTest::testUnconstrained even further for now".  The first
>> one has the following in its commit message: "suggestion by Tomaž Vajngerl
>> was: 'Let's adapt the delta for now. Generally anything close to 3 should be
>> acceptable as the algorithm greatly depends on random values.'"
>> Now <> failed with
>>> /home/tdf/lode/jenkins/workspace/lo_ubsan/sccomp/qa/unit/SwarmSolverTest.cxx:106:(anonymous
>>> namespace)::SwarmSolverTest::testUnconstrained
>>> double equality assertion failed
>>> - Expected: 3
>>> - Actual  : 94.6605927051114
>>> - Delta   : 0.9
>> Is that also an acceptable outcome, or does it indicate a bug somewhere that
>> would need to be fixed?  What good is a test whose success criterion is the
>> result of ad-hoc guesswork, instead of being determined precisely up-front
>> when the test was written?
>> Can that test please be fixed properly, so that it would be actually useful?
> Well, it is neither - that's just the nature of stochastic algorithms.
> It is not the fault of the test - the way it was defined at the
> beginning reflects the exact outcome we would expect (just as the
> global maximum of a function is exactly one value). The problem is
> that the algorithm itself doesn't guarantee that it will find that
> solution, or come close enough to it, within its allotted time or
> number of generations, or it may just get stuck in some local
> extremum. However, on a fast enough CPU this should usually happen
> only with a small statistical probability in a normal run of the
> algorithm.

Then those qualities of the algorithm need to be taken into account when 
writing the test, I think.  A small probability of failure is apparently 
still a problem.  We need tests to be reliable.

> Maybe I'm wrong, but I don't see this failing in tinderboxes or
> jenkins, so I wonder what ubsan does to make it fail. The algorithm
> has a time limit; could it be that the execution is slowed down so
> much that the result doesn't develop enough? (I didn't expect this to
> be the case.) Could we skip it for ubsan only?

Those ASan+UBSan tinderbox builds execute rather slowly, yes. 
(<> claims "Typical 
slowdown introduced by AddressSanitizer is 2x.")

But also as reported by others today on #libreoffice-dev:

> Feb 28 09:17:32 <buovjaga>	sberg: I got a SwarmSolver failure yesterday. Then I pulled later and the next build went fine.
> Feb 28 09:19:03 <buovjaga>	After the failure, soffice refused to start. I don't have logs, unfortunately
