[Telepathy] Improving our testing by generating the scenarios

Dafydd Harries dafydd.harries at collabora.co.uk
Sat Mar 7 07:57:08 PST 2009


On 02/03/2009 at 11:51, Sjoerd Simons wrote:
> Hi,
> 
>   So for a while I've been pondering how we can improve the way we write tests.
>   Mostly for Wockey (the successor to Gibber), because I want that to have very,
>   very comprehensive tests. And a big part of making that happen is making it
>   easy to write tests :)

Rethinking our approach to tests is always good, as is making tests easier to
write!

>   To start off with a real story: this weekend I was hacking on Empathy's VoIP
>   support a bit more, specifically ensuring that the upcoming Gabble release
>   works well with it, when I noticed that Gabble gets the creator attribute
>   wrong when creating a new content _if_ it wasn't the creator of the session.
>   Easy enough to add tests for that, you'd think: just add a check for the
>   attribute when Gabble sends a content-add. Unfortunately this means digging
>   through all the tests looking for cases where Gabble does the content-add and
>   ensuring that it's checked at least once for the case where Gabble initiates
>   the session and once where it doesn't... in other words, quite cumbersome.

Yes, this does sound like a tricky case. I don't know the details of this
specific case, but I would hope that the check could be added in a place where
it would automatically get made for each such stanza.

>   Now, taking a step back and looking at how we write tests: each test is
>   basically a ``small'' scenario. To test different aspects (e.g. an FT channel
>   is opened and accepted, an FT channel is opened and rejected), different tests
>   are written. If things are properly done, most of this is abstracted and
>   writing a slightly different scenario is easy. In a lot of cases, it is done
>   by copying an existing test and tweaking it slightly.
> 
>   This works reasonably well if the number of possible scenarios is quite
>   small, but in practice that's not actually the case and we either write loads
>   and loads of boring code or only test a few common cases (guess what happens
>   in most cases)...
> 
> 
>   When looking at the various related scenarios, you'll see that they all
>   basically consist of mostly the same small steps, where some steps are
>   slightly different to test different things. So what we actually want to do
>   is to implement the various possible steps exactly once and then combine them
>   into different scenarios. But writing scenarios out of little steps is
>   still boring, and there can be quite a few of them in, for example, Jingle.
> 
>   Instead, the scenarios should be generated automatically.
>   Now, to make this happen you need to know how all the steps fit together (e.g.
>   you can't accept an FT if there is no FT channel open). For that purpose, each
>   step should declare pre- and post-conditions. To give some example steps:
> 
>   def start_incoming_ft (state):
>     PRE:
>     POST: incoming_ft
>     ....
> 
>   def accept_incoming_ft (state):
>     PRE: incoming_ft && !incoming_ft_open
>     POST: incoming_ft_open
>     ....
> 
>   def reject_incoming_ft (state):
>     PRE: incoming_ft && !incoming_ft_open
>     POST:
>     ....
> 
>   Possible scenarios that come out of this would be getting an FT and
>   accepting it, or getting an FT and then rejecting it, but also rejecting the
>   first two incoming FTs and accepting the third one. Obviously the number of
>   potential scenarios goes up quite fast when there are multiple potential next
>   steps, which makes it much more useful.

This seems more or less like declaring a state machine in the code, and then
having the test suite follow various paths through that machine.
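
In Python (which our tests are already written in), declaring such steps might
look something like the sketch below. To be clear, this is only my invention,
not an existing API: the step decorator, the flag names and the representation
of conditions as sets of flags are all made up for the sake of the example.

  def step(pre=(), pre_not=(), post=(), clears=()):
      """Tag a test step with the flags it requires, forbids, sets and
      clears, so a driver can work out when the step is applicable."""
      def decorate(func):
          func.pre = frozenset(pre)
          func.pre_not = frozenset(pre_not)
          func.post = frozenset(post)
          func.clears = frozenset(clears)
          return func
      return decorate

  @step(post=['incoming_ft'])
  def start_incoming_ft(state):
      pass  # simulate the peer offering a file transfer

  @step(pre=['incoming_ft'], pre_not=['incoming_ft_open'],
        post=['incoming_ft_open'])
  def accept_incoming_ft(state):
      pass  # accept the FT channel, check the stanzas Gabble sends

  @step(pre=['incoming_ft'], pre_not=['incoming_ft_open'],
        clears=['incoming_ft'])
  def reject_incoming_ft(state):
      pass  # reject the FT channel

  STEPS = [start_incoming_ft, accept_incoming_ft, reject_incoming_ft]

  def applicable(steps, flags):
      """Return the steps whose pre-conditions hold for the current flags."""
      return [s for s in steps
              if s.pre <= flags and not (s.pre_not & flags)]

The PRE/POST lines in your examples map onto pre and post here; I added
pre_not and clears so that "incoming_ft && !incoming_ft_open" and "the FT goes
away after a reject" can be expressed.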

>   Doing things this way means we can basically generate all possible scenarios
>   and ensure quite exhaustive testing. A first trivial implementation would be
>   to just generate all tests of, say, fewer than 10 steps (to ensure the tests
>   stop in a reasonable timeframe).

Well, actually exhausting the possibilities gets impractical for complicated
scenarios. But I definitely agree that we can do better than what we have.
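
That said, the bounded generation you describe looks simple enough to sketch.
Continuing the (invented) step declarations above, enumerating every scenario
of at most N steps is just a recursive walk over the applicable steps:

  def all_scenarios(steps, max_steps, flags=frozenset(), prefix=()):
      """Yield every sequence of at most max_steps steps whose
      pre-conditions are satisfied along the way."""
      if prefix:
          yield prefix
      if len(prefix) >= max_steps:
          return
      for s in applicable(steps, flags):
          new_flags = (flags | s.post) - s.clears
          for scenario in all_scenarios(steps, max_steps,
                                        new_flags, prefix + (s,)):
              yield scenario

  # e.g.:
  #   for scenario in all_scenarios(STEPS, 10):
  #       state = {}
  #       for s in scenario:
  #           s(state)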

If we are faced with not being able to test all possibilities, it can make
sense to randomly generate test cases. In the context of the framework you
suggest above, this would mean randomly generating a sequence of state
transitions. Test failures can be made repeatable by recording the choices
made, and allowing a way to manually control them in subsequent runs.
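
For our framework, a driver along these lines might do. Again this only builds
on the sketch above, and recording a random seed is just one way of recording
the choices so that a failing run can be replayed:

  import random

  def run_random_scenario(steps, max_steps=10, seed=None):
      """Run one randomly chosen scenario. Pass the printed seed back in
      to replay exactly the same sequence of steps."""
      if seed is None:
          seed = random.randrange(2 ** 32)
      rng = random.Random(seed)
      flags = frozenset()
      state = {}
      chosen = []
      for _ in range(max_steps):
          candidates = applicable(steps, flags)
          if not candidates:
              break
          s = rng.choice(candidates)
          chosen.append(s.__name__)
          try:
              s(state)
          except Exception:
              print('failed scenario %r (seed %d)' % (chosen, seed))
              raise
          flags = (flags | s.post) - s.clears
      return seed, chosen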

This is the approach taken by tools like QuickCheck and SmallCheck:

  http://www.cs.chalmers.se/~rjmh/QuickCheck/

SmallCheck differs from QuickCheck in that it generates cases inductively
rather than randomly. (Lazy SmallCheck is even more magic in that it detects
non-strictness in tested code to avoid redundant testing, but I doubt we can
apply that here.)

>   If we go completely crazy we could combine all the steps from all the tests,
>   which would then suddenly test whether things are still OK if we're sending
>   text messages while receiving a file and making an outgoing VoIP call. But I
>   don't think we have precise enough control over Gabble to actually make that
>   practical (or at least, it won't be easy).
> 
>   The downside is obviously that you actually need to think about how to split
>   tests up into steps that are not dependent on Gabble's internal state and
>   that allow for a large number of possible tests. But that also makes it less
>   boring :)

Yes, I can imagine that correctly matching the assertions made to
the state of the test might be quite tricky. QuickCheck solves this by having
assertions be invariants that must always hold; perhaps we can apply this by
having our tests be state-free, by which I mean that tests are simply
functions that take the current state and return a boolean indicating whether
it's valid or not. But I'm not sure that tests of this form would be easier to
write.
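
To make the idea a bit more concrete anyway: an invariant would just be a
predicate over the test state, checked after every step rather than hand-placed
in individual scenarios. The state fields below are invented for the example:

  def ft_flags_consistent(state):
      """An FT channel can only be open if an FT is actually in progress."""
      return not state.get('incoming_ft_open') or state.get('incoming_ft')

  def creator_attribute_correct(state):
      """Every content we have seen announced carries the expected creator."""
      return all(c['creator'] == c['expected_creator']
                 for c in state.get('contents', []))

  INVARIANTS = [ft_flags_consistent, creator_attribute_correct]

  def check_invariants(state):
      broken = [i.__name__ for i in INVARIANTS if not i(state)]
      assert not broken, 'invariants violated: %s' % ', '.join(broken)

If the scenario driver called check_invariants(state) after every step, the
creator check from your VoIP example would be exercised by every generated
scenario that adds a content, rather than only by the tests we remembered to
copy it into.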

>   Now, hopefully there are already testing frameworks that work like this (or
>   daf/wjt can implement it in 5 lines of Haskell). So hopefully someone
>   can point me to one of those (or volunteers to implement it for tp).

This reminds me of SQLite's description of how they test things:

  http://www.sqlite.org/testing.html

There are several ideas there that might be useful to us:

 - having both quick and thorough test suites
 - simulating IO errors
 - fuzzing
 - test coverage

Some we already do, like memory leak detection (by running the test suite
under valgrind).

I think the simplest step to improve our testing today would be to start
using test coverage data. That might not give us as much benefit as a new
testing method, but it will continue to be useful when we do have a new
testing method, and it will also give us a way to compare the results that
different approaches give us.


