[PATCH v1 weston 06/11] tests: Add screenshot recording to weston-test

Tue Nov 25 17:09:59 PST 2014

On Tue, Nov 25, 2014 at 11:46:05AM -0800, Bill Spitzak wrote:
> On 11/25/2014 07:11 AM, Derek Foreman wrote:
> 
> >What's fuzz exactly?  Each pixel can be within +-fuzz/2 on each color
> >component and still be a match?  a fuzz of 256 would match everything?
> 
> The cairo project has done some work on a fuzzy comparison for their
> tests, may be able to reuse that.

I've messed with this in Cairo (and in Inkscape previously), while
writing tests and debugging rendering errors.  There are several
different ways things tend to fail.

One is simply caused by rounding during antialiasing or whatever.  So,
instead of a pixel being RGB(100,100,100) you get RGB(99,99,99).  For
most purposes that's good enough.  So the fuzz is there to make the
comparison routine accept color components that are off by up to +/- N.
That should be a straightforward mod of our existing code.
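
As a rough sketch of that kind of check (assuming 32-bit pixels with four
8-bit channels, and reading fuzz as "+/- N per channel"; the function
names are made up for illustration and are not existing weston code):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: true if two 8-bit channel values differ by at
 * most fuzz levels. */
static bool
channel_within_fuzz(uint8_t a, uint8_t b, int fuzz)
{
	return abs((int)a - (int)b) <= fuzz;
}

/* Hypothetical comparison: every channel of every pixel must be within
 * +/- fuzz of the reference.  Assumes 32-bit pixels, 8 bits per channel. */
static bool
pixels_match_fuzzy(const uint32_t *ref, const uint32_t *test,
		   size_t count, int fuzz)
{
	for (size_t i = 0; i < count; i++) {
		for (int shift = 0; shift < 32; shift += 8) {
			uint8_t a = (ref[i] >> shift) & 0xff;
			uint8_t b = (test[i] >> shift) & 0xff;

			if (!channel_within_fuzz(a, b, fuzz))
				return false;
		}
	}

	return true;
}
```

With this reading, a fuzz of 0 is an exact match and a fuzz of 255 would
accept any pair of pixels.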

Second is caused by algorithmic differences.  We saw this with the
different pixman filtering algorithms used in downscaling images in
Cairo.  So, one algorithm might make two pixels 50% grey, while another
algorithm might make one pixel 75% and the other 25%.  Stuff like that.
Antialiasing can also produce variances like this.  In Cairo we work
around this by letting different backends have their own reference
images.  But I think a better approach is to be more precise in
selecting what to test; crop out a rectangular box where you expect it
must be blue and verify it is.
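
A minimal sketch of that "crop a box and verify its color" idea,
assuming a 32-bit ARGB buffer with the stride given in pixels (the
struct and function names are hypothetical, not existing weston or
Cairo API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rectangle type, in pixel coordinates. */
struct rect {
	int x, y, width, height;
};

/* Hypothetical check: true only if the box lies entirely inside the
 * image and every pixel inside it equals the expected value.
 * stride_px is the row pitch in pixels, not bytes. */
static bool
region_is_color(const uint32_t *pixels, int stride_px,
		int img_w, int img_h, struct rect box, uint32_t expected)
{
	if (box.x < 0 || box.y < 0 || box.width <= 0 || box.height <= 0)
		return false;
	if (box.x > img_w - box.width || box.y > img_h - box.height)
		return false;

	for (int y = box.y; y < box.y + box.height; y++) {
		for (int x = box.x; x < box.x + box.width; x++) {
			if (pixels[y * stride_px + x] != expected)
				return false;
		}
	}

	return true;
}
```

A test would pick the box well inside the shape, so that antialiased
edges and filtering differences fall outside the region being checked.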

A third case I've seen is where alpha blending produces different
resultant colors for shapes.  Every time this was seen, though, it was
considered a bug...

A common fourth case is that a given backend simply hasn't implemented
XYZ (e.g. alpha transparency in PostScript).  XFail or custom reference
images are used here too, but I wonder if this could be handled better.

Anyway, Cairo's pdiff stuff doesn't really differentiate between these
different types of rendering failures - it just sums up how many pixels
differ and returns a percentage.  (Maybe it weights the tally based on
how different the color is, I forget.)  As a result, Cairo's testsuite
emits a lot of false positives and false negatives, and tests end up
requiring more maintenance than they should.  So I'm game to start from
what Cairo does but we shouldn't be shy about shooting for something a
bit more clever where feasible.
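
To illustrate why a bare tally loses information, here is a sketch of a
"percentage of differing pixels" metric.  This is not Cairo's actual
code, just the simplest version of the idea described above, with a
made-up function name:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical metric: percentage of pixels that are not bit-identical
 * to the reference.  It ignores how far off each pixel is. */
static double
percent_pixels_differing(const uint32_t *ref, const uint32_t *test,
			 size_t count)
{
	size_t differing = 0;

	if (count == 0)
		return 0.0;

	for (size_t i = 0; i < count; i++) {
		if (ref[i] != test[i])
			differing++;
	}

	return 100.0 * (double)differing / (double)count;
}
```

One pixel that is off by a single level in one channel and one pixel
that is completely the wrong color give the same score here, so the
number alone can't distinguish a rounding difference from a real
rendering failure.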

Bryce