[Mesa-dev] [PATCH 2/2] gallivm: handle srgb-to-linear and linear-to-srgb conversions

Thu Jul 11 15:21:43 PDT 2013

On 11.07.2013 19:41, Jose Fonseca wrote:
>>> Please use lp_build_polynomial. It tries to avoid data dependency.
>>> Furthermore, if we start using FMA, then it's one less place to update.
>> Ok. Are you sure it's worth avoiding data dependency at the cost of extra
>> instructions (the way I built the polynomial, it's 6 instructions, and with
>> lp_build_polynomial it would be 7)? 
> 
> I'm not sure for this particular polynomial order (you could benchmark).  It did make a significant improvement for the log2/exp2 polynomials at the time James did this.
> 
> If it's not worth it, then lp_build_polynomial should do a straight polynomial for that order and lower.  But lp_build_polynomial should still be used no matter what.  The expectation being that lp_build_polynomial will emit the best code possible for any polynomial.
Yes, I guess for low-order polys it won't make much difference either
way. I couldn't measure any difference, and if you just look at the poly
sequence it's easy to see why. The code I did initially had a dependency
chain of 3 muls and 3 adds (in clocks that would be 3*5 for the muls and
3*3 for the adds, so 24 clocks on SNB). The polynomial build doesn't
change the picture much: the dependency chain now has 3 muls and 2 adds,
which is 21 clocks (while another mul+add sequence can be done in
parallel).
If we used FMA, though, the straightforward form would definitely be
preferred, since it would be 3 FMAs (all dependent), whereas the
dependency-avoiding form would be 1 MUL + 3 FMAs with a dependency chain
of 1 MUL + 2 FMAs. Since MULs and FMAs have the same latency, that's
essentially an extra mul for nothing. Still, that's a tiny fish to fry...
FWIW, for 2nd-degree polynomials the data-dependency-avoiding sequence is
always worse, as it's going to be mul/mul/add/mul/add, all dependent
anyway, whereas the straightforward sequence would just be
mul/add/mul/add. There are no such callers, though.
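
To make the two instruction sequences concrete, here is a minimal scalar
sketch in plain C of a 3rd-degree polynomial evaluated both ways. It is
only an illustration of the operation counts discussed above, not the
gallivm code: the names poly3_horner, poly3_even_odd and chain_clocks
are made up for this mail, and the latencies (5 clocks per mul, 3 per
add) are simply the SNB figures quoted above.

```c
/* Straightforward (Horner) form of c[0] + c[1]*x + c[2]*x^2 + c[3]*x^3.
 * 3 muls + 3 adds = 6 instructions, each depending on the previous one. */
static float poly3_horner(const float c[4], float x)
{
   float r = c[3];
   r = r * x + c[2];
   r = r * x + c[1];
   r = r * x + c[0];
   return r;
}

/* Even/odd split (assumed shape of a dependency-avoiding expansion).
 * 4 muls + 3 adds = 7 instructions, but the "even" and "odd" halves
 * only depend on x2, so they can be computed in parallel. The longest
 * chain is x2 -> odd (mul, add) -> odd * x -> final add:
 * 3 muls + 2 adds. */
static float poly3_even_odd(const float c[4], float x)
{
   float x2 = x * x;
   float even = c[2] * x2 + c[0];
   float odd = c[3] * x2 + c[1];
   return odd * x + even;
}

/* Critical-path length in clocks for a chain of dependent muls and adds,
 * using the assumed latencies of 5 (mul) and 3 (add). */
static int chain_clocks(int muls, int adds)
{
   return muls * 5 + adds * 3;
}
```

With these assumptions, chain_clocks(3, 3) gives the 24 clocks of the
straightforward form and chain_clocks(3, 2) the 21 clocks of the split
form, which is why the difference is hard to measure.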


>> I thought because r/g/b will be done in
>> parallel anyway it wouldn't be much of an issue. Didn't measure it, though.
>> I am actually not really sure if fma isn't already used; while it is
>> a non-conformant optimization to fuse mul+add into fma, some
>> compilers do it by default anyway IIRC.
> 
> If you want to add a new flag to lp_build_polynomial to force a straightforward polynomial expansion, that's fine too
No, as I can't tell the difference I'll skip that :-).
I think most of the time we really have no good idea whether llvm (or the
cpu itself) has any chance of scheduling around dependencies. Only for
srgb->linear is there some rough idea that it should be possible,
because of the 3 channels we're doing in parallel.


Roland