[Openicc] Xorg low level buffers - colour conversion

Tomas Carnecky tom at dbservice.com
Sat Mar 8 01:38:14 PST 2008


Gerhard Fuernkranz wrote:
> Tomas Carnecky wrote:
>> If the color conversion is just a matrix transformation, then that
>> can be done very easily in a few lines of a shader program. However,
>> if it involves lookup tables and such additional data, then the
>> shader becomes a bit more complicated.
> 
> IMO the general case (apply a device link) typically involves
> (tetrahedral) interpolation of multi-dimensional lookup tables (rather
> big ones, with a magnitude of say 10000..100000 table entries), and only
> special cases can be handled in a simpler way (e.g. TRC -> matrix ->
> TRC), though I'm not sure whether this simpler computation will be
> really much faster eventually.

I don't think that matters much for GPU performance either. The thing is 
that the lookup tables are a bit more complicated to set up than a pure 
shader solution (basically you first have to upload the LUTs into video 
card memory as textures, and they can't be shared between applications).
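
To give an idea, a fragment shader applying a 3D LUT could look roughly 
like this. This is only a sketch: it assumes the application has already 
uploaded the LUT as a 3D texture (e.g. via glTexImage3D) with linear 
filtering enabled, so the hardware does trilinear interpolation between 
the grid points -- not quite the tetrahedral interpolation a CMM would 
normally use -- and the 33^3 grid size is made up:

uniform sampler3D lut;	/* the LUT, uploaded as a 3D texture */

void main()
{
	/* scale/offset so that 0..1 input hits the texel centers
	   of the N^3 grid rather than the texture edges */
	const float N = 33.0;
	vec3 c = gl_Color.rgb * ((N - 1.0) / N) + (0.5 / N);
	/* the texture unit interpolates between LUT entries */
	gl_FragColor = vec4(texture3D(lut, c).rgb, gl_Color.a);
}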

> Btw, can GPUs only do massive parallel floating point operations, or
> also massive parallel integer operations (which is of course only of
> interest if the latter are even faster than the FP operations then)?

No integer operations (yet); I think the next generation of GPUs will 
have them. They usually only do 16/32-bit floating point, with no double 
precision either.

> Sorry for my ignorance, I'm also wondering, does one just need to write
> the shader program for the color transformation of a single pixel, and
> this program gets then vectorized and applied to each pixel
> automatically (and in parallel) by OpenGL and the GPU?

The fragment shader (which is what is of interest here) is executed for 
every fragment (~pixel) separately. A simple (no-op) fragment shader 
looks like this:

void main()
{
	/* just pass the incoming colour through unchanged */
	gl_FragColor = gl_Color;
}
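
And the simple case Gerhard mentioned (TRC -> matrix -> TRC) is not much 
longer. Again only a sketch: the matrix would be set as a uniform by the 
application, and the plain power-law curves with a made-up 2.2 gamma 
stand in for real TRCs, which would rather be 1D texture lookups:

uniform mat3 m;	/* colour matrix, set by the application */

void main()
{
	/* input TRC, matrix, output TRC */
	vec3 lin = pow(gl_Color.rgb, vec3(2.2));
	gl_FragColor = vec4(pow(m * lin, vec3(1.0 / 2.2)), gl_Color.a);
}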

>> If you want to have the shader running on the GPU, you first have to 
>> upload the data to the graphics card memory, then run the shader, and 
>> then copy the result back to RAM. That adds some delay, especially the 
>> reading back to system RAM, which is slow on the AGP bus (much faster
>> on PCI-Express).
> 
> Even if reading back is not so fast, I'm wondering whether processing 
> a complete image on the GPU may still be faster than doing the 
> multi-dimensional interpolation for each pixel on the CPU? (For 
> comparison, for 8-bit 3D color transformations I get about 14 Mpixel/s 
> with Argyll's IMDI routines on my Mobile AMD Athlon(tm) 64 4000+, and 
> for 16-bit 3D transformations it's about 3 Mpixel/s -- and the IMDI 
> routines are certainly pretty fast integer interpolation routines 
> (they don't use SIMD instructions, though)). In order to beat the 
> 14 Mpixel/s it would be necessary to copy 3*14=42 Mbyte to the 
> graphics card and to copy the same amount of data back in less than 
> one second (since the interpolation takes some GPU time too). Is this 
> reasonable? (And for the 16-bit transform we'd only need to beat 
> 3 Mpixel/s, which implies copying 6*3=18 Mbyte/s forth and back, 
> + interpolation on the GPU.)
> 

AGP is not full-duplex and has a bandwidth of ~2 GB/s (AGP 8x); 
PCI-Express is full-duplex and has a bandwidth of 4 GB/s in each 
direction. My card, which sits on a PCI-Express x8 bus, can transfer 
~1 GB/s of raw data. If you use some of the OpenGL extensions to do the 
transfers asynchronously you can gain a bit of speed when processing 
lots of different images sequentially (say, video).
Processing 42 MB of data within less than one second is very much possible.
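
To put numbers on it: at ~1 GB/s each way, uploading 42 Mbyte takes 
roughly 42 ms and reading 42 Mbyte back takes about the same, so the 
transfers alone use well under a tenth of the one-second budget, leaving 
plenty of GPU time for the interpolation itself.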

Btw, do the transformation engines in the CM systems use MMX/SSE, or 
are the routines otherwise optimized in assembler? Or is it all written 
in C?

tom

