[Mesa-dev] Common GL usage pattern where mesa/gallium is underperforming

Jerome Glisse j.glisse at gmail.com
Thu Jan 13 14:02:24 PST 2011


Hi,

This mail is mostly a reminder to myself, and also an attempt to get
someone to look into it before I get more time for it myself :)

A common GLSL usage pattern is:
glUseProgram(Program1);
glActiveTexture(GL_TEXTURE0 + 0);
glBindTexture(...);
glActiveTexture(GL_TEXTURE0 + 1);
glBindTexture(...);
...
glUniform4fARB(...);
...
glDraw()

glUseProgram(Program2);
glActiveTexture(GL_TEXTURE0 + 0);
glBindTexture(...);
glActiveTexture(GL_TEXTURE0 + 1);
glBindTexture(...);
...
glUniform4fARB(...);
...
glDraw()

Such a usage pattern shouldn't trigger much computation inside Mesa,
as we are not modifying texture object state or doing anything fancy;
it's just switching between GLSL programs and textures. That sounds
like a very common pattern for any GL program that does something
other than spinning gears :)

I added glslstateschange to perf in mesa demos to measure performance
with such a usage pattern. Here are some results (Core i5 @ 3GHz; the
gap to the closed drivers is even worse with a slower CPU):
noop          105.0 thousand changes/sec
nouveau        13.0 thousand changes/sec
fglrx-nosmp    57.5 thousand changes/sec
fglrx          73.4 thousand changes/sec
nvidia-nosmp  158.8 thousand changes/sec
nvidia        277.8 thousand changes/sec

All profiles/datas can be downloaded at
http://people.freedesktop.org/~glisse/results/

The noop driver numbers show that we are severely underperforming: we
are slower than the closed nvidia driver even though we perform no
real rendering, and we don't outperform fglrx by that much. Profiling
the noop driver and also nexuiz (which has a similar usage pattern)
points at a couple of culprits.

The biggest offender is the recreation of the sampler state each time
a texture is bound. The memset in update_samplers (st_atom_sampler.c)
is likely useless, but the real optimization is to build the sampler
state alongside the sampler view state, since the only input that
doesn't come from the texture object is the lod bias value (unless I
missed something). So the idea is to build the sampler along with the
sampler view when a texture object is finalized, and to update that
sampler state in update_samplers only when the lod bias value changes.
This should avoid a lot of CSO creation overhead and speed up the
driver.
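
A rough sketch of that idea (my code, not anything in the tree;
pipe_sampler_state and its lod_bias field are real gallium state, the
struct and helper names below are just illustrative):

#include "pipe/p_state.h"   /* struct pipe_sampler_state (gallium) */

struct cached_sampler {
   struct pipe_sampler_state sampler;  /* built once at finalize time */
   float lod_bias;                     /* lod bias currently baked in */
};

static void
cached_sampler_finalize(struct cached_sampler *cs,
                        const struct pipe_sampler_state *templ)
{
   /* Everything except lod_bias comes from the texture object, so it
    * can be filled in here instead of on every texture bind. */
   cs->sampler = *templ;
   cs->lod_bias = templ->lod_bias;
}

static const struct pipe_sampler_state *
cached_sampler_get(struct cached_sampler *cs, float lod_bias)
{
   /* Only touch the cached state when the one per-unit variable
    * changed; otherwise the existing CSO can be reused as-is. */
   if (cs->lod_bias != lod_bias) {
      cs->sampler.lod_bias = lod_bias;
      cs->lod_bias = lod_bias;
   }
   return &cs->sampler;
}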

I am not sure why update_textures shows up so high in the profiles; my
guess is that pipe_sampler_view_reference is burning CPU cycles since
we likely have a lot of texture units.
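
For reference, the kind of loop I suspect is hot (a hand-written
illustration, not the actual update_textures code;
pipe_sampler_view_reference is the real gallium helper, NUM_UNITS and
the arrays are placeholders):

#include "pipe/p_state.h"
#include "util/u_inlines.h"   /* pipe_sampler_view_reference() */

#define NUM_UNITS 16

static void
rebind_sampler_views(struct pipe_sampler_view *bound[NUM_UNITS],
                     struct pipe_sampler_view *wanted[NUM_UNITS])
{
   for (unsigned i = 0; i < NUM_UNITS; i++) {
      /* Adjusts reference counts whenever the view changes; with many
       * units this runs on every state validation. */
      pipe_sampler_view_reference(&bound[i], wanted[i]);
   }
}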

Texture state is also a big offender: Mesa revalidates texture state
and texture coordinate generation each time a new shader program is
bound. The plan here is to compute, for each program, a mask over
_EnabledUnits, _GenFlags and _TexGenEnabled (see update_texture_state
in main/texstate.c); this mask can be derived from program information
since it is constant for a given program. Then only recompute state if
we see state changes in a masked unit (i.e. a unit that affects the
bound program), as in the sketch below.
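
A minimal sketch of that masking, assuming the per-program mask can be
taken from something like gl_program::SamplersUsed (the helper names
and the dirty_units parameter are mine, not existing Mesa code; the
real mask would also fold in _GenFlags/_TexGenEnabled):

#include "main/mtypes.h"   /* gl_program, GLbitfield */

static GLbitfield
program_texture_unit_mask(const struct gl_program *prog)
{
   /* Constant for the lifetime of the program. */
   return prog->SamplersUsed;
}

static GLboolean
texture_state_dirty_for_program(const struct gl_program *prog,
                                GLbitfield dirty_units)
{
   /* Only revalidate when a unit that can affect this program has
    * changed state. */
   return (dirty_units & program_texture_unit_mask(prog)) != 0;
}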

I will get back to this optimization later (in a couple of weeks
hopefully), but if you have more ideas, or if someone remembers an
easy improvement that can help in these situations, that would be
welcome :)

Cheers,
Jerome
