[Mesa-dev] Common GL usage pattern where mesa/gallium is underperforming

Brian Paul brianp at vmware.com
Fri Jan 14 08:52:56 PST 2011


On 01/13/2011 03:02 PM, Jerome Glisse wrote:
> Hi,
>
> This mail is mostly a reminder to myself, and also an attempt to get
> someone to look into it before I myself get more time for this :)
>
> A common GLSL pattern is :
> glUseProgram(Program1);
> glActiveTexture(GL_TEXTURE0 + 0);
> glBindTexture(...);
> glActiveTexture(GL_TEXTURE0 + 1);
> glBindTexture(...);
> ...
> glUniform4fARB(...);
> ...
> glDraw()
>
> glUseProgram(Program2);
> glActiveTexture(GL_TEXTURE0 + 0);
> glBindTexture(...);
> glActiveTexture(GL_TEXTURE0 + 1);
> glBindTexture(...);
> ...
> glUniform4fARB(...);
> ...
> glDraw()
>
> Such a usage pattern shouldn't trigger much computation inside mesa,
> as we are not modifying texture object state or doing anything fancy;
> it's just about switching between GLSL programs and textures. That
> sounds like a very common pattern for any GL program that does
> something other than spinning gears :)
>
> I added glslstateschange to perf in mesa demos to measure
> performance under this usage pattern. Here are some results
> (core-i5 3GHz; the difference from the closed drivers is much worse
> with a slower CPU):
> noop          105.0 thousand changes/sec
> nouveau        13.0 thousand changes/sec
> fglrx-nosmp    57.5 thousand changes/sec
> fglrx          73.4 thousand changes/sec
> nvidia-nosmp  158.8 thousand changes/sec
> nvidia        277.8 thousand changes/sec
>
> All profiles/datas can be downloaded at
> http://people.freedesktop.org/~glisse/results/
>
> Obviously the noop driver shows that we are severely underperforming:
> we are slower than the closed nvidia driver while performing no real
> rendering, and we don't outperform fglrx by much. Profiling of the
> noop driver and also of nexuiz (which has a similar pattern) shows a
> couple of guilty points.
>
> The biggest offender is the recreation of the sampler each time a
> texture is bound. The memset in update_samplers (st_atom_sampler.c)
> is likely useless, but the real optimization is to build the sampler
> state along with the sampler_view state, as the only input that
> doesn't come from the texture object is the lod bias value (unless I
> missed something). So the idea is to build the sampler alongside the
> sampler_view when a texture object is finalized, and to update this
> sampler state in update_samplers only if the lod bias value changes;
> this should avoid a lot of cso creation overhead and speed up the
> driver.
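[A minimal sketch of that idea, with hypothetical simplified structures standing in for the real ones in st_atom_sampler.c; none of these names are Mesa's, and the real sampler state has many more fields:]

```c
#include <assert.h>

/* Hypothetical stand-ins for the state tracker's sampler state. */
struct sampler_state {
   int wrap_s, wrap_t, min_filter;  /* derived from the texture object */
   float lod_bias;                  /* the one per-context input */
};

struct texture_object {
   int wrap_s, wrap_t, min_filter;
   struct sampler_state cached;  /* built once when the texture is finalized */
   int rebuilds;                 /* counts (re)builds, for illustration */
};

/* Build the sampler when the texture object is finalized, instead of
 * on every bind. */
static void finalize_texture(struct texture_object *tex, float lod_bias)
{
   tex->cached.wrap_s = tex->wrap_s;
   tex->cached.wrap_t = tex->wrap_t;
   tex->cached.min_filter = tex->min_filter;
   tex->cached.lod_bias = lod_bias;
   tex->rebuilds++;
}

/* At update_samplers time, only touch the cached sampler if the lod
 * bias, the sole non-texture-object input, actually changed. */
static const struct sampler_state *
update_sampler(struct texture_object *tex, float lod_bias)
{
   if (lod_bias != tex->cached.lod_bias) {
      tex->cached.lod_bias = lod_bias;
      tex->rebuilds++;
   }
   return &tex->cached;
}
```

[With this shape, a glBindTexture/glDraw loop that never changes the lod bias never rebuilds the sampler, which is the cso-creation overhead the profile points at.]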
>
> I am not sure why update_textures shows up so much in the profile; my
> guess is that pipe_sampler_view_reference is burning CPU cycles, as
> we likely have a lot of texture units.
>
> Texture state is also a big offender: mesa revalidates texture state
> & texture coordinate generation each time a new shader program is
> bound. The plan here is to compute a mask (see update_texture_state
> in main/texstate.c) of _EnabledUnits, _GenFlags and _TexGenEnabled
> for each program (computed from program information, as it's constant
> for a given program) and only recompute state if we see state changes
> in the masked units (i.e. units that affect the bound program).
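[A sketch of the per-program mask idea: derive, once at link time, a bitmask of the texture units a program actually uses, and skip revalidation when the dirty units fall outside it. The names here are illustrative, not Mesa's real ones:]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-program info; in Mesa this would come from the
 * linked program (samplers used, _GenFlags, _TexGenEnabled, ...). */
struct gl_program_info {
   uint32_t samplers_used;   /* bit i set if the program samples unit i */
};

/* Computed once when the program is linked; constant afterwards. */
static uint32_t compute_texture_mask(const struct gl_program_info *prog)
{
   return prog->samplers_used;   /* could also fold in texgen flags, etc. */
}

/* Called on a state change: revalidate only if a dirty unit actually
 * affects the bound program. */
static int needs_texture_revalidation(uint32_t program_mask,
                                      uint32_t dirty_units)
{
   return (program_mask & dirty_units) != 0;
}
```

[Since the mask is constant per program, switching between two programs in the pattern above costs one AND per state change instead of a full walk over every texture unit.]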
>
> I will get back to this optimization later (in a couple of weeks,
> hopefully), but if you have more ideas, or if someone remembers an
> easy improvement that could help in this situation, that would be
> welcome :)

One of the root causes of this is that the _NEW_TEXTURE flag is too
broad.  We should probably have separate flags for new per-texture
parameters, new texture bindings, new texture env, etc.  As it is now,
if something as small as the texture border color or lod bias is
changed, we revalidate _all_ texture-related state.
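[A rough sketch of what finer-grained flags could look like; the flag names and the split are illustrative, not Mesa's actual _NEW_* bits:]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split of a broad _NEW_TEXTURE-style flag into narrower
 * dirty bits. */
#define NEW_TEXTURE_PARAMS  (1u << 0)  /* border color, lod bias, ... */
#define NEW_TEXTURE_BINDING (1u << 1)  /* glBindTexture */
#define NEW_TEXTURE_ENV     (1u << 2)  /* glTexEnv */

static unsigned heavy_validations;  /* counts full revalidations done */

/* With narrow flags, a border-color change no longer drags in the
 * expensive binding/env revalidation path. */
static void validate_texture_state(uint32_t dirty)
{
   if (dirty & (NEW_TEXTURE_BINDING | NEW_TEXTURE_ENV))
      heavy_validations++;   /* heavy path: rebuild derived state */
   if (dirty & NEW_TEXTURE_PARAMS) {
      /* cheap path: patch just the affected per-texture parameters */
   }
}
```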

The gallium sampler state and views should probably be stored in the 
texture objects rather than per-context, to reduce reconstruction of 
those states.

It's great that you're looking at this.  I'll just warn you that this 
area (state validation) is an easy place to introduce regressions. 
Hopefully changes can be made in logical steps that are easy to 
verify.  Lots of testing should be done.

-Brian


More information about the mesa-dev mailing list