[Mesa-dev] Path to optimize (moving from create/bind/delete paradigm to set only?)

Tue Nov 16 23:43:30 PST 2010

Hi,

On Tuesday, November 16, 2010 20:21:26 Jerome Glisse wrote:
> So i looked a bit more at what path we should try to optimize in the
> mesa/gallium/pipe infrastructure. Here are some numbers gathered from
> games:
> drawcall /   ps constant   vs constant   ps sampler   vs sampler
> doom3               1.45          1.39         9.24         9.86
> nexuiz              6.27          5.98         6.84         7.30
> openarena        2805.64          1.38         1.51         1.54
[...]

Just an other observation:
I was doing some profiling on OpenSceneGraph-based applications: one is the
plain osgviewer with numerous models, the other is FlightGear.
The drivers and hardware I have are a FireGL 73??? (R520, r300g) and an
HD4890 (RV770, r600g). Testing is done on the same CPU board.

One comparison is the draw time in osgviewer. Those who know this
application may remember the profiling graph, which shows how long and when
cull, draw and, if available, GPU rendering happen.
In this case I was looking at the draw times, which is just the time from
the first state change in a frame to the last draw in that frame, *excluding*
the buffer swap/sync/flush and whatever else serializes program execution with
the GPU.
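To make the measured interval concrete, here is a minimal C sketch of such a per-frame timer. The struct and function names (`frame_timer`, `frame_timer_state_change`, and so on) are hypothetical and not osgviewer's API. Timestamps are assumed to be passed in by the caller from a monotonic clock such as `clock_gettime(CLOCK_MONOTONIC)`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-frame timer. "Draw time" is the span from the first
 * state change of a frame to the last draw call of that frame. The buffer
 * swap/flush is never reported, so GPU synchronization is excluded. */
struct frame_timer {
    bool     started;     /* true once the frame's first state change was seen */
    uint64_t first_state; /* timestamp of the first state change, in ns */
    uint64_t last_draw;   /* timestamp of the most recent draw call, in ns */
};

void frame_timer_begin_frame(struct frame_timer *t)
{
    t->started = false;
    t->first_state = 0;
    t->last_draw = 0;
}

/* Call on every state change; only the first one per frame is recorded. */
void frame_timer_state_change(struct frame_timer *t, uint64_t ts_ns)
{
    if (!t->started) {
        t->started = true;
        t->first_state = ts_ns;
        t->last_draw = ts_ns;
    }
}

/* Call after every draw call; the latest timestamp wins. */
void frame_timer_draw(struct frame_timer *t, uint64_t ts_ns)
{
    if (!t->started)                 /* draw without a preceding state change */
        frame_timer_state_change(t, ts_ns);
    assert(ts_ns >= t->first_state); /* timestamps must be monotonic */
    t->last_draw = ts_ns;
}

/* Draw time in milliseconds; query it before the buffer swap. */
double frame_timer_draw_ms(const struct frame_timer *t)
{
    if (!t->started)
        return 0.0;
    return (double)(t->last_draw - t->first_state) / 1e6;
}
```

The swap is simply never reported to the timer, which is what keeps the serializing wait for the GPU out of the number.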

Here are the osgviewer draw times compared with fglrx, using my favourite test
model (fixed function), which is fairly representative of FlightGear usage:

R520
fglrx          ~0.7ms
r300g, git     ~1.6ms

The profiling picture in my head is that r300g still spends a significant
amount of CPU time in current state attribute handling, which too often loops
over all possible state attributes. BTW: that was much worse before Fancesco's
last copy-to-current patches. r300g also spends much time in the draw path in
Mesa, where every draw loops over all 32 state attributes.
Doing some proof-of-concept work on these code paths improved the draw times
to 1.2ms on r300g.
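As an illustration of the kind of hot-path change meant here (not the actual proof-of-concept patch, which is not shown), a minimal C sketch: instead of visiting all 32 attribute slots on every draw, keep a bitmask of enabled and dirty attributes and iterate only over its set bits. All struct and function names are hypothetical; `__builtin_ctz` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ATTRIBS 32

/* Hypothetical attribute bookkeeping; the names are illustrative only. */
struct attrib_state {
    uint32_t enabled;      /* bit i set: attribute i is enabled */
    uint32_t dirty;        /* bit i set: attribute i changed since last draw */
    uint32_t emitted;      /* bit i set: attribute i was re-emitted */
    unsigned upload_count; /* number of attribute emissions performed */
};

/* Naive: touches all 32 slots on every draw call. */
void emit_attribs_naive(struct attrib_state *s)
{
    for (unsigned i = 0; i < MAX_ATTRIBS; i++) {
        uint32_t bit = 1u << i;
        if ((s->enabled & bit) && (s->dirty & bit)) {
            s->emitted |= bit;     /* stand-in for emitting attribute i */
            s->upload_count++;
        }
    }
    s->dirty = 0;
}

/* Bitmask: only visits attributes that are both enabled and dirty. */
void emit_attribs_masked(struct attrib_state *s)
{
    uint32_t mask = s->enabled & s->dirty;
    while (mask) {
        unsigned i = (unsigned)__builtin_ctz(mask); /* index of lowest set bit */
        assert(i < MAX_ATTRIBS);
        mask &= mask - 1;          /* clear the lowest set bit */
        s->emitted |= 1u << i;     /* stand-in for emitting attribute i */
        s->upload_count++;
    }
    s->dirty = 0;
}
```

Both functions produce the same result; with only a few attributes enabled, the masked loop runs a handful of iterations per draw call instead of 32.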
The next CPU hog for r300g is the kernel side of the command stream parser. I
would expect that something which makes use of pre-evaluated and validated
command stream snippets in the kernel, held for each of the driver's state
objects and simply referenced from the executed command stream, would help a
lot here. Something along the lines of recording command stream
macros/substreams that are just jumped into when executing the user-level
command stream. I believe that Jerome gave a talk about something very similar
at this year's FOSDEM.
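A rough C sketch of the validate-once idea, under stated assumptions: all names are hypothetical, the "parser" merely rejects one made-up illegal dword, and a real kernel implementation would also have to keep each snippet in memory that userspace cannot modify after validation (otherwise the check could be bypassed).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: a state object's command snippet, validated once and then
 * referenced ("jumped into") by many command streams. */
struct cs_snippet {
    const uint32_t *dwords;
    size_t          ndw;
    bool            validated;
};

unsigned parse_calls = 0; /* counts full parse/validate passes */

/* Stand-in for the expensive kernel-side parser/checker: here it only
 * rejects a made-up "illegal" dword value. */
static bool parse_and_validate(const uint32_t *dw, size_t n)
{
    parse_calls++;
    for (size_t i = 0; i < n; i++)
        if (dw[i] == 0xDEADBEEFu)  /* hypothetical illegal packet */
            return false;
    return true;
}

/* Validate once, when the state object is created. */
bool snippet_create(struct cs_snippet *s, const uint32_t *dw, size_t n)
{
    s->dwords = dw;
    s->ndw = n;
    s->validated = parse_and_validate(dw, n);
    return s->validated;
}

/* Submit a stream that references snippets: only pre-validated snippets
 * are accepted, and none of them is parsed again. */
bool submit(struct cs_snippet *const *refs, size_t nrefs)
{
    for (size_t i = 0; i < nrefs; i++)
        if (!refs[i]->validated)
            return false;
    /* a real implementation would now emit jumps/indirect buffers */
    return true;
}
```

The point is that `parse_calls` grows with the number of state objects created, not with the number of times they are referenced in submitted streams.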

Translating those performance numbers from an example application to a more
real-world one like FlightGear gives a framerate of ~85 fps for fglrx and
~60 fps with current Mesa. With the proof-of-concept work I have already seen
65-70 fps on r300g.

Now the picture for r600g:

RV770
fglrx          ~0.8ms
r600g, git     5-7ms

As you can see, fglrx is about the same, but r600g is far off.
Also, with r600g the driver spends about as much time parsing and validating
in the kernel as it spends in the r600g backend code.

I do not remember the FlightGear framerates for RV770 with fglrx, but I believe
they were comparable to the R520 ones; with r600g I still see only about
20-30 fps. This proof-of-concept work does not show up in r600g in a
noticeable way, since that driver is dominated by its own backend CPU
cycles.

So I cannot contribute to the discussion of which state objects are more
heavily used, but looking at the above I see that r300g is already at a stage
where it makes a lot of sense to improve some hot paths in Mesa's top layer.
The r300 userspace backend code is visible, but not high, in profiles.
But r600g, using the same Mesa/Gallium infrastructure above it, spends many CPU
cycles in its userspace as well as in the kernel parser/validator code.
This makes me wonder: what is the fundamental difference between these two
backends that accounts for this gap?

Just my 2 cents

Mathias