[Mesa-dev] [PATCH 11/29] mesa: Use bitmask/ffs to iterate enabled lights for building ff shader keys.

Fri May 27 09:59:01 UTC 2016

Hi,

On Wednesday, May 25, 2016 12:06:02 you wrote:
> On Tue, May 24, 2016 at 8:42 AM,  <Mathias.Froehlich at gmx.net> wrote:
> > From: Mathias Fröhlich <mathias.froehlich at web.de>
> >
> > Replaces a loop that iterates all lights and test
> > which of them is enabled by a loop only iterating over
> > the bits set in the enabled bitmask.
> 
> This takes the code from something very obvious and easy to follow to
> something you'll have to think twice about the correctness of...

IMO that is a matter of whether you are used to it or not.
When I first saw this some years ago I had to look twice as well,
but I tend to take something like that as an opportunity to learn a
new pattern that I can apply where appropriate.

> Since MAX_LIGHTS is 8, this seems a bit like a premature optimization
> to me. Does this patch yield any measurable improvement in speed in
> any real-world applications?

Depends on what you call real world. I mostly looked at osgviewer with
some of the models I have around here. So no, nobody uses such a model
purely with osgviewer in their daily work. But this is part of what you see
in, for example, flightgear or closed applications for similar purposes that
I know of.
Now, I get an improvement from about 750 to 800 fps with at least
one model in osgviewer, a fairly representative favorite model that I
often use.
Do you call this real world if I care about 750+50 fps when I can't see all
the pictures being drawn because of the display frequency?
I would say 'kind of', because this shows that for cpu bound models
we can save some time on the cpu.
And the final real world application, where you typically display plenty
of such models, may then be able to display more of these models in a
single scene without risking frame drops at a stable frame rate.

Now you may argue that this is just a bad application or a bad model in a
weakly optimizing application or both.
Yes, kind of, I agree, but you can't always change the applications.

If I look at the draw and gpu times in OpenSceneGraph based applications
over the years, I observe that the closed source nvidia driver needs
incredibly little cpu time to schedule the draws for the gpu. The same model
on a comparable machine with a comparable amd/ati card is usually
2-3 times slower in terms of draw time on the cpu with the amd closed
source driver. And then there are the mesa based drivers, which are usually
far behind that.

Over the past few years, the intel driver backend has gained some
speed, not only but also in this regard. So among the machines I can look
at every now and then, that is where mesa has really improved the most
(Great work guys, thanks!). If you now look at the profiles,
you can see today a fair number of mesa core functions beside
the driver backend functions. Here I am talking about zooming
into the application of interest and then into i965_dri.so
using perf. With this series applied, fewer of the mesa core
functions show up that high in the profiles.
Does that result in huge performance gains? No, not huge. For i965_dri
I would claim that the ball is then back in the driver backend's court.
But once the intel guys play that ball, we finally get more improvements.

Then there are other drivers too. Some of them will not see any measurable
improvement because the backend still eats up most of the cpu time. I don't
know how large that share is for each of them.

I cannot attach a measured improvement to each of the individual changes.
It is more like: if I apply them all, I do observe improvements. So this
pattern seems beneficial, and I apply it in places where I can see
that it may potentially help.

The series as such is part of a larger fight against O(#max possible number)
loops in the fast path of draws in core mesa. But what I have there still
needs cleanup and a proper split into a series that can be reviewed. With that
additional unpublished proof of concept hackery here I get another frame rate
improvement of 50-100fps on top of the 800fps I already have. Or if you want
to put that in more positive words, you then get another 10% faster cpu
side draw times.

So, why the patch series:
Today, on my private notebook, I observe profile results that encourage
looking again at core mesa functions. For a fair number of the mesa
functions visible in these profiles, I have an idea how to push
them down. This idea is already used a lot in gallium (I learned
that one mail ago) and to some extent in core mesa; it relies on a pattern
that is not that widely known but is not too bad either.
That pattern lets loop complexities go down
from O(#max possible number) to O(#actual number), where typically
'#actual number' is much smaller than '#max possible number'. And I can see
the resulting improvement in the profiles. And this even shows up
visibly in an application.
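
For readers who have not met the pattern yet, here is a minimal sketch of
the two loop shapes, assuming a plain 'unsigned' bitmask with one bit per
light and the POSIX ffs() from <strings.h>. The struct and function names
are made up for illustration and are not the actual Mesa code; only
MAX_LIGHTS = 8 is taken from the discussion above.

```c
#include <strings.h>   /* ffs(), POSIX */

#define MAX_LIGHTS 8   /* value quoted earlier in this thread */

/* Hypothetical stand-in for per-light state, not Mesa's real struct. */
struct light {
    int enabled;
};

/* O(#max possible number): visit every slot and test its enabled flag.
 * Writes the indices of enabled lights to out[] and returns their count. */
static int visit_all(const struct light *lights, int *out)
{
    int n = 0;
    for (int i = 0; i < MAX_LIGHTS; i++) {
        if (lights[i].enabled)
            out[n++] = i;
    }
    return n;
}

/* O(#actual number): visit only the bits that are set in the mask.
 * Bit i of mask is assumed to be set exactly when light i is enabled. */
static int visit_enabled(unsigned mask, int *out)
{
    int n = 0;
    while (mask) {
        int i = ffs((int)mask) - 1;  /* index of the lowest set bit */
        mask &= ~(1u << i);          /* clear it so the loop terminates */
        out[n++] = i;
    }
    return n;
}
```

With lights 1 and 6 enabled, both functions report the same two indices in
the same order, but the second loop body runs twice instead of eight times,
and not at all when no light is enabled.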

To sum up: I have an opportunity for an improvement where
I can see that theory matches practical observations.
For me this is a safe bet.

Thanks

Mathias