[Mesa-dev] [PATCH 11/29] mesa: Use bitmask/ffs to iterate enabled lights for building ff shader keys.

Fri May 27 10:38:39 UTC 2016

On Fri, May 27, 2016 at 11:59 AM, Mathias Fröhlich
<Mathias.Froehlich at gmx.net> wrote:
>
> Hi,
>
> On Wednesday, May 25, 2016 12:06:02 you wrote:
>> On Tue, May 24, 2016 at 8:42 AM, <Mathias.Froehlich at gmx.net> wrote:
>> > From: Mathias Fröhlich <mathias.froehlich at web.de>
>> >
>> > Replaces a loop that iterates all lights and test
>> > which of them is enabled by a loop only iterating over
>> > the bits set in the enabled bitmask.
>
>> This takes the code from something very obvious and easy to follow to
>> something you'll have to think twice about the correctness of...
>
> IMO that is a matter of if you are used to that or not.
> When I saw this first now some years ago I had to look twice also,
> but I tend to take something like that as an opportunity to learn a
> new pattern that I can apply where appropriate.

I tend to disagree. There's more steps involved, and more things that
could be wrong, no matter how experienced you are. We're all human,
and mistakes do happen.

Besides, It's not like I haven't seen the pattern before; I've been
doing high-performance programming for the better part of 20 years.
But in that time, I've also learned that clever tricks come at a cost,
and you need to be careful in evaluating when that cost is worthwhile.

I'm not saying your changes aren't worthwhile, but without having some
(reproducable) numbers to back the claims up, it *might* be safer to
be conservative and not apply the patches.

>> Since MAX_LIGHTS is 8, this seems a bit like a premature optimization
>> to me. Does this patch yield any measurable improvement in speed in
>> any real-world applications?
>
> Depends on what you call real world. I did mostly look at osgviewer with
> some of the models I have here around. So, no nobody uses this model
> just purely with osgviewer in its daywork. But this is part of what you see
> in for example flightgear or closed applications for similar purposes I know
> of.
>
> Now, I get an improvement of about 750 to 800 frames with at least
> one model in osgviewer - well kind of representative favorite model I
> often use.
>
> Do you call this real world if I care about 750+50 fps when I cant see all
> the pictures being drawn because of the display frequency?
> I would say 'kind of'. Because this shows that for cpu bound models,
> we can save some time on the cpu.

If you're saying this patch alone takes a known work-load from ~750
fps to ~800 fps for viewing some model, I'd say that's worth it.
That's going from 1.33 ms to 1.25 ms per frame, a saving of about 6-7%
of CPU time. That's a big gain.

But if you're saying this whole patch-series gives that gain, then I'd
like to see performance break-down of the whole series; which of the
patches yield the biggest gain, and look into cherry-picking only the
ones with the most bang for the bucks.

> And the final real word application, where you typically display plenty
> of such models, may just be able to display more of these models in a
> single scene without taking the risk of frame drops at stable frame rate.
>
> Now you may argue that this is just a bad application or a bad model in a
> weakly optimizing application or both.

I'm not.

> Over the past few years, the intel driver backend have gained some
> speed not limited to but also in this regard. So that has really
> improved most in mesa among the machines I can look at every now and
> then (Great work guys Thanks!). If you now look at the profiles,
> you can see today a fair amount of mesa core functions beside
> the driver backend functions. Well I currently talk about zooming
> into the application of interest and then the i965_dri.so
> using perf. Having applied this series you see less
> of the mesa core functions that high in the profiles.
> Does that result in huge performance gains? No, not huge. For i965_dri
> I would claim that the ball is then back in the driver backends ballpark.
> But when the intel guys are playing that ball we get finally more
> improvements.
>
> Then there are other drivers too. Some of them will not see any measurable
> improvement because the backend still eats up most of the cpu time. I don't
> know which of them do what amount.
>
> I cannot place improvement tags on each of the individual changes.
> This is more like: If I do them all, I do observe improvements.

That sounds strange. Why can't you? Does really *every* one of the
patches need to be in place for there to be an improvement? This
sounds unlikely to me...

I can understand if you don't want to do 29 separate measurements, but
there must be some possible middle ground, no? Perhaps grouping them a
bit logically and try to measure each groups impact?

You say above that you have perf-measurements that shows numbers going
down, how about sharing those?

> So this pattern seems beneficial and I apply that to places where I can
> potentially see that they may help.
>
> The series as such is part of more fight against O(#max possible numbers)
> loops in the fast path of draws in core mesa. But what I have there still
> needs cleanup and a proper split into a series that can be reviewed. With
> that
> additional unpublished proof of concept hackery here I get an other frame
> rate
> improvement of 50-100fps at already 800fps then. Or if you want to
> put that in more positive words, than you get an other 10% faster cpu
> side draw times.
>
> So, why the patch series:
> I do today observe on my private notebook profile results that encourage
> looking again at core mesa functions. For a fair amount of the mesa
> functions being visible in these profiles, I have an idea how to push
> them down. This idea is already used a lot in gallium (I have learned
> that a mail ago) and somehow in core mesa, but it contains a pattern that
> is not that widely known but not too bad.
> That pattern enables loop complexities to go down
> from O(#max possible number) to O(#actual number) where typically
> '#actual number' is much smaller than '#max possible number'. And I can see
> the result in terms of improvement in the profiles. And even this appears
> visibly in an application.

Don't get me wrong, I do applaud your effort here. I'm just not
convinced by the patches alone. Some numbers would be nice.

And if this is indeed needed, perhaps something can be done to reduce
the risk of mistakes. Perhaps we could introduce some kind of
bitset_foreach() macro to reduce the number of places this convoluted
logic is needed?

For the record, the explanation above does tell me that there's more
substance to these patches than I initially thought.

> This sums up in: I have an opportunity for an improvement where
> I can just see that theory matches practical observations.
> For me this is a save bet.
>
> Thanks
>
> Mathias

Thanks for explaining!