[Mesa-dev] [PATCH 11/29] mesa: Use bitmask/ffs to iterate enabled lights for building ff shader keys.

Mathias Fröhlich Mathias.Froehlich at gmx.net
Sat May 28 10:31:46 UTC 2016


Hi Erik,

On Friday, May 27, 2016 12:38:39 Erik Faye-Lund wrote:
> >> This takes the code from something very obvious and easy to follow to
> >> something you'll have to think twice about the correctness of...
> >
> > IMO that is a matter of if you are used to that or not.
> > When I saw this first now some years ago I had to look twice also,
> > but I tend to take something like that as an opportunity to learn a
> > new pattern that I can apply where appropriate.
> 
> I tend to disagree. There's more steps involved, and more things that
> could be wrong, no matter how experienced you are. We're all human,
> and mistakes do happen.
> 
> Besides, it's not like I haven't seen the pattern before; I've been
> doing high-performance programming for the better part of 20 years.
> But in that time, I've also learned that clever tricks come at a cost,
> and you need to be careful in evaluating when that cost is worthwhile.
> 
> I'm not saying your changes aren't worthwhile, but without having some
> (reproducible) numbers to back the claims up, it *might* be safer to
> be conservative and not apply the patches.
So, yes, that basic 'keep it plain and simple' attitude is something that
I support a lot.

HPC, nice, done so - been there, brought a hand-built cluster into the
Top 500 somewhere around the beginning of this century. Starting with
software that abused gcc as a macro assembler and did trial-and-error
dgemm loops on the machine to find optimal block sizes. Today
atlas does this somehow, but the tool available back then
had a different name. Then hand-tuning the result of that at the
C and assembly level. Fun project back then. Today I work with
computational fluid dynamics in the day job, also with big pieces
of software. So I do clearly see the trade-off you are talking about.

Anyhow, the general problem that is solved by this pattern is:
You have a non-trivial number of items of something. None to very few
of them are active/enabled in the common case, but to evaluate these items
you need to handle the general case.
Now, what are the options:
1. Look at all of them and check for enabled.
2. Build up linked lists of the enabled items.
3. Maintain some upper bound for the enabled items.
4. The bitmasks.

What else could we do for this case? Can you add more?

Option #1 is the simplest. I take this for everything non-performance-critical
with a small fixed count. No comment is required in the
code as it is just obvious. Nothing but the primary enabled value needs to
be maintained - so this is the least error prone.

If the count gets larger, or the maximum is a user-given value,
then I most often go for the linked-list option #2. The pro is that you walk
exactly the items of interest. The con is that you need to maintain the list:
you need two more pointers per item, and you need a struct that you can put
into the linked list.

Option #3 is not really on my plate. Mesa does this with texture units
in some way; look at _MaxEnabledTexImageUnit. The pro is that it's almost as
simple as #1. The con is that you still need to maintain a value in addition
to the primary value. The biggest con is that, for our setup, nobody really
enforces that an application has to start with GL_CLIP_PLANE0. So you may
need to walk all the disabled items because only the last plane is enabled.

Then there is the bitmask. First point in the decision tree: is ffs really
O(1), or is it O(something(#bits))? The latter tells me not to use this
pattern, as you would blow up O(n) option-#1 loops into something less obvious
with at least O(something(#bits)) per step. Here 'something' may be log if ffs
is implemented sensibly, or even O(n) when done naively. That would result in
no to little gain for something less obvious (your argument from above).
From now on, let's assume that we have an O(1) ffs. I could then ask:
does the required number of bits fit into a cpu register? Roughly,
are we in the GLbitmask{,64} case or in the src/util/bitset.h case in terms
of mesa? The second case may still be beneficial, but already ranks
down for my taste, as an ffs on the bitset is O(log(n)). You could carry along
the previous iteration index to avoid this for the typical loop case we mostly
need to handle, but that's even more complexity.
Ok, assume the bitmask fits into a cpu register. The con is still that you
need to maintain the mask in addition to your primary value, together with
your argument about less obvious loops. The pro is that you walk exactly the
items you are after.
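To make that maintenance cost concrete, here is a minimal sketch (hypothetical names, not actual Mesa code) of keeping such a mask in sync with the per-item flag it shadows - forgetting one of the two updates is exactly the hazard:

```c
#include <assert.h>

/* Hypothetical sketch, not Mesa code: an enabled bitmask kept in sync
 * with the per-item boolean it shadows. */
struct item_set {
   struct { int enabled; } item[32];
   unsigned enabled_mask; /* bit i is set iff item[i].enabled */
};

static void
set_enabled(struct item_set *s, int i, int enabled)
{
   s->item[i].enabled = enabled;
   if (enabled)
      s->enabled_mask |= 1u << i;   /* set bit i */
   else
      s->enabled_mask &= ~(1u << i); /* clear bit i */
}
```

Every code path that toggles the primary value has to go through such a helper, otherwise mask and flags silently diverge.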

And there is another gain that can be exploited when using bitmasks.
There is only little use of that in the current series, probably more in
the series past that.
You can easily combine two such values into a single one:

for (int i = 0; i < MAX; ++i) {
  if (!item[i].enabled)
    continue;
  if (!item[i].something_else)
    continue;
  ...
}

turns into:

unsigned mask = enabled_mask & something_else_mask;
while (mask) {
 ...
}

I have not yet taken the effort to look for such opportunities in the current
series, but there may be some. Sure, you can go back to option #2 and introduce
a separate linked list for this use case, but that's additional maintenance of
the list, which is more error prone.
The above is also really flexible in its use: say you are interested in items
that are enabled but not something_else, that would just read:

unsigned mask = enabled_mask & ~something_else_mask;
while (mask) {
 ...
}
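To make the elided loop bodies concrete, here is a self-contained sketch of such a combined-mask walk (the helper name and the out-array are invented for illustration; the ffs/clear steps match the pattern discussed in this thread):

```c
#include <assert.h>
#include <strings.h> /* ffs() */

/* Hypothetical illustration: collect the indices set in both masks,
 * using the ffs/clear pattern from the loops above. */
static int
walk_combined(unsigned enabled_mask, unsigned something_else_mask,
              int out[32])
{
   unsigned mask = enabled_mask & something_else_mask;
   int n = 0;
   while (mask) {
      const int i = ffs(mask) - 1; /* index of the lowest set bit */
      mask ^= 1u << i;             /* clear it for the next iteration */
      out[n++] = i;                /* "process" item i */
   }
   return n;
}
```

The `&` (or `& ~`) up front replaces per-item condition checks inside the loop, so only the items of interest are ever visited.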

Ok, why this long story:
I want to show that I *do* share your concerns. The above is not
complete, but tell me if there is an important point that I left out.

I always found it ugly in terms of maintainability that we have the general
pattern:

while (mask) {
   const int i = ffs(mask) - 1;
   struct item *s = &container->item[i];
   mask ^= (1u << i);
   ...
}

So the xor mask handling is non-local to the place where we compute i.
Brian pointed me to the u_bit_scan function heavily used in gallium. That
definitely helps with the basic problem I personally have with this type of
loop. Does that also somehow help with your concerns?
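For readers who don't have the gallium tree at hand, u_bit_scan roughly looks like the sketch below (a from-memory sketch; see the actual implementation in Mesa's gallium util headers for the authoritative version). It keeps the mask update local to the index computation:

```c
#include <assert.h>
#include <strings.h> /* ffs() */

/* Sketch of gallium's u_bit_scan helper; the real Mesa implementation
 * may differ in detail. Returns the index of the lowest set bit and
 * clears that bit, so the loop body never touches the mask itself. */
static inline int
u_bit_scan(unsigned *mask)
{
   const int i = ffs(*mask) - 1;
   *mask &= ~(1u << i);
   return i;
}
```

A loop then reads `while (mask) { const int i = u_bit_scan(&mask); ... }`, with no mask bookkeeping inside the body.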

> >> Since MAX_LIGHTS is 8, this seems a bit like a premature optimization
> >> to me. Does this patch yield any measurable improvement in speed in
> >> any real-world applications?
> >
> > Depends on what you call real world. I did mostly look at osgviewer with
> > some of the models I have here around. So, no nobody uses this model
> > just purely with osgviewer in its daywork. But this is part of what you see
> > in for example flightgear or closed applications for similar purposes I know
> > of.
> >
> > Now, I get an improvement of about 750 to 800 frames with at least
> > one model in osgviewer - well kind of representative favorite model I
> > often use.
> >
> > Do you call this real world if I care about 750+50 fps when I can't see all
> > the pictures being drawn because of the display frequency?
> > I would say 'kind of'. Because this shows that for cpu bound models,
> > we can save some time on the cpu.
> 
> If you're saying this patch alone takes a known work-load from ~750
> fps to ~800 fps for viewing some model, I'd say that's worth it.
> That's going from 1.33 ms to 1.25 ms per frame, a saving of about 6-7%
> of CPU time. That's a big gain.

Well, I don't feel that this is too big in the general case.
The reason why I am after this is that we work in a library
context where we cannot know how we are getting used. For some use cases
that I have observed, it helps. The use cases are legal, and at least I
observe them with code that I have not written myself. So to me this is
an indication of 'happens in the real world'.

> But if you're saying this whole patch-series gives that gain, then I'd
> like to see performance break-down of the whole series; which of the
> patches yield the biggest gain, and look into cherry-picking only the
> ones with the most bang for the bucks.

I cannot afford doing hard tests with all the variants. If you require this,
it's just not going to happen. I am really sorry.

So, why do I talk about hard tests:
I have only been looking at roughly 5 cases with osgviewer.
Looking at these numbers is easy if you interactively start up the app.
You already get some variation in the results. So to get hard numbers here
I would need to hack up osgviewer to grab the numbers and do statistics on
them to gain *hard* numbers.
I am a mathematician, so if I call a number hard I need to have ruled out
everything else, and I need to be able to put a q.e.d. square behind the
sentence.

Cherry-picking just the ones where I measure a gain will immediately rule out
all the changes touching driver backends I have no hardware for - obviously
I will measure no benefit there. But some of the changes in the driver backends
rely on the general availability of the bitmask. So what is the suggested
approach then? Maintain the linked list of lights as well as the bitmask
because I cannot motivate the change to r200?

Also, I am pretty sorry, but I have not measured the clip planes at all.
I am extrapolating that if it helps for lights, then the clip plane changes
are in the same ballpark. I am quite confident it is like that - for cases
that use clip planes.

I would even argue that the pattern as such is still somewhat simple,
and the more you work with it, the more you get trained in it.
I know you can also turn that around and say: the more you need to work
with it, the more problems you introduce based on it.
... but think about the former at least.

> > Over the past few years, the intel driver backend has gained some
> > speed, not limited to but also in this regard. So that has really
> > improved most in mesa among the machines I can look at every now and
> > then (great work, guys, thanks!). If you now look at the profiles,
> > you can see today a fair amount of mesa core functions beside
> > the driver backend functions. Well I currently talk about zooming
> > into the application of interest and then the i965_dri.so
> > using perf. Having applied this series you see less
> > of the mesa core functions that high in the profiles.
> > Does that result in huge performance gains? No, not huge. For i965_dri
> > I would claim that the ball is then back in the driver backends ballpark.
> > But when the intel guys are playing that ball we get finally more
> > improvements.
> >
> > Then there are other drivers too. Some of them will not see any measurable
> > improvement because the backend still eats up most of the cpu time. I don't
> > know which of them do what amount.
> >
> > I cannot place improvement tags on each of the individual changes.
> > This is more like: If I do them all, I do observe improvements.
> 
> That sounds strange. Why can't you? Does really *every* one of the
> patches need to be in place for there to be an improvement? This
> sounds unlikely to me...
> 
> I can understand if you don't want to do 29 separate measurements, but
> there must be some possible middle ground, no? Perhaps grouping them a
> bit logically and try to measure each groups impact?
Currently they are grouped logically to convert one item at a time. Some of the
changes depend on each other.



> You say above that you have perf-measurements that shows numbers going
> down, how about sharing those?

Yes. But keep in mind this is one snapshot that I let accumulate
until the numbers have settled.

Before:

   3.61%  [.] vbo_save_playback_vertex_list
   3.02%  [.] brw_upload_render_state
   2.45%  [.] brw_draw_prims
   1.55%  [.] vbo_exec_copy_to_current
   1.31%  [.] calculate_attr_overrides
   1.25%  [.] brw_merge_inputs
   1.08%  [.] intel_batchbuffer_require_space
   1.03%  [.] upload_sf_state
   0.99%  [.] _mesa_load_state_parameters
   0.94%  [.] execute_list.part.16
   0.85%  [.] brw_prepare_vertices
   0.83%  [.] upload_clip_state
   0.82%  [.] vbo_exec_FlushVertices_internal
   0.70%  [.] brw_emit_vertices
   0.68%  [.] vbo_exec_wrap_upgrade_vertex
   0.62%  [.] hash_table_search
   0.57%  [.] _mesa_get_fixed_func_vertex_program
   0.55%  [.] vbo_save_BeginCallList
   0.52%  [.] _mesa_get_fixed_func_fragment_program
   0.48%  [.] brw_search_cache
   0.45%  [.] _mesa_hash_data
   0.43%  [.] _mesa_update_state_locked
   0.40%  [.] vbo_Materialfv
   0.40%  [.] upload_sbe_state
   0.35%  [.] set_add

After:

   3.78%  [.] brw_upload_render_state
   3.34%  [.] vbo_save_playback_vertex_list
   2.61%  [.] brw_draw_prims
   1.75%  [.] calculate_attr_overrides
   1.27%  [.] intel_batchbuffer_require_space
   1.16%  [.] upload_sf_state
   1.11%  [.] _mesa_load_state_parameters
   1.07%  [.] brw_prepare_vertices
   0.99%  [.] vbo_exec_copy_to_current
   0.97%  [.] upload_clip_state
   0.94%  [.] brw_merge_inputs
   0.86%  [.] vbo_exec_wrap_upgrade_vertex
   0.81%  [.] brw_emit_vertices
   0.69%  [.] execute_list.part.16
   0.66%  [.] hash_table_search
   0.60%  [.] brw_search_cache
   0.55%  [.] _mesa_get_fixed_func_fragment_program
   0.52%  [.] _mesa_update_state_locked
   0.49%  [.] _mesa_hash_data
   0.48%  [.] upload_sbe_state
   0.47%  [.] vbo_Materialfv
   0.43%  [.] gen7_upload_constant_state
   0.42%  [.] _mesa_search_program_cache
   0.42%  [.] compute_light_positions.part.1
   0.40%  [.] set_add


Note that the fps in this comparison went from ~730 to ~800 fps for this run.
So we have a real improvement here. This means brw_upload_render_state is
higher in the profile because it now takes a larger fraction of the smaller
total, not because something in the invalidate logic has pushed more state
invalidations onto the driver backend. And if this change had also broken
some of the update logic, the potential improvement would be even bigger.

I may need to note that I get no regressions with the piglit quick tests
on radeonsi, i965 and swrast.

Ok:
vbo_save_playback_vertex_list is more difficult to push further down. That
would be the most beneficial, but it has to wait for some preparations.
vbo_exec_copy_to_current went down from 1.55% to 0.99%.
vbo_exec_FlushVertices_internal vanished from the top 25 completely.
That is due to the vertex attribute array tracking. For
vbo_exec_FlushVertices_internal it's probably reset_attrv that gets inlined
there.
vbo_exec_wrap_upgrade_vertex is higher, as it is probably no longer being
triggered on its improved path, and so the same absolute time is now more
relative time.

Then there is _mesa_get_fixed_func_vertex_program, which also vanishes from
the top 25 completely. I believe it is now somewhere at 0.35% if you scroll
down. The reason is most probably that we no longer walk all the
lights/textures. And the texture part depends on the point coord replace
change as well as the light bitmask change, as this enables combining all the
texture-related masks together with the light mask when iterating the 'unit'
in the shader key.

Back to your concern: following your argument, I also left out the shader
creation loops. The changes are only applied to shader hash key creation,
which happens basically always, whereas the creation of the shaders happens
once per context - so I don't care about that.

Nothing for our current discussion, but maybe interesting if somebody else
reads this: a common long-term observation is that
_mesa_load_state_parameters is traditionally high in profiles.

> And if this is indeed needed, perhaps something can be done to reduce
> the risk of mistakes. Perhaps we could introduce some kind of
> bitset_foreach() macro to reduce the number of places this convoluted
> logic is needed?
Look at the gallium function I mentioned above, which Brian pointed me to.
What do you think?

> For the record, the explanation above does tell me that there's more
> substance to these patches than I initially thought.
That is what I wanted to get across ...

> Thanks for explaining!
You are welcome.

Greetings

Mathias