[Mesa-dev] Very low framerate when recording desktop content in Weston using mesa git on Radeon 5770 (glReadPixels slow path)

Rune Kjær Svendsen runesvend at gmail.com
Tue Mar 26 10:26:38 PDT 2013


(I'm re-sending this message because the attachment was too large for
mesa-dev, and because I want to add wayland-devel CC. The valgrind output
can be found here: http://runeks.dk/files/callgrind.out.11362).

Seems like you are right Pekka.

I just ran weston through valgrind, and got some interesting results. I ran
it like so:

valgrind --tool=callgrind --dump-instr=yes --trace-jump=yes weston

Which allows me to get the time spent on a per-instruction level. Now, this
is running inside a virtual machine, so it won't be the same as running it
natively, but it agrees with the other benchmarks in the sense that it
suggests it is the simple calculations inside screenshooter.c that take up
most of the CPU time, not calls to outside functions (like it was before
with the slow glReadPixels() path).

The callgrind output can be found at the following URL:
http://runeks.dk/files/callgrind.out.11362
Open it with KCachegrind, select the function
weston_recorder_frame_notify(), and go to the "Machine Code" tab in the
lower right corner to see the interesting stuff.

According to callgrind, a total of 54.39% CPU time is used in the four
lines 251, 252, 253 and 255 in screenshooter.c. That's the function
component_delta():

dr = (next >> 16) - (prev >> 16);
 dg = (next >>  8) - (prev >>  8);
db = (next >>  0) - (prev >>  0);

 return (dr << 16) | (dg << 8) | (db << 0);

Additionally, the lines 358, 359, 361, 362, 363 take up 25.9% CPU time.
That is the innermost for-loop inside weston_recorder_frame_notify(). It's
the following lines with the call to component_delta() not included (line
360) since we've already included the CPU usage of that:

for (k = 0; k < width; k++) {
next = *s++;
delta = component_delta(next, *d);
 *d++ = next;
if (run == 0 || delta == prev) {
run++;

And then finally the call to output_run() on line 365 takes up 10.6% CPU
time:

p = output_run(p, prev, run);

Inside output_run(), the lines that take up the most CPU time are: 232,
233, 234, 235, 239, 240. They take up 9.94% CPU time. So it's basically the
whole while loop in output_run() except the line with the call to
__builtin_clz() (line 238):

while (run > 0) {
if (run <= 0xe0) {
*p++ = delta | ((run - 1) << 24);
 break;
}

i = 24 - __builtin_clz(run);
 *p++ = delta | ((i + 0xe0) << 24);
run -= 1 << (7 + i);
}

All of that adds up 90.89% CPU time spent inside screenshooter.c.

/Rune

On Tue, Mar 26, 2013 at 8:58 AM, Pekka Paalanen <ppaalanen at gmail.com> wrote:
On Tue, 26 Mar 2013 03:30:58 +0100
Rune Kjær Svendsen <runesvend at gmail.com> wrote:

> Marek, do you have an idea on where the currency bottleneck is?
>
> I just did a profiling with sysprof, zooming in on the desktop in Weston
> and moving the mouse wildly around, so that the buffer is completely
> changed for every frame. I got around 5 fps, which isn't *that* much, but
> still an order of magnitude better than without your patches.
>
> sysprof says there is 100% CPU usage, but unlike the previous 0.5-FPS
> recording, it's not in a single function, but spread out over several
> functions:
>
> 35% weston_recorder_frame_notify
> 11% __memcpy_ssse3
> 4.5% clear_page_c
> 4.3% output_run
>
> Although I'm not completely sure I'm reading the sysprof output right.
> weston_recorder_frame_notify, for example, has 35% CPU usage, but none of
> its child functions has any significant CPU usage. I presume the CPU usage
> in that function is from calling glReadPixels, although that's not
apparent
> from sysprof:
>
> weston_recorder_frame_notify                                     39.15%
>  39.15%
>   - - kernel - -                                                  0.00%
> 0.01%
>     ret_from_intr                                                 0.00%
> 0.01%
>       __irqentry_text_start                                       0.00%

Well, if you look at weston_recorder_frame_notify function, it has a
naive loop over each single pixel it processes. component_delta() may
get inlined, and output_run() you saw in the profile.

I think it's possible it actually is weston_recorder_frame_notify
eating the CPU. Can you get more precise profiling, like hits per source
line or instruction?


Thanks,
pq


Med venlig hilsen,

Rune Kjær Svendsen
Østerbrogade 111, 3. - 302
2100 København Ø
Tlf.: 2835 0726


On Tue, Mar 26, 2013 at 8:58 AM, Pekka Paalanen <ppaalanen at gmail.com> wrote:

> On Tue, 26 Mar 2013 03:30:58 +0100
> Rune Kjær Svendsen <runesvend at gmail.com> wrote:
>
> > Marek, do you have an idea on where the currency bottleneck is?
> >
> > I just did a profiling with sysprof, zooming in on the desktop in Weston
> > and moving the mouse wildly around, so that the buffer is completely
> > changed for every frame. I got around 5 fps, which isn't *that* much, but
> > still an order of magnitude better than without your patches.
> >
> > sysprof says there is 100% CPU usage, but unlike the previous 0.5-FPS
> > recording, it's not in a single function, but spread out over several
> > functions:
> >
> > 35% weston_recorder_frame_notify
> > 11% __memcpy_ssse3
> > 4.5% clear_page_c
> > 4.3% output_run
> >
> > Although I'm not completely sure I'm reading the sysprof output right.
> > weston_recorder_frame_notify, for example, has 35% CPU usage, but none of
> > its child functions has any significant CPU usage. I presume the CPU
> usage
> > in that function is from calling glReadPixels, although that's not
> apparent
> > from sysprof:
> >
> > weston_recorder_frame_notify                                     39.15%
> >  39.15%
> >   - - kernel - -                                                  0.00%
> > 0.01%
> >     ret_from_intr                                                 0.00%
> > 0.01%
> >       __irqentry_text_start                                       0.00%
>
> Well, if you look at weston_recorder_frame_notify function, it has a
> naive loop over each single pixel it processes. component_delta() may
> get inlined, and output_run() you saw in the profile.
>
> I think it's possible it actually is weston_recorder_frame_notify
> eating the CPU. Can you get more precise profiling, like hits per source
> line or instruction?
>
>
> Thanks,
> pq
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20130326/cdc67f58/attachment-0001.html>


More information about the mesa-dev mailing list