[Mesa-dev] Very low framerate when recording desktop content in Weston using mesa git on Radeon 5770 (glReadPixels slow path)
Marek Olšák
maraeo at gmail.com
Tue Mar 26 17:29:21 PDT 2013
I recommend using OpenMP for this kind of pixel processing,
specifically the "parallel for". It's pretty easy to use and you could
do wonders with it, i.e. taking advantage of all CPU cores on *any*
system. You could also offload the whole thing to another thread and
continue there.
Marek
On Tue, Mar 26, 2013 at 6:26 PM, Rune Kjær Svendsen <runesvend at gmail.com> wrote:
> (I'm re-sending this message because the attachment was too large for
> mesa-dev, and because I want to add wayland-devel CC. The valgrind output
> can be found here: http://runeks.dk/files/callgrind.out.11362).
>
> Seems like you are right Pekka.
>
> I just ran weston through valgrind, and got some interesting results. I ran
> it like so:
>
> valgrind --tool=callgrind --dump-instr=yes --trace-jump=yes weston
>
> Which allows me to get the time spent on a per-instruction level. Now, this
> is running inside a virtual machine, so it won't be the same as running it
> natively, but it agrees with the other benchmarks in the sense that it
> suggests it is the simple calculations inside screenshooter.c that take up
> most of the CPU time, not calls to outside functions (like it was before
> with the slow glReadPixels() path).
>
> The callgrind output can be found at the following URL:
> http://runeks.dk/files/callgrind.out.11362
> Open it with KCachegrind, select the function
> weston_recorder_frame_notify(), and go to the "Machine Code" tab in the
> lower right corner to see the interesting stuff.
>
> According to callgrind, a total of 54.39% CPU time is used in the four lines
> 251, 252, 253 and 255 in screenshooter.c. That's the function
> component_delta():
>
> dr = (next >> 16) - (prev >> 16);
> dg = (next >> 8) - (prev >> 8);
> db = (next >> 0) - (prev >> 0);
>
> return (dr << 16) | (dg << 8) | (db << 0);
>
> Additionally, the lines 358, 359, 361, 362, 363 take up 25.9% CPU time. That
> is the innermost for-loop inside weston_recorder_frame_notify(). It's the
> following lines with the call to component_delta() not included (line 360)
> since we've already included the CPU usage of that:
>
> for (k = 0; k < width; k++) {
> next = *s++;
> delta = component_delta(next, *d);
> *d++ = next;
> if (run == 0 || delta == prev) {
> run++;
>
> And then finally the call to output_run() on line 365 takes up 10.6% CPU
> time:
>
> p = output_run(p, prev, run);
>
> Inside output_run(), the lines that take up the most CPU time are: 232, 233,
> 234, 235, 239, 240. They take up 9.94% CPU time. So it's basically the whole
> while loop in output_run() except the line with the call to __builtin_clz()
> (line 238):
>
> while (run > 0) {
> if (run <= 0xe0) {
> *p++ = delta | ((run - 1) << 24);
> break;
> }
>
> i = 24 - __builtin_clz(run);
> *p++ = delta | ((i + 0xe0) << 24);
> run -= 1 << (7 + i);
> }
>
> All of that adds up 90.89% CPU time spent inside screenshooter.c.
>
> /Rune
>
> On Tue, Mar 26, 2013 at 8:58 AM, Pekka Paalanen <ppaalanen at gmail.com> wrote:
> On Tue, 26 Mar 2013 03:30:58 +0100
> Rune Kjær Svendsen <runesvend at gmail.com> wrote:
>
>> Marek, do you have an idea on where the currency bottleneck is?
>>
>> I just did a profiling with sysprof, zooming in on the desktop in Weston
>> and moving the mouse wildly around, so that the buffer is completely
>> changed for every frame. I got around 5 fps, which isn't *that* much, but
>> still an order of magnitude better than without your patches.
>>
>> sysprof says there is 100% CPU usage, but unlike the previous 0.5-FPS
>> recording, it's not in a single function, but spread out over several
>> functions:
>>
>> 35% weston_recorder_frame_notify
>> 11% __memcpy_ssse3
>> 4.5% clear_page_c
>> 4.3% output_run
>>
>> Although I'm not completely sure I'm reading the sysprof output right.
>> weston_recorder_frame_notify, for example, has 35% CPU usage, but none of
>> its child functions has any significant CPU usage. I presume the CPU usage
>> in that function is from calling glReadPixels, although that's not
>> apparent
>> from sysprof:
>>
>> weston_recorder_frame_notify 39.15%
>> 39.15%
>> - - kernel - - 0.00%
>> 0.01%
>> ret_from_intr 0.00%
>> 0.01%
>> __irqentry_text_start 0.00%
>
> Well, if you look at weston_recorder_frame_notify function, it has a
> naive loop over each single pixel it processes. component_delta() may
> get inlined, and output_run() you saw in the profile.
>
> I think it's possible it actually is weston_recorder_frame_notify
> eating the CPU. Can you get more precise profiling, like hits per source
> line or instruction?
>
>
> Thanks,
> pq
>
>
> Med venlig hilsen,
>
> Rune Kjær Svendsen
> Østerbrogade 111, 3. - 302
> 2100 København Ø
> Tlf.: 2835 0726
>
>
> On Tue, Mar 26, 2013 at 8:58 AM, Pekka Paalanen <ppaalanen at gmail.com> wrote:
>>
>> On Tue, 26 Mar 2013 03:30:58 +0100
>> Rune Kjær Svendsen <runesvend at gmail.com> wrote:
>>
>> > Marek, do you have an idea on where the currency bottleneck is?
>> >
>> > I just did a profiling with sysprof, zooming in on the desktop in Weston
>> > and moving the mouse wildly around, so that the buffer is completely
>> > changed for every frame. I got around 5 fps, which isn't *that* much,
>> > but
>> > still an order of magnitude better than without your patches.
>> >
>> > sysprof says there is 100% CPU usage, but unlike the previous 0.5-FPS
>> > recording, it's not in a single function, but spread out over several
>> > functions:
>> >
>> > 35% weston_recorder_frame_notify
>> > 11% __memcpy_ssse3
>> > 4.5% clear_page_c
>> > 4.3% output_run
>> >
>> > Although I'm not completely sure I'm reading the sysprof output right.
>> > weston_recorder_frame_notify, for example, has 35% CPU usage, but none
>> > of
>> > its child functions has any significant CPU usage. I presume the CPU
>> > usage
>> > in that function is from calling glReadPixels, although that's not
>> > apparent
>> > from sysprof:
>> >
>> > weston_recorder_frame_notify 39.15%
>> > 39.15%
>> > - - kernel - - 0.00%
>> > 0.01%
>> > ret_from_intr 0.00%
>> > 0.01%
>> > __irqentry_text_start 0.00%
>>
>> Well, if you look at weston_recorder_frame_notify function, it has a
>> naive loop over each single pixel it processes. component_delta() may
>> get inlined, and output_run() you saw in the profile.
>>
>> I think it's possible it actually is weston_recorder_frame_notify
>> eating the CPU. Can you get more precise profiling, like hits per source
>> line or instruction?
>>
>>
>> Thanks,
>> pq
>
>
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
More information about the mesa-dev
mailing list