More 16 vs 24 bpp profiling

Bernardo Innocenti bernie at codewiz.org
Tue Sep 11 10:03:07 PDT 2007


(adding xorg-devel@ on Cc)

On 09/11/2007 11:29 AM, Jordan Crouse wrote:
> On 11/09/07 13:05 +0200, Stefano Fedrigo wrote:
>> I've done some more profiling on the 16 vs. 24 bpp issue.
>> This time I used this test:
>> https://dev.laptop.org/git?p=sugar;a=blob;f=tests/graphics/hipposcalability.py
>>
>> A simple speed test: I measured the time required to scroll the whole
>> generated list down and up once.  Not extremely accurate, but I repeated
>> the test a few times with consistent results (+- 0.5 secs).  Mean times
>> in seconds:
>>
>> xserver 1.4
>> 16 bpp: 37.9
>> 24 bpp: 40.7
>>
>> xserver 1.3
>> 16 bpp: 46.4
>> 24 bpp: 50.1
>>
>> At 24 bpp we're a little slower.  1.3 is 20% slower than 1.4. The pixman
>> migration patch makes the difference: 1.3 spends most of that 20% in memcpy().
>>
>> The oprofile reports are from xserver 1.4.  I don't see much difference
>> between 16 and 24, except that at 24 bpp, less time is spent in pixman and more
>> in amd_drv.  At 16 bpp pixman_fill() takes twice the time.
>>
>> Unfortunately without a working callgraph it's not very clear to me what's
>> happening in amd_drv.  At 24 bpp gp_wait_until_idle() takes twice the time...
> 
> What can we do to fix this?  I would really like to know who is calling
> gp_wait_until_idle().

I think the gp_wait_until_idle() call in lx_get_source_color()
can safely go away, as exaGetPixmapFirstPixel() has always done
the correct locking, even in 1.3.

But because the 1x1 source pixmap used as the solid color is
still being uploaded to the framebuffer, I'd expect
exaGetPixmapFirstPixel() to indirectly call the driver's
download hook and thus stall the GPU anyway.
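
To make that concrete, here is a rough sketch of the pattern I
mean (get_source_color_sketch() is a made-up name, and the real
lx_get_source_color() certainly has more to it; this is just the
shape of the idea):

#include "exa.h"   /* exaGetPixmapFirstPixel() */

/* Sketch only, not the actual driver code: the explicit sync before
 * reading back the 1x1 solid source should be redundant, because
 * exaGetPixmapFirstPixel() already goes through EXA's access and
 * migration machinery, which syncs for us. */
static unsigned int
get_source_color_sketch(PixmapPtr pSrc)   /* hypothetical helper */
{
    /* gp_wait_until_idle();   <- the call that could go away */

    /* This may still stall the GPU indirectly: if the pixmap lives
     * in the framebuffer, EXA fetches the pixel via the driver's
     * DownloadFromScreen hook (or a synced direct read). */
    return exaGetPixmapFirstPixel(pSrc);
}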

If this tiny pixmap were at least reused, the second time
around it would already be in system memory.  And it does seem
that Cairo tries to cache patterns in the CR.

Problem is, many GTK widgets like to create a new CR on every
repaint event, which makes that cache quite ineffective for a
typical workload of a window with several small widgets in it.
I only stumbled upon the caching code a few months ago while
debugging something else, though, so I may very well be mistaken.
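
For illustration, a typical GTK+ 2 expose handler looks roughly
like this (purely an example, not code from any particular widget):

#include <gtk/gtk.h>

/* Every repaint creates a brand new cairo_t, so anything cached in
 * the previous CR is thrown away along with it. */
static gboolean
on_expose(GtkWidget *widget, GdkEventExpose *event, gpointer data)
{
    cairo_t *cr = gdk_cairo_create(widget->window);  /* fresh CR each time */

    cairo_set_source_rgb(cr, 0.2, 0.4, 0.8);  /* solid source; the X backend
                                                 may back it with a 1x1 pixmap */
    cairo_paint(cr);

    cairo_destroy(cr);   /* whatever was cached in the CR dies here */
    return FALSE;
}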

On git master, Michel Dänzer has recently been pushing a long
run of EXA performance patches.  I've only had a quick glance,
but it seems they may cure some of the problems discussed above:

$ git-log 8cfcf9..e8093e | git-shortlog
Michel Dänzer (14):
      EXA: Track valid bits in Sys and FB separately.
      Add DamagePendingRegion.
      EXA: Support partial migration of pixmap contents between Sys and FB.
      EXA: Hide pixmap pointer outside of exaPrepare/FinishAccess whenever possible.
      EXA: Improvements for trapezoids and triangles.
      EXA: exaImageGlyphBlt improvements.
      EXA: Improvements for 1x1 pixmaps.
      EXA: RENDER improvements.
      EXA: Remove superfluous manual damage tracking.
      EXA: exaGetImage improvements.
      EXA: exa(Shm)PutImage improvements.
      EXA: Use exaShmPutImage for pushing glyphs to scratch pixmap in exaGlyphs.
      EXA: exaFillRegion{Solid,Tiled} improvements.
      EXA: Exclude bits that will be overwritten from migration in exaCopyNtoN.


Aleph, I guess it may be useful to re-run the tests after
applying these patches.  If merging them onto the 1.4 branch
turns out to be difficult, using the code from master should be
fine; the two don't seem to have diverged much yet.


> Also, I think we're spending way too much time in
> gp_color_bitmap_to_screen_blt() - is there any way we
> can get more in-depth profiling in that one function?

Good idea!

Meanwhile, I looked at gp_color_bitmap_to_screen_blt() and it
seems we're issuing a separate blit per horizontal line of the
source data.  That is correct for the general case, where the
destination width may not match the source pitch.

However, when we invoke gp_color_bitmap_to_screen_blt() for
uploads, I'd expect the destination buffer to match the source,
so a single blit would work.

If my guess is right, special-casing "pitch == width*bpp"
would be a big win.  Does anyone mind adding an ErrorF()?
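
Something along these lines is what I have in mind; just a
sketch with made-up names (upload_rect_sketch() and blit_rows()
are hypothetical stand-ins), not the real upload path:

/* Sketch only: special-case a contiguous source so the whole
 * rectangle goes out as a single blit instead of one per scanline. */
static void
upload_rect_sketch(unsigned char *src, int src_pitch,
                   int width, int height, int bpp /* bytes per pixel */)
{
    int y;

    if (src_pitch == width * bpp) {
        /* the ErrorF() suggested above, to see how often we hit this */
        ErrorF("upload_rect_sketch: contiguous source, single blit\n");
        blit_rows(src, height);                   /* hypothetical helper */
    } else {
        for (y = 0; y < height; y++)              /* general case */
            blit_rows(src + y * src_pitch, 1);    /* one line at a time */
    }
}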


NOTE ALEPH: I think we stopped development in the xf86-amd-devel
repo some time ago.  The correct driver nowadays would be the
fd.o one.  Jordan, can you confirm this?

-- 
   //  Bernardo Innocenti - http://www.codewiz.org/
 \X/ One Laptop Per Child - http://www.laptop.org/



