[Mesa-dev] st_TexSubImage: unaligned memcpy performance

Vasilis Liaskovitis vliaskov at gmail.com
Thu Apr 9 02:34:33 PDT 2015


Hi,

On Wed, Apr 8, 2015 at 1:24 PM, Daniel Stone <daniel at fooishbar.org> wrote:

> Hi,
>
> On 8 April 2015 at 10:57, Vasilis Liaskovitis <vliaskov at gmail.com> wrote:
> > I have an issue where st_TexSubImage causes very high CPU load in
> > __memcpy_sse2_unaligned (Mesa 10.1.3, Xorg 1.15.1, radeon driver, HD 7870).
> >
> > Any obvious causes or tips for this, e.g. aligning textures or using a
> > different format/type? I've tried GL_BGRA/GL_UNSIGNED_BYTE and
> > GL_BGRA/GL_UNSIGNED_INT_8_8_8_8_REV.
> >
> > __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:85
> > 85    ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: No such file or directory.
> > (gdb) bt
> > #0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:85
> > #1  0x00007fffb572f154 in memcpy (__len=7680, __src=<optimized out>,
> >     __dest=0x7fff5835f800) at /usr/include/x86_64-linux-gnu/bits/string3.h:51
> > #2  st_TexSubImage (ctx=0x1b91420, dims=<optimized out>, texImage=0x1f81710,
> >     xoffset=0, yoffset=0, zoffset=0, width=1920, height=1080, depth=1,
> >     format=32993, type=5121, pixels=0xdacf90, unpack=0x1bad590)
> >     at ../../../../src/mesa/state_tracker/st_cb_texture.c:752
>
> Your source (0xdacf90) is only aligned to a 16-byte boundary, not 32.
> This will cause issues particularly on ARM, where natural alignment is
> required (i.e. 32-byte load/stores must be on 32-byte boundaries). By
> contrast, the destination is already aligned to a 128-byte boundary.
> So fixing the caller, rather than Mesa, should take care of the
> problem.
>

Thanks for the reply and the observation. I aligned the source on a 32-byte
boundary (and even a 128-byte boundary), but there was no difference.
By the way, I am only using x86_64, not ARM. I believe Intel SSE2 only
requires 16-byte alignment, but perhaps I am missing something.

Is this code path in st_TexSubImage using PBOs? I guess that depends on the
driver's implementation (radeon in my case)?

Related: the pboUnpack demo (http://www.songho.ca/opengl/files/pboUnpack.zip)
reports "Transfer Rate: 236.5 MB/s. (59.1 FPS)".
Does this sound reasonable for uploading with a PBO?

The same __memcpy_sse2_unaligned bottleneck shows up there.
Sample perf report output:

 28,20%  pboUnpack  libc-2.19.so            [.] __memcpy_sse2_unaligned
 16,63%  pboUnpack  pboUnpack               [.] 0x0000000000006542
  6,96%  pboUnpack  [kernel.kallsyms]       [k] clear_page_c_e
  2,52%  pboUnpack  [drm]                   [k] drm_mm_insert_node_in_range_generic
  2,10%  pboUnpack  [kernel.kallsyms]       [k] get_page_from_freelist


Backtrace:

__memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:86
86    ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: No such file or directory.
(gdb) bt
#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:86
#1  0x00007ffff2bddbbd in memcpy (__len=4194304, __src=<optimized out>,
    __dest=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/string3.h:51
#2  memcpy_texture (dimensions=dimensions@entry=2,
    dstFormat=dstFormat@entry=MESA_FORMAT_B8G8R8A8_UNORM,
    dstRowStride=dstRowStride@entry=4096, dstSlices=dstSlices@entry=0x7fffffffd6e8,
    srcWidth=srcWidth@entry=1024, srcHeight=srcHeight@entry=1024,
    srcDepth=srcDepth@entry=1, srcFormat=srcFormat@entry=32993,
    srcType=srcType@entry=5121, srcAddr=srcAddr@entry=0x7fffeeecd000,
    srcPacking=srcPacking@entry=0x7ffff7f69180, ctx=<optimized out>)
    at ../../../../src/mesa/main/texstore.c:949
#3  0x00007ffff2be353d in _mesa_texstore_memcpy (srcPacking=0x7ffff7f69180,
    srcAddr=<optimized out>, srcType=5121, srcFormat=32993, srcDepth=<optimized out>,
    srcHeight=<optimized out>, srcWidth=<optimized out>, dstSlices=<optimized out>,
    dstRowStride=<optimized out>, dstFormat=MESA_FORMAT_B8G8R8A8_UNORM,
    baseInternalFormat=6408, dims=<optimized out>, ctx=0x7ffff7f4d010)
    at ../../../../src/mesa/main/texstore.c:3938
#4  _mesa_texstore (ctx=0x7ffff7f4d010, dims=2, baseInternalFormat=6408,
    dstFormat=MESA_FORMAT_B8G8R8A8_UNORM, dstRowStride=4096,
    dstSlices=0x7fffffffd6e8, srcWidth=1024, srcHeight=1024, srcDepth=1,
    srcFormat=32993, srcType=5121, srcAddr=0x7fffeeecd000,
    srcPacking=0x7ffff7f69180) at ../../../../src/mesa/main/texstore.c:3958
#5  0x00007ffff2be3812 in store_texsubimage (ctx=ctx@entry=0x7ffff7f4d010,
    texImage=texImage@entry=0x7c8690, xoffset=xoffset@entry=0,
    yoffset=yoffset@entry=0, zoffset=zoffset@entry=0, width=1024,
    height=1024, depth=1, format=32993, type=5121, pixels=0x0,
    packing=0x7ffff7f69180, caller=0x7ffff2d609c7 "glTexSubImage")
    at ../../../../src/mesa/main/texstore.c:4107
#6  0x00007ffff2be3aa5 in _mesa_store_texsubimage (ctx=ctx@entry=0x7ffff7f4d010,
    dims=<optimized out>, texImage=texImage@entry=0x7c8690,
    xoffset=xoffset@entry=0, yoffset=yoffset@entry=0,
    zoffset=zoffset@entry=0, width=<optimized out>, width@entry=1024,
    height=<optimized out>, height@entry=1024, depth=<optimized out>,
    depth@entry=1, format=<optimized out>, format@entry=32993,
    type=<optimized out>, type@entry=5121, pixels=<optimized out>,
    pixels@entry=0x0, packing=<optimized out>, packing@entry=0x7ffff7f69180)
    at ../../../../src/mesa/main/texstore.c:4171
#7  0x00007ffff2c3acaa in st_TexSubImage (ctx=0x7ffff7f4d010,
    dims=<optimized out>, texImage=0x7c8690, xoffset=0, yoffset=0, zoffset=0,
    width=1024, height=1024, depth=1, format=32993, type=5121,
    pixels=0x0, unpack=0x7ffff7f69180)
    at ../../../../src/mesa/state_tracker/st_cb_texture.c:787
#8  0x00007ffff2bce83d in texsubimage (ctx=0x7ffff7f4d010, dims=dims@entry=2,
    target=3553, level=0, xoffset=0, yoffset=0, zoffset=zoffset@entry=0,
    width=1024, height=1024, depth=depth@entry=1,
    format=format@entry=32993, type=type@entry=5121, pixels=pixels@entry=0x0)
    at ../../../../src/mesa/main/teximage.c:3445
#9  0x00007ffff2bd259c in _mesa_TexSubImage2D (target=<optimized out>,
    level=<optimized out>, xoffset=<optimized out>, yoffset=<optimized out>,
    width=<optimized out>, height=<optimized out>,
    format=32993, type=5121, pixels=0x0)
    at ../../../../src/mesa/main/teximage.c:3483


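The pixels pointer in st_TexSubImage is 0x0 here, maybe because it's an
internal PBO-to-texture transfer, where a bound GL_PIXEL_UNPACK_BUFFER turns
the pointer into a byte offset into the buffer rather than a client address?
srcAddr in memcpy_texture() is 0x7fffeeecd000, which is page-aligned
(4096 bytes), but maybe this is not the right pointer to look at.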

Could there also be a CPU stall/sync issue when mapping a PBO?

Similar pboUnpack/memcpy performance was discussed briefly here recently,
with no conclusion:
http://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&date=2015-01-01

Thanks,

- Vasilis




>
> Cheers,
> Daniel
>


More information about the dri-devel mailing list