xf86XVCopyPacked() and friends: why so slow?

Thu Feb 4 16:45:29 PST 2010

While playing some video with mplayer, I noticed with oprofile that
half the time is spent in xf86XVCopyPacked() or xf86XVCopyYUV12ToPacked().

Looking at the former, I wonder why a plain memcpy was not used instead
of "manually" copying each word. glibc's memcpy is usually optimized
for the target architecture, while there is little the compiler can do
to optimize the given code.
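
To illustrate what I mean, here is a minimal sketch of a row-by-row
copy built on memcpy. The function name copy_packed_rows and its
parameters are hypothetical, not the X server's actual interface, and
I assume pitches and the row width are given in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not the X server's code): copy h rows of packed
 * pixels, row_bytes bytes each, between two buffers whose rows start
 * src_pitch and dst_pitch bytes apart. */
void
copy_packed_rows(const uint8_t *src, uint8_t *dst,
                 size_t src_pitch, size_t dst_pitch,
                 size_t row_bytes, size_t h)
{
    assert(row_bytes <= src_pitch && row_bytes <= dst_pitch);

    if (src_pitch == row_bytes && dst_pitch == row_bytes) {
        /* No padding between rows: a single large memcpy is enough. */
        memcpy(dst, src, row_bytes * h);
        return;
    }

    /* Otherwise copy one row at a time, skipping the padding. */
    while (h-- > 0) {
        memcpy(dst, src, row_bytes);
        src += src_pitch;
        dst += dst_pitch;
    }
}
```

Whether this actually beats the hand-written loop would have to be
measured on the target machine.
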
Also, for the planar to packed version, you can achieve much better
performance using vector instructions, but it's harder to do that
portably.
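
For reference, a portable scalar version of such a conversion could
look roughly like the sketch below. I am assuming 4:2:0 planar input
(one U and one V sample per 2x2 block of pixels) and YUY2 output
(byte order Y0 U Y1 V); the name planar420_to_yuy2 and its parameters
are made up for illustration. The inner loop is the byte interleave
that vector instructions would speed up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference (not the X server's code): convert a
 * w x h planar 4:2:0 image to packed YUY2. Pitches are in bytes;
 * w and h are assumed to be even. */
void
planar420_to_yuy2(const uint8_t *y, const uint8_t *u, const uint8_t *v,
                  uint8_t *dst,
                  size_t y_pitch, size_t uv_pitch, size_t dst_pitch,
                  size_t w, size_t h)
{
    assert(w % 2 == 0 && h % 2 == 0);

    for (size_t row = 0; row < h; row++) {
        const uint8_t *yr = y + row * y_pitch;
        /* Each chroma row is shared by two consecutive luma rows. */
        const uint8_t *ur = u + (row / 2) * uv_pitch;
        const uint8_t *vr = v + (row / 2) * uv_pitch;
        uint8_t *d = dst + row * dst_pitch;

        /* Two pixels in, four bytes out: Y0 U Y1 V. */
        for (size_t x = 0; x < w; x += 2) {
            d[0] = yr[x];
            d[1] = ur[x / 2];
            d[2] = yr[x + 1];
            d[3] = vr[x / 2];
            d += 4;
        }
    }
}
```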

So I suppose there is a good reason why these functions are so slow.
Maybe because video drivers are supposed to provide better ones?
Or maybe because the plan is to use an external library like pixman
for this kind of job in the future?

More to the point, what I'm trying to find out is whether I'm supposed
to optimize my video driver so that it does not use these functions, or
whether it's OK to optimize them instead. Which path should I follow?