[Mesa-dev] Possible ideas for optimisations in Mesa

Tue May 12 23:09:08 PDT 2015

On 05/12/2015 03:12 PM, Timothy Arceri wrote:
> On Sat, 2015-04-18 at 12:26 +0200, Marek Olšák wrote:
>> On Fri, Apr 17, 2015 at 1:21 PM, Timothy Arceri <t_arceri at yahoo.com.au> wrote:
>>> Hi all,
>>>
>>> Last year I spent a whole bunch of time profiling Mesa looking for areas
>>> where improvements could be made. Anyway I thought I'd point out a
>>> couple of things, and see if anyone thinks these are worthwhile
>>> following up.
>>>
>>> 1. While the hash table has been getting a lot of attention lately,
>>> after running the TF2 benchmark one place that showed up as using more
>>> CPU than the hash table was the GLSL parser. I guess this can be mostly
>>> solved once Mesa has a disk cache for shaders.
>>>
>>> But something I came across at the time was this paper describing
>>> modifying (with apparently little effort) bison to generate a hardcoded
>>> parser that is 2.5-6.5 times faster while generating a slightly bigger
>>> binary [1].
>>>
>>> The resulting project has been lost in the sands of time unfortunately
>>> so I couldn't try it out.
>>>
>>> 2. On most of the old Quake engine benchmarks the Intel driver spends
>>> between 3 and 4.5% of its time (around 400 million calls) in glibc,
>>> since memcpy() can't be inlined in this bit of code from
>>> copy_array_to_vbo_array():
>>>
>>>       while (count--) {
>>>          memcpy(dst, src, dst_stride);
>>>          src += src_stride;
>>>          dst += dst_stride;
>>>       }
>>>
>>> I looked in other drivers but I couldn't see them doing this kind of
>>> thing. I'd imagine that because of its nature this code could be a
>>> bottleneck. Are there any easy ways to avoid doing this type of copy?
>>> Or would the only thing possible be to write a complex optimisation?
>>
>> Yeah, other drivers don't do this. In Gallium, we don't change the
>> stride when uploading buffers, so in our case src_stride ==
>> dst_stride.
>>
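For illustration, a minimal sketch of why matching strides matter for the
loop quoted above: when src_stride == dst_stride the whole array can go
through a single memcpy() call instead of one call per element. The
helper name copy_strided() and its signature are hypothetical, and this
is not the actual Mesa or Gallium code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the real copy_array_to_vbo_array(). */
static void
copy_strided(unsigned char *dst, const unsigned char *src,
             size_t count, size_t src_stride, size_t dst_stride)
{
   /* Each element is dst_stride bytes long, as in the quoted loop. */
   assert(dst_stride <= src_stride);

   if (src_stride == dst_stride) {
      /* Matching layout: one call covers the whole array. */
      memcpy(dst, src, count * dst_stride);
      return;
   }

   /* Mismatched strides: fall back to one memcpy() per element. */
   while (count--) {
      memcpy(dst, src, dst_stride);
      src += src_stride;
      dst += dst_stride;
   }
}
```

With strides of 16 (src) and 12 (dst) this takes the per-element path;
with 16 and 16 it collapses to a single call.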
> 
> Thanks Marek. Looking at the history of the Intel code in git it seems
> when the code was first written memcpy() wasn't used and the data was
> just copied 8-bits at a time. In this case you can see the advantage of
> doing the copy this way, however with the use of memcpy() there doesn't
> seem to be much of a difference between the code paths.
> 
> Out of interest I implemented my own version of memcpy() that can do the
> copies with mismatched strides. I did this by aligning the memory to
> 8 bytes, doing some shifts in temporaries if needed and then doing
> 64-bit copies.
> It was made simpler for my test case because the strides were always
> 12 bytes for dst and 16 bytes for src.
> In the end my memcpy() used slightly less CPU and could give a
> measurable boost in frame rate in the UrbanTerror benchmark, although
> the boost isn't always measurable and performance is mostly about the
> same. I suspect the boost only happens when memory isn't aligned to
> 8 bytes.
> 
> On average there seem to be around 150 to 200 of these copies done each
> time this loop is hit in UrbanTerror, so in theory my memcpy() may be
> able to be made even faster with SSE using load/store and some
> shuffling. I did attempt this but haven't got it to work yet.
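
As a rough sketch of the fixed-stride case described above (16-byte src
stride, 12-byte dst elements), the loop can be specialised so that every
memcpy() has a small compile-time-constant size, which compilers such as
GCC and Clang can usually expand inline rather than calling into libc.
The function name copy_16_to_12() is hypothetical, and this is a
simplified stand-in: it does not reproduce the alignment and shifting
described in the message, nor the SSE variant.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical specialisation for src_stride == 16, dst_stride == 12.
 * Not the implementation described in the message. */
static void
copy_16_to_12(unsigned char *dst, const unsigned char *src, size_t count)
{
   while (count--) {
      uint64_t lo;   /* bytes 0..7 of the element  */
      uint32_t hi;   /* bytes 8..11 of the element */

      /* Constant-size copies through temporaries; safe for
       * unaligned src and dst pointers. */
      memcpy(&lo, src, sizeof(lo));
      memcpy(&hi, src + 8, sizeof(hi));
      memcpy(dst, &lo, sizeof(lo));
      memcpy(dst + 8, &hi, sizeof(hi));

      src += 16;   /* skip the 4 unused bytes in each source element */
      dst += 12;
   }
}
```

Whether this beats the generic loop would need measuring on the target
CPU; the sketch only shows the shape of the specialisation.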

What kind of system were you measuring on?  You might measure a bigger
delta on a Bay Trail system, for example.  You might also try locking
the CPU clock low.

I know Eero has some tips for measuring small changes in CPU usage.  It
can be... annoying. :)

> In the end I'm not sure if implementing a custom memcpy() is worth all
> the effort but thought I'd post my findings. My memcpy() code is a bit
> of a mess at the moment but if anyone is interested I can clean it up
> and push it to my github repo, just let me know.
> 
> Tim 
> 
>> Marek