[Mesa-dev] Possible ideas for optimisations in Mesa

Tue May 12 15:12:33 PDT 2015

On Sat, 2015-04-18 at 12:26 +0200, Marek Olšák wrote:
> On Fri, Apr 17, 2015 at 1:21 PM, Timothy Arceri <t_arceri at yahoo.com.au> wrote:
> > Hi all,
> >
> > Last year I spent a whole bunch of time profiling Mesa looking for areas
> > where improvements could be made. Anyway I thought I'd point out a
> > couple of things, and see if anyone thinks these are worthwhile
> > following up.
> >
> > 1. While the hash table has been getting a lot of attention lately,
> > after running the TF2 benchmark one place that showed up as using more
> > CPU than the hash table was the GLSL parser. I guess this can be mostly
> > solved once mesa has a disk cache for shaders.
> >
> > But something I came across at the time was this paper describing
> > modifying (with apparently little effort) bison to generate a hardcoded
> > parser that is 2.5-6.5 times faster while generating a slightly bigger
> > binary [1].
> >
> > The resulting project has been lost in the sands of time unfortunately
> > so I couldn't try it out.
> >
> > 2. On most of the old Quake engine benchmarks the Intel driver spends
> > between 3% and 4.5% of its time, or 400 million calls, in glibc because
> > the memcpy() can't be inlined in this bit of code from
> > copy_array_to_vbo_array():
> >
> >       while (count--) {
> >          memcpy(dst, src, dst_stride);
> >          src += src_stride;
> >          dst += dst_stride;
> >       }
> >
> > I looked in other drivers but I couldn't see them doing this kind of
> > thing. I'd imagine that because of its nature this code could be a
> > bottleneck. Are there any easy ways to avoid doing this type of copy? Or
> > would the only option be to write a complex optimisation?
> 
> Yeah, other drivers don't do this. In Gallium, we don't change the
> stride when uploading buffers, so in our case src_stride ==
> dst_stride.
> 

Thanks Marek. Looking at the history of the Intel code in git, it seems
that when the code was first written memcpy() wasn't used and the data was
just copied 8 bits at a time. In that case you can see the advantage of
doing the copy this way; however, with the use of memcpy() there doesn't
seem to be much of a difference between the code paths.

Out of interest I implemented my own version of memcpy() that can do the
copies with mismatched strides. I did this by aligning the memory to
8 bytes, doing some shifts in temporaries if needed and then doing 64-bit
copies.
It was made simpler for my test case because the strides were always
12 bytes for dst and 16 bytes for src.
In the end my memcpy() used slightly less CPU and could give a
measurable boost in frame rate in the UrbanTerror benchmark, although
the boost isn't always measurable and the frame rate is mostly about the
same. I suspect the boost only happens when memory isn't aligned to 8 bytes.
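
For anyone who wants to experiment with the fixed 16-byte src / 12-byte dst
case, here is a rough sketch. It is not the code I described above (no manual
alignment or shifting); it is a simpler variant that assumes the compiler
expands fixed-size memcpy() calls inline, which also keeps unaligned pointers
safe. The function name copy_stride16_to_stride12() is made up for
illustration and is not part of Mesa.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not Mesa code: pack `count` 12-byte elements that
 * are spaced 16 bytes apart in `src` into a tightly packed `dst`.
 * Assumes the two buffers do not overlap. */
static void
copy_stride16_to_stride12(uint8_t *dst, const uint8_t *src, size_t count)
{
   assert(count == 0 || (dst != NULL && src != NULL));

   while (count--) {
      uint64_t lo;
      uint32_t hi;

      /* Fixed-size memcpy() calls are normally turned into plain loads
       * and stores by the compiler, and they remain valid for unaligned
       * pointers. One 64-bit plus one 32-bit move covers the 12 bytes. */
      memcpy(&lo, src, sizeof(lo));
      memcpy(&hi, src + 8, sizeof(hi));
      memcpy(dst, &lo, sizeof(lo));
      memcpy(dst + 8, &hi, sizeof(hi));

      src += 16;
      dst += 12;
   }
}
```

Whether this actually beats the library memcpy() call would need to be
measured; the only point is that with constant sizes nothing has to go
through a function call.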

On average there seem to be around 150 to 200 of these copies done each
time this loop is hit in UrbanTerror, so in theory my memcpy() may be
able to be made even faster with SSE using load/store and some
shuffling. I did attempt this but haven't got it to work yet.
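
To give an idea of the SSE direction, below is a minimal sketch for the same
16-byte src / 12-byte dst case. It is not the version I attempted: instead of
shuffling it uses unaligned 16-byte loads and overlapping 16-byte stores, and
it copies the last element with a plain 12-byte memcpy() so nothing is read or
written past the end of either buffer. The function name is made up, the code
assumes the buffers do not overlap, and it falls back to memcpy() when SSE2 is
not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper, not Mesa code: src stride 16, dst stride 12.
 * Assumes dst has room for count * 12 bytes and no overlap with src. */
static void
copy_stride16_to_stride12_sse2(uint8_t *dst, const uint8_t *src, size_t count)
{
   assert(count == 0 || (dst != NULL && src != NULL));

#if defined(__SSE2__)
   /* All but the last element: load 16 bytes, store 16 bytes. The 4 extra
    * bytes written land in the next element's slot and are overwritten by
    * the following iteration (or by the tail copy below). */
   while (count > 1) {
      __m128i v = _mm_loadu_si128((const __m128i *)src);
      _mm_storeu_si128((__m128i *)dst, v);
      src += 16;
      dst += 12;
      count--;
   }
#endif

   /* Remaining elements (only the last one on SSE2 builds, all of them
    * otherwise): copy exactly 12 bytes each. */
   while (count--) {
      memcpy(dst, src, 12);
      src += 16;
      dst += 12;
   }
}
```

I have not benchmarked this variant, so treat it as a starting point rather
than a result.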

In the end I'm not sure if implementing a custom memcpy() is worth all
the effort but thought I'd post my findings. My memcpy() code is a bit
of a mess at the moment but if anyone is interested I can clean it up
and push it to my github repo, just let me know.

Tim 

> Marek