[Intel-gfx] GEM object write

Eric Anholt eric at anholt.net
Tue Mar 31 20:44:42 CEST 2009


On Tue, 2009-03-31 at 09:50 +0800, Shaohua Li wrote:
> On Tue, 2009-03-31 at 00:32 +0800, Keith Packard wrote:
> > On Mon, 2009-03-30 at 09:19 +0800, Shaohua Li wrote:
> > > Hi,
> > > I recently did some benchmarks with different GEM object write methods
> > > 
> > > 1. bo_map.
> > > This approach memory-maps the gem object write-back (cacheable) and
> > > then flushes the cache out to memory. I did a benchmark comparing a
> > > write-back mapping (plus clflush) against a write-combining mapping.
> > > In my test, write-combining is about 3 times faster than write-back
> > > (please try the attached module). The data is quite stable in my test.
> > > 
> > > 2. pwrite
> > > pwrite usually ends up writing through a write-combining mapping (when
> > > the gem object is already bound to the gtt, which is the common case),
> > > but it adds an extra copy. In my XvMC test, the pwrite approach causes
> > > a 20% performance loss.
> > > 
> > > 3. bo_map_gtt
> > > This approach binds the gem object to the gtt and maps it
> > > write-combining. It is the fastest approach, matching the performance
> > > we had without GEM, but the object has to stay bound to the gtt and
> > > can't be swapped out while the device mapping exists.
> > 
> > Your example is biased in favor of WC mapping as it only writes 1 byte
> > in 64 to the object. I propose a slightly different test which would
> > model how we expect rendering operations to access memory (at least for
> > streaming data from CPU to GPU):
> > 
> >      1. Allocate a large pool of memory to simulate pages behind the
> >         aperture
> >      2. For pwrite mappings:
> >              1. allocate a small (32kB) pool of memory
> >              2. Write data to the small buffer
> >              3. Copy that data to the "aperture"
> >              4. clflush
> >      3. For WB mappings
> >              1. Write data to the "aperture"
> >              2. clflush
> >      4. For WC mappings
> >              1. Write data to the "aperture"
> > 
> > In each case, the writes should be 4 bytes, aligned on 4-byte
> > boundaries, and the writes should fill the nominal buffer size (32kB),
> > and you should use a different section of the aperture, as a streaming
> > application would.
> > 
> > Given that WC mapping is only 3x faster than WB mapping + clflush, when
> > writing only 1/64 of a cache line each time, I think it will be
> > interesting to see how this works when writing the full amount of data.
> I just tried 4-byte accesses; the result for the WB/WC mappings doesn't
> change. WC mapping is still about 3x faster than WB mapping + clflush.
> Please give it a try.
> I'll do a benchmark for the pwrite mapping later.
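For reference, the inner loops being discussed above would look roughly
like this (a minimal, untested sketch: it assumes x86 with 64-byte cache
lines, the pool/buffer sizes and helper names are just placeholders, and
the WC case is simply the plain write loop run against a real
write-combining mapping such as the GTT map, which the malloc'd memory
here obviously isn't):

/* Streaming-write comparison: fill 32kB with 4-byte writes, stepping
 * through a large pool so each pass hits a fresh "aperture" region. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define POOL_SIZE  (64 * 1024 * 1024)	/* stands in for aperture pages */
#define BUF_SIZE   (32 * 1024)		/* nominal per-pass buffer */
#define CACHELINE  64

static void clflush_range(void *addr, size_t len)
{
	char *p = (char *)((uintptr_t)addr & ~((uintptr_t)CACHELINE - 1));
	char *end = (char *)addr + len;

	for (; p < end; p += CACHELINE)
		__asm__ volatile ("clflush (%0)" : : "r" (p) : "memory");
	__asm__ volatile ("mfence" : : : "memory");
}

/* "WB mapping" case: 4-byte writes, then flush the lines out. */
static void wb_pass(uint32_t *dst, uint32_t seed)
{
	for (size_t i = 0; i < BUF_SIZE / sizeof(*dst); i++)
		dst[i] = seed + i;
	clflush_range(dst, BUF_SIZE);
}

/* "pwrite" case: generate into a small side buffer, then copy and flush. */
static void pwrite_pass(uint32_t *dst, uint32_t *staging, uint32_t seed)
{
	for (size_t i = 0; i < BUF_SIZE / sizeof(*staging); i++)
		staging[i] = seed + i;
	memcpy(dst, staging, BUF_SIZE);
	clflush_range(dst, BUF_SIZE);
}

int main(void)
{
	uint32_t *pool = malloc(POOL_SIZE);
	uint32_t *staging = malloc(BUF_SIZE);
	size_t step = BUF_SIZE / sizeof(*pool);

	/* Stream through the pool the way a client streams through the
	 * aperture, so each pass touches a cold region; wrap each loop in
	 * a timer to get MB/s. */
	for (size_t i = 0; i < POOL_SIZE / BUF_SIZE; i++)
		wb_pass(pool + i * step, (uint32_t)i);
	for (size_t i = 0; i < POOL_SIZE / BUF_SIZE; i++)
		pwrite_pass(pool + i * step, staging, (uint32_t)i);

	free(staging);
	free(pool);
	return 0;
}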

I've actually gone and written those benchmarks -- I did it while
working on the locking changes last week.  I think my test is also more
useful as it actually renders using the result, so it gets closer to
"real world".  Check out the repo I've put up at:

git://anongit.freedesktop.org/git/xorg/app/intel-gpu-tools

This should be the new home for our various userland tools.  I'm hoping
to get the time to build a bunch more regression tests for the DRM,
since catching DRM regressions through our userland drivers before users
hit them has proved rather ineffective so far.
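For context, here is roughly what the three upload paths look like
through libdrm's bufmgr API (a minimal sketch from memory of
intel_bufmgr.h, with error handling omitted; the real test in
intel-gpu-tools does more than this, in particular it blits from the
object afterwards so the result actually gets used):

#include <string.h>
#include <intel_bufmgr.h>	/* from libdrm */

static void upload(drm_intel_bufmgr *bufmgr, const void *data,
		   unsigned long size)
{
	drm_intel_bo *bo = drm_intel_bo_alloc(bufmgr, "upload", size, 4096);

	/* 1. pwrite: extra copy, the kernel picks how to reach the pages. */
	drm_intel_bo_subdata(bo, 0, size, data);

	/* 2. bo_map: cacheable CPU mapping; the dirty lines have to be
	 *    clflushed before the GPU reads them. */
	drm_intel_bo_map(bo, 1 /* write_enable */);
	memcpy(bo->virtual, data, size);
	drm_intel_bo_unmap(bo);

	/* 3. bo_map_gtt: write-combining mapping through the aperture; no
	 *    clflush needed, but the object stays bound to the GTT. */
	drm_intel_gem_bo_map_gtt(bo);
	memcpy(bo->virtual, data, size);
	drm_intel_gem_bo_unmap_gtt(bo);

	drm_intel_bo_unreference(bo);
}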

So, results for pwrite versus bo_map_gtt+write versus bo_map+write on a
couple systems (values in MB/s):

x g45-upload_blit_large.txt
+ g45-upload_blit_large_gtt.txt
* g45-upload_blit_large_map.txt
+------------------------------------------------------------------------------+
|   *                                                                          |
|*  *                                                x                   +     |
|*  *                                              x x         x         + + ++|
||_AM                                             |__M_A___|             |_A__||
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        2302.7        2525.2        2351.2       2374.68     86.519287
+   5        2705.1        2797.2        2744.5          2747     42.522053
Difference at 95.0% confidence
	372.32 +/- 99.4189
	15.6787% +/- 4.18662%
	(Student's t, pooled s = 68.1679)
*   5        1409.2        1463.5        1461.2       1441.58     28.984254
Difference at 95.0% confidence
	-933.1 +/- 94.0988
	-39.2937% +/- 3.96259%
	(Student's t, pooled s = 64.5201)

Summary: bo_map_gtt 16% faster than pwrite for large uploads, bo_map 39%
slower.  Nothing really shocked me here.

x 945gm-upload_blit_large.txt
+ 945gm-upload_blit_large_gtt.txt
* 945gm-upload_blit_large_map.txt
+------------------------------------------------------------------------------+
|+                                                                            *|
|+                                                                   x        *|
|+                                                                   x        *|
|+                                                                   x        *|
|+                                                                  xx        *|
|A                                                                  |A        A|
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5         602.8         608.1         604.6        605.28      2.137054
+   5         104.2         104.8         104.8        104.66     0.2607681
Difference at 95.0% confidence
	-500.62 +/- 2.22024
	-82.7088% +/- 0.366811%
	(Student's t, pooled s = 1.52233)
*   5         670.9         673.6         672.6        672.42     1.0568822
Difference at 95.0% confidence
	67.14 +/- 2.45868
	11.0924% +/- 0.406205%
	(Student's t, pooled s = 1.68582)

Summary: bo_map_gtt was 83% slower than pwrite for large uploads.  It
looks like we're getting an uncached mapping, or there's something else
going on here: 99% of the CPU time was spent in the function that writes
data into the map, with no kernel time.  bo_map was surprisingly 11%
*faster* than pwrite.

As usual with microbenchmarks, there are a bunch of problems that keep
them from really reflecting regular usage.  In particular, I'm concerned
that I'm not generating my data into a side memory space and then
copying it in for the bo_map and bo_map_gtt modes.  That means the
bo_map_* paths get a pass on some cache effects, even though most
consumers (pixmap upload, traditional GL texture upload, GL vertex
arrays, some use of GL VBOs) generate into system memory before sending
to the GPU; a fairer pass would look like the sketch below.  Still, I
think the microbenchmarks are useful, since they've made it obvious that
we've got something nasty going on with GTT mapping on my 945GM.
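
What that fairer pass would look like, roughly (a sketch, not what the
current test does; generate_data() is a hypothetical stand-in for
whatever the client does to produce its vertices or pixels in ordinary
cacheable memory):

#include <stdint.h>
#include <string.h>
#include <intel_bufmgr.h>

/* Hypothetical placeholder for the application's data generation. */
static void generate_data(uint32_t *buf, unsigned long size)
{
	for (unsigned long i = 0; i < size / sizeof(*buf); i++)
		buf[i] = i;
}

/* Generate into a cacheable staging buffer first, then stream the
 * finished data out through the (write-combining) GTT mapping. */
static void upload_via_staging(drm_intel_bo *bo, uint32_t *staging,
			       unsigned long size)
{
	generate_data(staging, size);

	drm_intel_gem_bo_map_gtt(bo);
	memcpy(bo->virtual, staging, size);
	drm_intel_gem_bo_unmap_gtt(bo);
}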

Oh, and for anyone interested in the little graphs, they're generated
with:
http://www.freebsd.org/cgi/cvsweb.cgi/src/usr.bin/ministat/ministat.c?rev=1.14;content-type=text%2Fplain
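
If you haven't used it before: ministat just takes one or more files of
newline-separated samples and prints the plot plus the Student's t
comparison shown above, along the lines of

  ministat g45-upload_blit_large.txt g45-upload_blit_large_gtt.txt g45-upload_blit_large_map.txt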

-- 
Eric Anholt
eric at anholt.net                         eric.anholt at intel.com

