[Intel-gfx] GEM object write
Shaohua Li
shaohua.li at intel.com
Wed Apr 1 03:15:34 CEST 2009
On Wed, 2009-04-01 at 02:44 +0800, Eric Anholt wrote:
> On Tue, 2009-03-31 at 09:50 +0800, Shaohua Li wrote:
> > On Tue, 2009-03-31 at 00:32 +0800, Keith Packard wrote:
> > > On Mon, 2009-03-30 at 09:19 +0800, Shaohua Li wrote:
> > > > Hi,
> > > > I recently did some benchmarks with different GEM object write
> > > > methods (the three paths are also sketched in code below the list):
> > > >
> > > > 1. bo_map
> > > > This approach memory-maps the GEM object as write-back and then
> > > > flushes the cache to memory. I did a benchmark comparing a
> > > > write-back mapping (followed by clflush) against a write-combine
> > > > mapping. In my test, write-combine is about 3 times faster than
> > > > write-back (please try the attached module). The data is quite
> > > > stable in my test.
> > > >
> > > > 2. pwrite
> > > > pwrite will almost always map the GEM object as write-combine (when
> > > > the object is in the GTT, which is the general case), but it adds an
> > > > extra copy. In my XvMC test, the pwrite approach causes a 20%
> > > > performance loss.
> > > >
> > > > 3. bo_map_gtt
> > > > This approach binds the GEM object into the GTT and maps it as
> > > > write-combine. This is the fastest approach and matches the
> > > > performance without GEM, but the object must stay bound to the GTT
> > > > and can't be swapped out, since the mapping goes through the device.
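
For reference, here is a minimal sketch of what the three paths above look
like through libdrm's intel bufmgr API. This is only an illustration,
assuming an already-open DRM fd and the names from intel_bufmgr.h; the
function name upload_three_ways, the buffer size, error handling and the
actual timing loops are all placeholders:

/* Rough sketch of the three upload paths via libdrm's intel bufmgr.
 * Assumes `fd' is an open DRM fd; error handling omitted. */
#include <string.h>
#include "intel_bufmgr.h"

#define BUF_SIZE (512 * 1024)   /* arbitrary object size for illustration */

static void upload_three_ways(int fd, const void *data)
{
        drm_intel_bufmgr *bufmgr = drm_intel_bufmgr_gem_init(fd, 4096);
        drm_intel_bo *bo = drm_intel_bo_alloc(bufmgr, "upload", BUF_SIZE, 4096);

        /* 1. bo_map: cached (write-back) CPU mapping; the cache has to be
         * clflushed before the GPU can see the data. */
        drm_intel_bo_map(bo, 1 /* write_enable */);
        memcpy(bo->virtual, data, BUF_SIZE);
        drm_intel_bo_unmap(bo);

        /* 2. pwrite: the kernel copies from user memory into the object,
         * i.e. one extra copy compared with the mapped paths. */
        drm_intel_bo_subdata(bo, 0, BUF_SIZE, data);

        /* 3. bo_map_gtt: write-combined mapping through the GTT aperture;
         * the object stays bound to the GTT while mapped. */
        drm_intel_gem_bo_map_gtt(bo);
        memcpy(bo->virtual, data, BUF_SIZE);
        drm_intel_gem_bo_unmap_gtt(bo);

        drm_intel_bo_unreference(bo);
        drm_intel_bufmgr_destroy(bufmgr);
}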
> > >
> > > Your example is biased in favor of WC mapping as it only writes 1 byte
> > > in 64 to the object. I propose a slightly different test which would
> > > model how we expect rendering operations to access memory (at least for
> > > streaming data from CPU to GPU):
> > >
> > > 1. Allocate a large pool of memory to simulate pages behind the
> > >    aperture
> > > 2. For pwrite mappings:
> > >    1. Allocate a small (32kB) pool of memory
> > >    2. Write data to the small buffer
> > >    3. Copy that data to the "aperture"
> > >    4. clflush
> > > 3. For WB mappings:
> > >    1. Write data to the "aperture"
> > >    2. clflush
> > > 4. For WC mappings:
> > >    1. Write data to the "aperture"
> > >
> > > In each case the writes should be 4 bytes, aligned on 4-byte
> > > boundaries, and should fill the nominal buffer size (32kB); each pass
> > > should use a different section of the aperture, as a streaming
> > > application would (a rough sketch of these loops is below).
> > >
> > > Given that WC mapping is only 3x faster than WB mapping + clflush when
> > > writing only 1/64 of a cache line each time, I think it will be
> > > interesting to see how this works when writing the full amount of data.
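
A minimal userspace sketch of the proposed loops, just to pin down the
access pattern. Here "aperture" is ordinary malloc'd memory standing in
for the real WB/WC mapping, and _mm_clflush() from emmintrin.h stands in
for the kernel's cache flush, so absolute numbers won't match the real
thing; the caller is expected to advance `aperture' by BUF_SIZE each pass,
as a streaming application would:

#include <emmintrin.h>  /* _mm_clflush() */
#include <stdint.h>
#include <string.h>

#define BUF_SIZE  (32 * 1024)   /* nominal buffer size from the proposal */
#define CACHELINE 64

/* pwrite model: fill a small staging buffer, copy it out, then flush. */
static void pwrite_pass(uint32_t *aperture, uint32_t *staging)
{
        unsigned int i;

        for (i = 0; i < BUF_SIZE / 4; i++)
                staging[i] = i;
        memcpy(aperture, staging, BUF_SIZE);
        for (i = 0; i < BUF_SIZE; i += CACHELINE)
                _mm_clflush((char *)aperture + i);
}

/* WB model: 4-byte writes straight into the mapping, then flush the lines. */
static void wb_pass(uint32_t *aperture)
{
        unsigned int i;

        for (i = 0; i < BUF_SIZE / 4; i++)
                aperture[i] = i;
        for (i = 0; i < BUF_SIZE; i += CACHELINE)
                _mm_clflush((char *)aperture + i);
}

/* WC model: 4-byte writes straight into the mapping, no flush needed. */
static void wc_pass(uint32_t *aperture)
{
        unsigned int i;

        for (i = 0; i < BUF_SIZE / 4; i++)
                aperture[i] = i;
}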
> > Just tried 4-byte accesses; the result for the WB/WC mappings isn't
> > changed. WC mapping is still about 3x faster than WB mapping + clflush.
> > Please give it a try.
> > I'll do a benchmark for the pwrite path later.
>
> I've actually gone and written those benchmarks -- I did it while
> working on the locking changes last week. I think my test is also more
> useful as it actually renders using the result, so it gets closer to
> "real world". Check out the repo I've put up at:
>
> git://anongit.freedesktop.org/git/xorg/app/intel-gpu-tools
>
> This should be the new home for our various userland tools. I'm hoping
> to get the time to build a bunch more regression tests for the DRM,
> since catching them in our userland drivers before users do has proved
> rather ineffective so far.
>
> So, results for pwrite versus bo_map_gtt+write versus bo_map+write on a
> couple systems (values in MB/s):
>
> x g45-upload_blit_large.txt
> + g45-upload_blit_large_gtt.txt
> * g45-upload_blit_large_map.txt
> +------------------------------------------------------------------------------+
> | * |
> |* * x + |
> |* * x x x + + ++|
> ||_AM |__M_A___| |_A__||
> +------------------------------------------------------------------------------+
>            N          Min          Max       Median          Avg       Stddev
> x          5       2302.7       2525.2       2351.2      2374.68    86.519287
> +          5       2705.1       2797.2       2744.5         2747    42.522053
> Difference at 95.0% confidence
>         372.32 +/- 99.4189
>         15.6787% +/- 4.18662%
>         (Student's t, pooled s = 68.1679)
> *          5       1409.2       1463.5       1461.2      1441.58    28.984254
> Difference at 95.0% confidence
>         -933.1 +/- 94.0988
>         -39.2937% +/- 3.96259%
>         (Student's t, pooled s = 64.5201)
>
> Summary: bo_map_gtt 16% faster than pwrite for large uploads, bo_map 39%
> slower. Nothing really shocked me here.
>
> x 945gm-upload_blit_large.txt
> + 945gm-upload_blit_large_gtt.txt
> * 945gm-upload_blit_large_map.txt
> +------------------------------------------------------------------------------+
> |+ *|
> |+ x *|
> |+ x *|
> |+ x *|
> |+ xx *|
> |A |A A|
> +------------------------------------------------------------------------------+
>            N          Min          Max       Median          Avg       Stddev
> x          5        602.8        608.1        604.6       605.28     2.137054
> +          5        104.2        104.8        104.8       104.66    0.2607681
> Difference at 95.0% confidence
>         -500.62 +/- 2.22024
>         -82.7088% +/- 0.366811%
>         (Student's t, pooled s = 1.52233)
> *          5        670.9        673.6        672.6       672.42    1.0568822
> Difference at 95.0% confidence
>         67.14 +/- 2.45868
>         11.0924% +/- 0.406205%
>         (Student's t, pooled s = 1.68582)
>
> Summary: bo_map_gtt was 83% slower than pwrite for large uploads. It
> looks like we're getting an uncached mapping or something similar here:
> 99% of the CPU time was spent in the function that writes data into the
> map, with no kernel time. bo_map was, surprisingly, 11% *faster* than
> pwrite.

Is PAT enabled in the kernel? The difference in the results is too big.
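
In case it helps, a trivial userspace check for the CPU-side flag. This
only shows that the processor reports PAT; whether the kernel actually
uses it also depends on CONFIG_X86_PAT and on not booting with "nopat",
which would have to be verified separately on the test machine:

/* Scan /proc/cpuinfo for the "pat" CPU flag. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[1024];
        FILE *f = fopen("/proc/cpuinfo", "r");

        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f)) {
                if (strncmp(line, "flags", 5) == 0) {
                        printf("pat: %s\n",
                               strstr(line, " pat ") ? "supported" : "not reported");
                        break;
                }
        }
        fclose(f);
        return 0;
}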