[Intel-gfx] GEM object write
Shaohua Li
shaohua.li at intel.com
Wed Apr 1 03:15:34 CEST 2009
On Wed, 2009-04-01 at 02:44 +0800, Eric Anholt wrote:
> On Tue, 2009-03-31 at 09:50 +0800, Shaohua Li wrote:
> > On Tue, 2009-03-31 at 00:32 +0800, Keith Packard wrote:
> > > On Mon, 2009-03-30 at 09:19 +0800, Shaohua Li wrote:
> > > > Hi,
> > > > I recently did some benchmarks with different GEM object write
> > > > methods (the three paths are also sketched in code below the list):
> > > >
> > > > 1. bo_map
> > > > This approach memory-maps the GEM object as write-back and then
> > > > flushes the cache to memory. I did a benchmark comparing a
> > > > write-back mapping (followed by clflush) against a write-combine
> > > > mapping. In my test, write-combine is about 3 times faster than
> > > > write-back (please try the attached module). The data is quite
> > > > stable in my test.
> > > >
> > > > 2. pwrite
> > > > pwrite will almost always map the GEM object as write-combine (when
> > > > the object is in the GTT, which is the general case), but it adds an
> > > > extra copy. In my XvMC test, the pwrite approach causes a 20%
> > > > performance loss.
> > > >
> > > > 3. bo_map_gtt
> > > > This approach binds the GEM object into the GTT and maps it as
> > > > write-combine. This is the fastest approach and matches the
> > > > performance without GEM, but the object must stay bound to the GTT
> > > > and can't be swapped out, since the mapping goes through the device.
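
For reference, here is a minimal sketch of what the three paths above look
like through libdrm's intel bufmgr API. This is only an illustration,
assuming an already-open DRM fd and the names from intel_bufmgr.h; the
function name upload_three_ways, the buffer size, error handling and the
actual timing loops are all placeholders:

/* Rough sketch of the three upload paths via libdrm's intel bufmgr.
 * Assumes `fd' is an open DRM fd; error handling omitted. */
#include <string.h>
#include "intel_bufmgr.h"

#define BUF_SIZE (512 * 1024)   /* arbitrary object size for illustration */

static void upload_three_ways(int fd, const void *data)
{
        drm_intel_bufmgr *bufmgr = drm_intel_bufmgr_gem_init(fd, 4096);
        drm_intel_bo *bo = drm_intel_bo_alloc(bufmgr, "upload", BUF_SIZE, 4096);

        /* 1. bo_map: cached (write-back) CPU mapping; the cache has to be
         * clflushed before the GPU can see the data. */
        drm_intel_bo_map(bo, 1 /* write_enable */);
        memcpy(bo->virtual, data, BUF_SIZE);
        drm_intel_bo_unmap(bo);

        /* 2. pwrite: the kernel copies from user memory into the object,
         * i.e. one extra copy compared with the mapped paths. */
        drm_intel_bo_subdata(bo, 0, BUF_SIZE, data);

        /* 3. bo_map_gtt: write-combined mapping through the GTT aperture;
         * the object stays bound to the GTT while mapped. */
        drm_intel_gem_bo_map_gtt(bo);
        memcpy(bo->virtual, data, BUF_SIZE);
        drm_intel_gem_bo_unmap_gtt(bo);

        drm_intel_bo_unreference(bo);
        drm_intel_bufmgr_destroy(bufmgr);
}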
> > >
> > > Your example is biased in favor of WC mapping as it only writes 1 byte
> > > in 64 to the object. I propose a slightly different test which would
> > > model how we expect rendering operations to access memory (at least for
> > > streaming data from CPU to GPU):
> > >
> > > 1. Allocate a large pool of memory to simulate pages behind the
> > >    aperture
> > > 2. For pwrite mappings:
> > >    1. Allocate a small (32kB) pool of memory
> > >    2. Write data to the small buffer
> > >    3. Copy that data to the "aperture"
> > >    4. clflush
> > > 3. For WB mappings:
> > >    1. Write data to the "aperture"
> > >    2. clflush
> > > 4. For WC mappings:
> > >    1. Write data to the "aperture"
> > >
> > > In each case the writes should be 4 bytes, aligned on 4-byte
> > > boundaries, and should fill the nominal buffer size (32kB); each pass
> > > should use a different section of the aperture, as a streaming
> > > application would (a rough sketch of these loops is below).
> > >
> > > Given that WC mapping is only 3x faster than WB mapping + clflush when
> > > writing only 1/64 of a cache line each time, I think it will be
> > > interesting to see how this works when writing the full amount of data.
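
A minimal userspace sketch of the proposed loops, just to pin down the
access pattern. Here "aperture" is ordinary malloc'd memory standing in
for the real WB/WC mapping, and _mm_clflush() from emmintrin.h stands in
for the kernel's cache flush, so absolute numbers won't match the real
thing; the caller is expected to advance `aperture' by BUF_SIZE each pass,
as a streaming application would:

#include <emmintrin.h>  /* _mm_clflush() */
#include <stdint.h>
#include <string.h>

#define BUF_SIZE  (32 * 1024)   /* nominal buffer size from the proposal */
#define CACHELINE 64

/* pwrite model: fill a small staging buffer, copy it out, then flush. */
static void pwrite_pass(uint32_t *aperture, uint32_t *staging)
{
        unsigned int i;

        for (i = 0; i < BUF_SIZE / 4; i++)
                staging[i] = i;
        memcpy(aperture, staging, BUF_SIZE);
        for (i = 0; i < BUF_SIZE; i += CACHELINE)
                _mm_clflush((char *)aperture + i);
}

/* WB model: 4-byte writes straight into the mapping, then flush the lines. */
static void wb_pass(uint32_t *aperture)
{
        unsigned int i;

        for (i = 0; i < BUF_SIZE / 4; i++)
                aperture[i] = i;
        for (i = 0; i < BUF_SIZE; i += CACHELINE)
                _mm_clflush((char *)aperture + i);
}

/* WC model: 4-byte writes straight into the mapping, no flush needed. */
static void wc_pass(uint32_t *aperture)
{
        unsigned int i;

        for (i = 0; i < BUF_SIZE / 4; i++)
                aperture[i] = i;
}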
> > Just tried 4-byte accesses; the result for the WB/WC mappings isn't
> > changed. WC mapping is still about 3x faster than WB mapping + clflush.
> > Please give it a try.
> > I'll do a benchmark for the pwrite path later.
>
> I've actually gone and written those benchmarks -- I did it while
> working on the locking changes last week. I think my test is also more
> useful as it actually renders using the result, so it gets closer to
> "real world". Check out the repo I've put up at:
>
> git://anongit.freedesktop.org/git/xorg/app/intel-gpu-tools
>
> This should be the new home for our various userland tools. I'm hoping
> to get the time to build a bunch more regression tests for the DRM,
> since catching them in our userland drivers before users do has proved
> rather ineffective so far.
>
> So, results for pwrite versus bo_map_gtt+write versus bo_map+write on a
> couple systems (values in MB/s):
>
> x g45-upload_blit_large.txt
> + g45-upload_blit_large_gtt.txt
> * g45-upload_blit_large_map.txt
> +------------------------------------------------------------------------------+
> | * |
> |* * x + |
> |* * x x x + + ++|
> ||_AM |__M_A___| |_A__||
> +------------------------------------------------------------------------------+
>            N          Min          Max       Median          Avg       Stddev
> x          5       2302.7       2525.2       2351.2      2374.68    86.519287
> +          5       2705.1       2797.2       2744.5         2747    42.522053
> Difference at 95.0% confidence
>         372.32 +/- 99.4189
>         15.6787% +/- 4.18662%
>         (Student's t, pooled s = 68.1679)
> *          5       1409.2       1463.5       1461.2      1441.58    28.984254
> Difference at 95.0% confidence
>         -933.1 +/- 94.0988
>         -39.2937% +/- 3.96259%
>         (Student's t, pooled s = 64.5201)
>
> Summary: bo_map_gtt 16% faster than pwrite for large uploads, bo_map 39%
> slower. Nothing really shocked me here.
>
> x 945gm-upload_blit_large.txt
> + 945gm-upload_blit_large_gtt.txt
> * 945gm-upload_blit_large_map.txt
> +------------------------------------------------------------------------------+
> |+ *|
> |+ x *|
> |+ x *|
> |+ x *|
> |+ xx *|
> |A |A A|
> +------------------------------------------------------------------------------+
>            N          Min          Max       Median          Avg       Stddev
> x          5        602.8        608.1        604.6       605.28     2.137054
> +          5        104.2        104.8        104.8       104.66    0.2607681
> Difference at 95.0% confidence
>         -500.62 +/- 2.22024
>         -82.7088% +/- 0.366811%
>         (Student's t, pooled s = 1.52233)
> *          5        670.9        673.6        672.6       672.42    1.0568822
> Difference at 95.0% confidence
>         67.14 +/- 2.45868
>         11.0924% +/- 0.406205%
>         (Student's t, pooled s = 1.68582)
>
> Summary: bo_map_gtt was 83% slower than pwrite for large uploads. It
> looks like we're getting an uncached mapping or something similar here:
> 99% of the CPU time was spent in the function that writes data into the
> map, with no kernel time. bo_map was, surprisingly, 11% *faster* than
> pwrite.

Is PAT enabled in the kernel? The difference in the results is too big.
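
In case it helps, a trivial userspace check for the CPU-side flag. This
only shows that the processor reports PAT; whether the kernel actually
uses it also depends on CONFIG_X86_PAT and on not booting with "nopat",
which would have to be verified separately on the test machine:

/* Scan /proc/cpuinfo for the "pat" CPU flag. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[1024];
        FILE *f = fopen("/proc/cpuinfo", "r");

        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f)) {
                if (strncmp(line, "flags", 5) == 0) {
                        printf("pat: %s\n",
                               strstr(line, " pat ") ? "supported" : "not reported");
                        break;
                }
        }
        fclose(f);
        return 0;
}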