[Intel-gfx] GEM object write

Tue Mar 31 08:56:54 CEST 2009

Hi,

I wrote another test program based on the original one.

The test result shows WB is faster than WC: WC/WB is about 8369/4421.
In this file I use the movnti instruction for the writes, in order to avoid issuing many clflush instructions.
Maybe we can do some optimization based on this.
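
For illustration, here is a minimal sketch of such a write loop. It is not the attached test.c; the helper name `fill_nontemporal` is hypothetical, and it assumes an x86 CPU with SSE2, where the `_mm_stream_si32` intrinsic compiles to movnti. Without SSE2 it falls back to ordinary stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32 (movnti), _mm_sfence */
#endif

/*
 * Hypothetical helper: fill `words` 32-bit slots starting at dst with
 * `value`, using non-temporal stores where SSE2 is available.
 */
static void fill_nontemporal(int32_t *dst, size_t words, int32_t value)
{
    /* The writes are 4 bytes each and must sit on 4-byte boundaries. */
    assert(((uintptr_t)dst & 3u) == 0);

    for (size_t i = 0; i < words; i++) {
#if defined(__SSE2__)
        _mm_stream_si32((int *)&dst[i], value);   /* movnti */
#else
        dst[i] = value;                           /* plain store fallback */
#endif
    }

#if defined(__SSE2__)
    /* Order the non-temporal stores before any later stores. */
    _mm_sfence();
#endif
}
```

Because the non-temporal stores are meant to bypass the cache, this loop has no per-line clflush after the writes; whether that explains the numbers above would need to be checked against the real test.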

Thanks
Ma Ling 

-----Original Message-----
From: intel-gfx-bounces at lists.freedesktop.org [mailto:intel-gfx-bounces at lists.freedesktop.org] On Behalf Of Shaohua Li
Sent: Tuesday, March 31, 2009 9:50 AM
To: Keith Packard
Cc: intel-gfx at lists.freedesktop.org
Subject: Re: [Intel-gfx] GEM object write

On Tue, 2009-03-31 at 00:32 +0800, Keith Packard wrote:
> On Mon, 2009-03-30 at 09:19 +0800, Shaohua Li wrote:
> > Hi,
> > I recently did some benchmarks with different GEM object write methods
> > 
> > 1. bo_map.
> > This approach will memory map the gem object to write-back, and then
> > flush cache to memory. I did a benchmark to compare mapping memory to
> > write-back (and then clflush cache) and write-combine. In my test,
> > write-combine is about 3 times faster than the write-back (please try
> > with attached module). The data is quite stable in my test.
> > 
> > 2. pwrite
> > pwrite will almost always map the gem object as write-combine (if the gem
> > object is in the gtt, which is almost always true in the general case), but
> > it adds an extra copy. In my XVMC test, the pwrite approach causes a 20%
> > performance loss.
> > 
> > 3. bo_map_gtt
> > this approach will bind gem object to gtt and map object as
> > write-combine. This is the fastest approach and equal to the performance
> > without GEM, but the object should be bound to gtt and can't be swapped
> > out as the mapping is for a device.
> 
> Your example is biased in favor of WC mapping as it only writes 1 byte
> in 64 to the object. I propose a slightly different test which would
> model how we expect rendering operations to access memory (at least for
> streaming data from CPU to GPU):
> 
>      1. Allocate a large pool of memory to simulate pages behind the
>         aperture
>      2. For pwrite mappings:
>              1. allocate a small (32kB) pool of memory
>              2. Write data to the small buffer
>              3. Copy that data to the "aperture"
>              4. clflush
>      3. For WB mappings
>              1. Write data to the "aperture"
>              2. clflush
>      4. For WC mappings
>              1. Write data to the "aperture"
> 
> In each case, the writes should be 4 bytes, aligned on 4-byte
> boundaries, and the writes should fill the nominal buffer size (32kB),
> and you should use a different section of the aperture, as a streaming
> application would.
> 
> Given that WC mapping is only 3x faster than WB mapping + clflush, when
> writing only 1/64 of a cache line each time, I think it will be
> interesting to see how this works when writing the full amount of data.
Just tried the 4-byte access. The result for WB/WC mapping hasn't changed:
WC mapping is still about 3x faster than WB mapping + clflush. Please
give it a try.
I'll do a benchmark for the pwrite mapping later.
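
For readers without the attachment, the WB and pwrite-style cases of the procedure proposed above can be sketched in user space roughly as follows. This is only a sketch under stated assumptions, not the attached test: all function names are hypothetical, the cache line size is assumed to be 64 bytes, clflush is issued through the SSE2 `_mm_clflush` intrinsic (skipped when SSE2 is unavailable), and the WC case is left out because a write-combine mapping cannot be set up from plain user-space memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#endif

#define CHUNK_BYTES 32768u                          /* nominal buffer size */
#define CHUNK_WORDS (CHUNK_BYTES / sizeof(uint32_t))
#define CACHE_LINE  64u                             /* assumed line size */

/* Flush every cache line touching [addr, addr+len); returns line count. */
static size_t flush_range(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;

    for (; p < end; p += CACHE_LINE) {
#if defined(__SSE2__)
        _mm_clflush((const void *)p);
#endif
        lines++;
    }
#if defined(__SSE2__)
    _mm_mfence();   /* make sure the flushes have completed */
#endif
    return lines;
}

/* WB case: aligned 4-byte writes straight into the "aperture", then clflush. */
static size_t write_chunk_wb(uint32_t *dst, uint32_t value)
{
    for (size_t i = 0; i < CHUNK_WORDS; i++)
        dst[i] = value;
    return flush_range(dst, CHUNK_BYTES);
}

/* pwrite-style case: write a small staging buffer, copy it, then clflush. */
static size_t write_chunk_staged(uint32_t *dst, uint32_t *staging,
                                 uint32_t value)
{
    for (size_t i = 0; i < CHUNK_WORDS; i++)
        staging[i] = value;
    memcpy(dst, staging, CHUNK_BYTES);
    return flush_range(dst, CHUNK_BYTES);
}

/*
 * Walk the pool one chunk at a time so each write hits a different
 * section, as a streaming application would. Pass staging == NULL for
 * the WB case. Returns the total number of cache lines flushed.
 */
static size_t stream_pool(uint32_t *pool, size_t pool_bytes,
                          uint32_t *staging, uint32_t value)
{
    size_t lines = 0;

    assert(pool != NULL);
    for (size_t off = 0; off + CHUNK_BYTES <= pool_bytes; off += CHUNK_BYTES) {
        uint32_t *dst = pool + off / sizeof(uint32_t);
        lines += staging ? write_chunk_staged(dst, staging, value)
                         : write_chunk_wb(dst, value);
    }
    return lines;
}
```

Timing `stream_pool` with and without a staging buffer over a pool much larger than the CPU caches would give a rough WB versus pwrite-style comparison; the absolute numbers will not match a real GTT mapping.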

Thanks,
Shaohua
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: test.c
URL: <http://lists.freedesktop.org/archives/intel-gfx/attachments/20090331/6022a569/attachment.c>