[Intel-gfx] GEM object write

Shaohua Li shaohua.li at intel.com
Tue Mar 31 10:43:59 CEST 2009


On Tue, 2009-03-31 at 14:56 +0800, Ma, Ling wrote:
> Hi,
> 
> I wrote another test program based on the original one.
> 
> The test result shows WB is faster than WC: WC/WB is about 8369/4421.
> In this file I use the movnti instruction for writes in order to avoid most of the clflush instructions.
This doesn't mean WB is faster. movnti actually uses the write-combining
protocol; it's not WB.

With your test case, my result shows WC losing by about 8% (insmod
test.ko phy_size=0x8000 times=10000), not as big a gap as yours. It's
still a mystery to me why movnti is faster than WC, considering both
of them are doing write-combining.

pwrite uses movnti, but in my test it's not fast, as it involves an
extra copy.

I wonder how we could use movnti for writes through a GEM object mapping.
In the common case, an application just maps a GEM object and writes to it
with ordinary store instructions. To use movnti, we would have to provide
an API and force applications to use it.
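
As a rough illustration of what such an API might look like, here is a
minimal user-space sketch. The name gem_map_write_nt is hypothetical and
not part of GEM or libdrm. It assumes an x86 build with SSE2, where the
_mm_stream_si32 intrinsic compiles to movnti; on other builds it falls
back to memcpy. The destination pointer stands in for a mapped GEM object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si32 (movnti) and _mm_sfence */
#endif

/*
 * Hypothetical helper (an assumption for illustration only): copy `bytes`
 * bytes from src to a mapped object at dst using 4-byte non-temporal stores.
 * dst must be 4-byte aligned and bytes must be a multiple of 4.
 */
static void gem_map_write_nt(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst & 3) == 0);
    assert((bytes & 3) == 0);

#if defined(__SSE2__)
    int *d = (int *)dst;
    const unsigned char *s = (const unsigned char *)src;

    for (size_t i = 0; i < bytes / 4; i++) {
        int v;
        memcpy(&v, s + 4 * i, sizeof v);  /* alignment-safe load from src */
        _mm_stream_si32(&d[i], v);        /* non-temporal 4-byte store */
    }
    _mm_sfence();  /* order the streamed stores before later stores */
#else
    memcpy(dst, src, bytes);  /* fallback: ordinary stores on non-SSE2 builds */
#endif
}
```

An application would then call gem_map_write_nt(map, data, size) instead
of storing through the mapping directly, which is exactly the cost of
forcing applications onto a new API.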

Thanks,
Shaohua
> Maybe we can do some optimization on it.
> 
> Thanks
> Ma Ling 
> 
> -----Original Message-----
> From: intel-gfx-bounces at lists.freedesktop.org [mailto:intel-gfx-bounces at lists.freedesktop.org] On Behalf Of Shaohua Li
> Sent: Tuesday, March 31, 2009 9:50 AM
> To: Keith Packard
> Cc: intel-gfx at lists.freedesktop.org
> Subject: Re: [Intel-gfx] GEM object write
> 
> On Tue, 2009-03-31 at 00:32 +0800, Keith Packard wrote:
> > On Mon, 2009-03-30 at 09:19 +0800, Shaohua Li wrote:
> > > Hi,
> > > I recently did some benchmarks with different GEM object write methods
> > > 
> > > 1. bo_map.
> > > This approach will memory map the gem object to write-back, and then
> > > flush cache to memory. I did a benchmark to compare mapping memory to
> > > write-back (and then clflush cache) and write-combine. In my test,
> > > write-combine is about 3 times faster than the write-back (please try
> > > with attached module). The data is quite stable in my test.
> > > 
> > > 2. pwrite
> > > pwrite will almost always map the gem object as write-combine (if the gem
> > > object is in the gtt, which is almost always true in the general case), but
> > > it adds an extra copy. In my XVMC test, the pwrite approach causes a 20%
> > > performance loss.
> > > 
> > > 3. bo_map_gtt
> > > this approach will bind gem object to gtt and map object as
> > > write-combine. This is the fastest approach and equal to the performance
> > > without GEM, but the object should be bound to gtt and can't be swapped
> > > out as the mapping is for a device.
> > 
> > Your example is biased in favor of WC mapping as it only writes 1 byte
> > in 64 to the object. I propose a slightly different test which would
> > model how we expect rendering operations to access memory (at least for
> > streaming data from CPU to GPU):
> > 
> >      1. Allocate a large pool of memory to simulate pages behind the
> >         aperture
> >      2. For pwrite mappings:
> >              1. allocate a small (32kB) pool of memory
> >              2. Write data to the small buffer
> >              3. Copy that data to the "aperture"
> >              4. clflush
> >      3. For WB mappings
> >              1. Write data to the "aperture"
> >              2. clflush
> >      4. For WC mappings
> >              1. Write data to the "aperture"
> > 
> > In each case, the writes should be 4 bytes, aligned on 4-byte
> > boundaries, and the writes should fill the nominal buffer size (32kB),
> > and you should use a different section of the aperture, as a streaming
> > application would.
> > 
> > Given that WB mapping + clflush is only 3x slower than WC mapping when
> > writing only 1/64 of a cache line each time, I think it will be
> > interesting to see how this works when writing the full amount of data.
> I just tried the 4-byte access. The result for WB/WC mapping hasn't changed:
> WC mapping is still about 3x faster than WB mapping + clflush. Please
> give it a try.
> I'll do a benchmark for pwrite mapping later.
> 
> Thanks,
> Shaohua



