[Intel-gfx] GEM object write

Keith Packard keithp at keithp.com
Mon Mar 30 18:32:28 CEST 2009


On Mon, 2009-03-30 at 09:19 +0800, Shaohua Li wrote:
> Hi,
> I recently did some benchmarks with different GEM object write methods
> 
> 1. bo_map.
> This approach will memory map the gem object to write-back, and then
> flush cache to memory. I did a benchmark to compare mapping memory to
> write-back (and then clflush cache) and write-combine. In my test,
> write-combine is about 3 times faster than the write-back (please try
> with attached module). The data is quite stable in my test.
> 
> 2. pwrite
> pwrite will almost always map the gem object as write-combine (if the gem
> object is in the gtt, which is almost always true in the general case), but
> it adds an extra copy. In my XVMC test, the pwrite approach causes a 20%
> performance loss.
> 
> 3. bo_map_gtt
> this approach will bind gem object to gtt and map object as
> write-combine. This is the fastest approach and equal to the performance
> without GEM, but the object should be bound to gtt and can't be swapped
> out as the mapping is for a device.

Your example is biased in favor of WC mapping as it only writes 1 byte
in 64 to the object. I propose a slightly different test which would
model how we expect rendering operations to access memory (at least for
streaming data from CPU to GPU):

     1. Allocate a large pool of memory to simulate pages behind the
        aperture
     2. For pwrite mappings:
             1. allocate a small (32kB) pool of memory
             2. Write data to the small buffer
             3. Copy that data to the "aperture"
             4. clflush
     3. For WB mappings
             1. Write data to the "aperture"
             2. clflush
     4. For WC mappings
             1. Write data to the "aperture"

In each case, the writes should be 4 bytes wide, aligned on 4-byte
boundaries, and should fill the whole nominal buffer (32kB). Each
buffer should go to a different section of the aperture, as it would
in a streaming application.
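
As a rough illustration of the pwrite and WB cases above, here is a
minimal user-space sketch in C. It is not the attached module and not
a definitive benchmark. All names in it (fill_words, flush_range,
pwrite_model, wb_model, stream_buffers) are made up for this example,
and the 64-byte cache line is an assumption. The WC case is left out
because a write-combining mapping has to come from the kernel and
cannot be set up from plain malloc'd memory. On builds without SSE2
the flush becomes a no-op, so timings are only meaningful on x86.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#if defined(__SSE2__)
#include <emmintrin.h>          /* _mm_clflush, _mm_mfence */
#endif

enum { BUF_SIZE = 32 * 1024,    /* nominal buffer size from the text */
       CACHE_LINE = 64 };       /* assumed cache line size */

/* Flush every cache line covering [p, p + len). */
static void flush_range(const void *p, size_t len)
{
#if defined(__SSE2__)
    const char *c = (const char *)((uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1));
    const char *end = (const char *)p + len;
    for (; c < end; c += CACHE_LINE)
        _mm_clflush(c);
    _mm_mfence();
#else
    (void)p; (void)len;         /* no clflush available: flush is skipped */
#endif
}

/* Fill the buffer with aligned 4-byte writes; volatile keeps them as
 * individual 32-bit stores. */
static void fill_words(volatile uint32_t *dst, size_t bytes, uint32_t seed)
{
    for (size_t i = 0; i < bytes / sizeof(uint32_t); i++)
        dst[i] = seed + (uint32_t)i;
}

/* pwrite case: write to a small staging buffer, copy, then clflush. */
static void pwrite_model(uint32_t *slot, uint32_t *staging, uint32_t seed)
{
    fill_words(staging, BUF_SIZE, seed);
    memcpy(slot, staging, BUF_SIZE);
    flush_range(slot, BUF_SIZE);
}

/* WB case: write straight into the "aperture", then clflush. */
static void wb_model(uint32_t *slot, uint32_t seed)
{
    fill_words(slot, BUF_SIZE, seed);
    flush_range(slot, BUF_SIZE);
}

/* Stream `count` buffers into successive 32kB sections of the
 * "aperture". A non-NULL staging buffer selects the pwrite case,
 * NULL selects the WB case. Returns elapsed seconds. */
static double stream_buffers(uint32_t *aperture, size_t aperture_bytes,
                             uint32_t *staging, size_t count)
{
    size_t slots = aperture_bytes / BUF_SIZE;
    struct timespec t0, t1;

    assert(slots > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t n = 0; n < count; n++) {
        uint32_t *slot = aperture + (n % slots) * (BUF_SIZE / sizeof(uint32_t));
        if (staging)
            pwrite_model(slot, staging, (uint32_t)n);
        else
            wb_model(slot, (uint32_t)n);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

To compare the two cases, allocate an "aperture" much larger than the
last-level cache, call stream_buffers() once with a 32kB staging
buffer and once with NULL, and compare the returned times for the
same buffer count.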

Given that WC mapping is only 3x faster than WB mapping + clflush, even
when writing only 1/64 of a cache line each time, I think it will be
interesting to see how this works when writing the full amount of data.

> Since the real cause of the performance loss is the memory mapping type, I
> suggest adding a new API which is like bo_map_gtt but doesn't bind the
> object to the GTT, that is, it just does a write-combine map. This still
> has the unswappable issue, but gives a fast API. Any ideas?

We could convert our pages to WC if that makes sense, although the
conversion process itself is quite expensive as it requires inter-core
interrupts to synchronize page tables.

-- 
keith.packard at intel.com