[Intel-gfx] performance movnti is better than clflush ?

Fri Jul 24 13:20:46 CEST 2009

Hi All,
I find that movnti + mfence is faster than clflush, as the report below shows (on a Core 2 platform):

Size (bytes)   movnti (us)   clflush (us)   speedup
4k                 3.01           3.56       1.182
16k               12.01          14.23       1.184
32k               23.93          28.45       1.188
64k               47.92          56.89       1.187

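The code for the two cases (addr is assumed to be cache-line aligned):

movnti + mfence:
    for (i = 0; i < size; i += 64) {
        /* read one quadword, then write it back with a non-temporal store */
        __asm__ __volatile__("movq (%0), %%rax\n\t"
                             "movntiq %%rax, (%0)"
                             : : "r" (addr + i) : "rax", "memory");
    }
    __asm__ __volatile__("mfence" ::: "memory");

clflush:
    for (i = 0; i < size; i += 64)
        clflush(addr + i);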

Movnti invalidates the cache line before writing the data into the write-combining buffer.
A final mfence then drains whatever data is left in the write-combining buffer, so the overall behavior looks like clflush.
The approach only suits small buffers: once the size is bigger than about 128k (on my platform),
movnti + mfence gets worse because of the read instruction in the loop.
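
For reference, the two loops described above could also be written with SSE2 compiler intrinsics instead of inline assembly. This is only a sketch under assumptions: it targets x86-64 with GCC or Clang, it assumes a 64-byte cache line, and the names flush_movnti, flush_clflush and flush_selftest are hypothetical helpers made up for illustration, not existing GEM or kernel functions. It shows the access pattern only and does not measure anything.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_stream_si64, _mm_clflush, _mm_mfence (SSE2, x86-64) */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache-line size, matching the 64-byte stride above */

/* Hypothetical helper: read one quadword per cache line and write the same
 * value back with a non-temporal store (movnti), then fence once at the end. */
static void flush_movnti(long long *addr, size_t size)
{
    assert((uintptr_t)addr % CACHE_LINE == 0);  /* only alignment is checked */
    for (size_t i = 0; i < size; i += CACHE_LINE) {
        long long *q = addr + i / sizeof(long long);
        _mm_stream_si64(q, *q);                 /* load, then movnti */
    }
    _mm_mfence();
}

/* Hypothetical helper: the clflush baseline, one clflush per cache line.
 * No fence is issued here because the loop above does not show one. */
static void flush_clflush(long long *addr, size_t size)
{
    assert((uintptr_t)addr % CACHE_LINE == 0);
    for (size_t i = 0; i < size; i += CACHE_LINE)
        _mm_clflush((char *)addr + i);
}

/* Hypothetical self-check: both helpers must leave the buffer contents
 * unchanged. size must be a multiple of CACHE_LINE.
 * Returns 1 if the contents were preserved, 0 otherwise. */
static int flush_selftest(size_t size)
{
    size_t n = size / sizeof(long long);
    long long *buf = aligned_alloc(CACHE_LINE, size);
    int ok = 1;

    if (buf == NULL)
        return 0;
    for (size_t i = 0; i < n; i++)
        buf[i] = (long long)i;

    flush_movnti(buf, size);
    flush_clflush(buf, size);

    for (size_t i = 0; i < n; i++)
        if (buf[i] != (long long)i)
            ok = 0;
    free(buf);
    return ok;
}
```

Calling flush_selftest(4096) only confirms that the movnti variant rewrites each line with its own value and so leaves the data intact. Timing the two helpers over the sizes in the table would need a separate harness.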

If this theory is right, many of the flush operations in GEM could benefit from it.

Thanks
Ma Ling
