[Intel-gfx] gem clflush optimization for media encoding

Zou, Nanhai nanhai.zou at intel.com
Wed Jun 22 06:37:03 CEST 2011



>>-----Original Message-----
>>From: intel-gfx-bounces+nanhai.zou=intel.com at lists.freedesktop.org
>>[mailto:intel-gfx-bounces+nanhai.zou=intel.com at lists.freedesktop.org] On
>>Behalf Of Zou, Nanhai
>>Sent: June 22, 2011 12:29
>>To: Keith Packard; intel-gfx at lists.freedesktop.org
>>Cc: Anholt, Eric
>>Subject: Re: [Intel-gfx] gem clflush optimization for media encoding
>>
>>
>>
>>>>-----Original Message-----
>>>>From: Keith Packard [mailto:keithp at keithp.com]
>>>>Sent: June 22, 2011 12:14
>>>>To: Zou, Nanhai; intel-gfx at lists.freedesktop.org
>>>>Cc: Anholt, Eric
>>>>Subject: Re: [Intel-gfx] gem clflush optimization for media encoding
>>>>
>>>>On Wed, 22 Jun 2011 11:13:09 +0800, "Zou, Nanhai" <nanhai.zou at intel.com> wrote:
>>>>
>>>>> 	If I upload the input buffer with movnti or movntdq (bypassing
>>>>> 	the cache) plus an sfence (to drain the write-combining buffers)
>>>>> 	at the end, clflush should not be needed.
>>>>
>>>>Alas, neither of these will flush existing cached data, so you must
>>>>still use clflush to ensure that the data makes it out to memory. All
>>>>that they do is avoid consuming additional cache lines.
>>>>
>>  As I understand it, with movnti + sfence the data should surely reach
>>memory. The cache should be coherent in this case.
>>
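For reference, the upload pattern under discussion can be sketched as below. This is only a minimal illustration, not code from any driver: the helper name upload_nontemporal is made up, and it assumes dst is 16-byte aligned and n is a multiple of 16. Whether this pattern really makes clflush unnecessary when stale lines for dst may already be in the cache is exactly the point in dispute above; the sketch only shows the movntdq + sfence sequence itself.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/*
 * Hypothetical helper (not from the thread): copy n bytes from src to dst
 * using non-temporal stores.
 * Assumptions: dst is 16-byte aligned, n is a multiple of 16.
 */
static void upload_nontemporal(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < n / 16; i++) {
        __m128i v = _mm_loadu_si128(s + i);  /* ordinary (cached) load */
        _mm_stream_si128(d + i, v);          /* movntdq: non-temporal store */
    }

    /* sfence: order the non-temporal stores before any later stores. */
    _mm_sfence();
}
```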
>>>>You want to use a write combining mapping, which should give you full
>>>>bandwidth access to memory without hitting any caches. You can use the GTT
>>>>mapping as the aperture is configured for write combining access, or we
>>>>can figure out how to make PAT work.
>>>>
>>	map_gtt in the current GEM is very slow.
>>	I've tried map_gtt, but the speed is unacceptable.
>>
>>>>> 	Since it is a CPU read-only surface, clflush is not needed at all.
>>>>
>>>>You'd still have to invalidate cache lines using clflush to avoid using
>>>>stale data in the CPU cache.
>>>>
>>>>--
>>  Yes, you are right, in this case clflush is still needed to invalidate the
>>CPU cache.
>>
>>  The problem is that we do not know how large the coded output buffer will
>>be before we do the encoding.
>>  So we have to allocate a large enough gem object before encoding. In most
>>cases the encoding result is less than 1/10 of the safe buffer size, so 9/10
>>of the buffer is unnecessarily clflushed.
>>
>>  A fast map_gtt implementation could be the best choice here.
>>
	Or can we clflush cache line by cache line while reading, instead of flushing the entire object?
	This optimization will give a >40% speedup for 1080p encoding.
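	To make the idea concrete, here is a rough userspace sketch, not the actual GEM code path. The function name read_with_clflush and the 64-byte cache line size are assumptions for illustration. It flushes only the lines covering the bytes the encoder actually produced, fences, and then reads them.

```c
#include <emmintrin.h>  /* SSE2: _mm_clflush, _mm_mfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHELINE 64u   /* assumed cache line size */

/*
 * Hypothetical helper: invalidate only the cache lines that cover
 * buf[0..used) and then copy those bytes to out.
 * Returns the number of cache lines flushed.
 */
static size_t read_with_clflush(void *out, const void *buf, size_t used)
{
    /* Round the start down to a cache line boundary. */
    uintptr_t start = (uintptr_t)buf & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)buf + used;
    size_t lines = 0;

    for (uintptr_t a = start; a < end; a += CACHELINE) {
        _mm_clflush((const void *)a);  /* evict this line from the CPU caches */
        lines++;
    }

    /* Make sure the flushes complete before the loads below. */
    _mm_mfence();

    memcpy(out, buf, used);
    return lines;
}
```

	With a 1 MiB safe buffer and 100 KiB of coded output, this touches 1600 lines instead of 16384, which matches the roughly 1/10 ratio mentioned above.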

>>Thanks
>>Zou Nanhai
>>
>>>>keith.packard at intel.com

