[Intel-gfx] [PATCH] drm/i915: Avoid superfluous invalidation of CPU cache lines

Chris Wilson chris at chris-wilson.co.uk
Thu Nov 26 02:57:13 PST 2015


On Thu, Nov 26, 2015 at 09:09:37AM +0530, Goel, Akash wrote:
> 
> 
> On 11/25/2015 10:58 PM, Chris Wilson wrote:
> >On Wed, Nov 25, 2015 at 01:02:20PM +0200, Ville Syrjälä wrote:
> >>On Tue, Nov 24, 2015 at 10:39:38PM +0000, Chris Wilson wrote:
> >>>On Tue, Nov 24, 2015 at 07:14:31PM +0100, Daniel Vetter wrote:
> >>>>On Tue, Nov 24, 2015 at 12:04:06PM +0200, Ville Syrjälä wrote:
> >>>>>On Tue, Nov 24, 2015 at 03:35:24PM +0530, akash.goel at intel.com wrote:
> >>>>>>From: Akash Goel <akash.goel at intel.com>
> >>>>>>
> >>>>>>When the object is moved out of the CPU read domain, the cachelines
> >>>>>>are not invalidated immediately. The invalidation is deferred until
> >>>>>>the next time the object is brought back into the CPU read domain.
> >>>>>>But the invalidation is done unconditionally, i.e. even for the case
> >>>>>>where the cachelines were already flushed when the object moved out
> >>>>>>of the CPU write domain. Avoiding that superfluous invalidation is
> >>>>>>the optimization made here. This is not just a hypothetical case,
> >>>>>>though it is unlikely to occur often. The aim is to detect changes
> >>>>>>to the backing storage whilst the data is potentially in the CPU
> >>>>>>cache, and only clflush in those cases.
> >>>>>>
> >>>>>>Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>>>Signed-off-by: Akash Goel <akash.goel at intel.com>
> >>>>>>---
> >>>>>>  drivers/gpu/drm/i915/i915_drv.h | 1 +
> >>>>>>  drivers/gpu/drm/i915/i915_gem.c | 9 ++++++++-
> >>>>>>  2 files changed, 9 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>>diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> >>>>>>index df9316f..fedb71d 100644
> >>>>>>--- a/drivers/gpu/drm/i915/i915_drv.h
> >>>>>>+++ b/drivers/gpu/drm/i915/i915_drv.h
> >>>>>>@@ -2098,6 +2098,7 @@ struct drm_i915_gem_object {
> >>>>>>  	unsigned long gt_ro:1;
> >>>>>>  	unsigned int cache_level:3;
> >>>>>>  	unsigned int cache_dirty:1;
> >>>>>>+	unsigned int cache_clean:1;
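
(The i915_gem.c side of the patch is not quoted above. Purely as an
illustrative sketch of the idea, and not the actual hunk, the new bit would
be tested, and then cleared again, when the object is next pulled back into
the CPU read domain, roughly along these lines:)

int i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj,
				      bool write)
{
	/* Sketch only: the waiting, GTT write-domain flushing and other
	 * domain bookkeeping of the real function are omitted.
	 */
	if ((obj->base.read_domains & I915_GEM_DOMAIN_CPU) == 0) {
		/* Invalidate stale cachelines only if some may still exist;
		 * skip the clflush if the object was completely flushed the
		 * last time it left the CPU write domain.
		 */
		if (!obj->cache_clean)
			i915_gem_clflush_object(obj, false);

		obj->base.read_domains |= I915_GEM_DOMAIN_CPU;
	}

	/* The CPU may now pull lines of this object back into its cache,
	 * so it can no longer be assumed clean.
	 */
	obj->cache_clean = false;

	if (write) {
		obj->base.read_domains = I915_GEM_DOMAIN_CPU;
		obj->base.write_domain = I915_GEM_DOMAIN_CPU;
	}

	return 0;
}
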
> >>>>>
> >>>>>So now we have cache_dirty and cache_clean which seems redundant,
> >>>>>except somehow cache_dirty != !cache_clean?
> >>>
> >>>Exactly, not entirely redundant. I did think something along MESI lines
> >>>would be useful, but that didn't capture the different meanings we
> >>>employ.
> >>>
> >>>cache_dirty tracks whether we have been eliding the clflush.
> >>>
> >>>cache_clean tracks whether we know the cache has been completely
> >>>clflushed.
> >>
> >>Can we know that with speculative prefetching and whatnot?
> >
> >"The memory attribute of the page containing the affected line has no
> >effect on the behavior of this instruction. It should be noted that
> >processors are free to speculatively fetch and cache data from system
> >memory regions assigned a memory-type allowing for speculative reads
> >(i.e. WB, WC, WT memory types). The Streaming SIMD Extensions PREFETCHh
> >instruction is considered a hint to this speculative behavior. Because
> >this speculative fetching can occur at any time and is not tied to
> >instruction execution, CLFLUSH is not ordered with respect to PREFETCHh
> >or any of the speculative fetching mechanisms (that is, data could be
> >speculatively loaded into the cache just before, during, or after the
> >execution of a CLFLUSH to that cache line)."
> >
> >which, taken to the extreme, means that we can't get away with this trick.
> >
> >If we can at least guarantee that such speculation can't extend beyond
> >a page boundary, that will be enough to assert that the patch is valid.
> >
> >Hopefully someone knows a CPU guru or two.
> 
> Found some relevant info at the link https://lwn.net/Articles/255364/
> 
> An excerpt from the same link, from the section on hardware prefetching:
> "Prefetching has one big weakness: it cannot cross page boundaries.
> The reason should be obvious when one realizes that the CPUs support
> demand paging. If the prefetcher were allowed to cross page
> boundaries, the access might trigger an OS event to make the page
> available. This by itself can be bad, especially for performance.
> What is worse is that the prefetcher does not know about the
> semantics of the program or the OS itself. It might therefore
> prefetch pages which, in real life, never would be requested. That
> means the prefetcher would run past the end of the memory region the
> processor accessed in a recognizable pattern before. This is not
> only possible, it is very likely. If the processor, as a side effect
> of a prefetch, triggered a request for such a page the OS might even
> be completely thrown off its tracks if such a request could never
> otherwise happen.
> 
> It is therefore important to realize that, regardless of how good
> the prefetcher is at predicting the pattern, the program will
> experience cache misses at page boundaries unless it explicitly
> prefetches or reads from the new page. This is another reason to
> optimize the layout of data as described in Section 6.2 to minimize
> cache pollution by keeping unrelated data out.
> 
> Because of this page limitation the processors do not have terribly
> sophisticated logic to recognize prefetch patterns. With the still
> predominant 4k page size there is only so much which makes sense.
> The address range in which strides are recognized has been increased
> over the years, but it probably does not make much sense to go
> beyond the 512 byte window which is often used today. Currently
> prefetch units do not recognize non-linear access patterns."

How best to summarise? Add something like
 * ...
 * After clflushing we know that this object cannot be in the
 * CPU cache, nor can it be speculatively loaded into the CPU
 * cache as our objects are page-aligned (and speculation cannot
 * cross page boundaries). Whilst this flag is set, we know that
 * any future access to the object's pages will miss the stale
 * cache and have to be serviced from main memory, i.e. we do
 * not need another clflush to invalidate the CPU cache.
?
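
(For placement, a sketch against a much simplified i915_gem_clflush_object();
the real function has more early-outs than shown here:)

bool i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force)
{
	/* Sketch: the early-outs for snooped, stolen, etc. objects are
	 * omitted.
	 */
	if (obj->pages == NULL)
		return false;

	trace_i915_gem_object_clflush(obj);
	drm_clflush_sg(obj->pages);
	obj->cache_dirty = false;

	/* [the comment proposed above goes here] */
	obj->cache_clean = true;

	return true;
}
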
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

