[PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86

Sat Dec 13 20:15:22 PST 2014

On Sat, Dec 13, 2014 at 7:08 PM, Ben Widawsky
<benjamin.widawsky at intel.com> wrote:
> Any GEM driver which has very large objects and a slow CPU is subject to very
> long waits simply for clflushing incoherent objects. Generally, each individual
> object is not a problem, but if you have very large objects, or very many
> objects, the flushing begins to show up in profiles. Because on x86 we know the
> cache size, we can easily determine when an object will use all the cache, and
> forgo iterating over each cacheline.
>
> We need to be careful when using wbinvd. wbinvd() is itself potentially slow
> because it requires synchronizing the flush across all CPUs so they have a
> coherent view of memory. This can result in either stalling work being done on
> other CPUs, or this call itself stalling while waiting for a CPU to accept the
> interrupt. wbinvd() also has the downside of invalidating all cachelines,
> so we don't want to use it unless we're sure we already own most of the
> cachelines.
>
> The current algorithm is very naive. I think it can be tweaked more, and it
> would be good if someone else gave it some thought. I am pretty confident that in
> i915 we can even skip the IPI in the execbuf path with minimal code change (or
> perhaps just some verifying of the existing code). It would be nice to hear what
> other developers who depend on this code think.
>
> Cc: Intel GFX <intel-gfx at lists.freedesktop.org>
> Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> ---
>  drivers/gpu/drm/drm_cache.c | 20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
> index d7797e8..6009c2d 100644
> --- a/drivers/gpu/drm/drm_cache.c
> +++ b/drivers/gpu/drm/drm_cache.c
> @@ -64,6 +64,20 @@ static void drm_cache_flush_clflush(struct page *pages[],
>                 drm_clflush_page(*pages++);
>         mb();
>  }
> +
> +static bool
> +drm_cache_should_clflush(unsigned long num_pages)
> +{
> +       const int cache_size = boot_cpu_data.x86_cache_size;
> +
> +       /* For now the algorithm simply checks if the number of pages to be
> +        * flushed is greater than the entire system cache. One could make the
> +        * function more aware of the actual system (i.e. if SMP, how large is
> +        * the cache, CPU freq. etc.). All those help to determine when to
> +        * wbinvd(). */
> +       WARN_ON_ONCE(!cache_size);
> +       return !cache_size || num_pages < (cache_size >> 2);
> +}
>  #endif
>
>  void
> @@ -71,7 +85,7 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
>  {
>
>  #if defined(CONFIG_X86)
> -       if (cpu_has_clflush) {
> +       if (cpu_has_clflush && drm_cache_should_clflush(num_pages)) {
>                 drm_cache_flush_clflush(pages, num_pages);
>                 return;
>         }
> @@ -104,7 +118,7 @@ void
>  drm_clflush_sg(struct sg_table *st)
>  {
>  #if defined(CONFIG_X86)
> -       if (cpu_has_clflush) {
> +       if (cpu_has_clflush && drm_cache_should_clflush(st->nents)) {
>                 struct sg_page_iter sg_iter;
>
>                 mb();
> @@ -128,7 +142,7 @@ void
>  drm_clflush_virt_range(void *addr, unsigned long length)
>  {
>  #if defined(CONFIG_X86)
> -       if (cpu_has_clflush) {
> +       if (cpu_has_clflush && drm_cache_should_clflush(length / PAGE_SIZE)) {

If length isn't a multiple of PAGE_SIZE, doesn't this ignore the
remainder? Should it round length up to the next whole page instead,
with something like DIV_ROUND_UP(length, PAGE_SIZE)?
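
To illustrate the difference, here is a minimal sketch in plain C, outside
the kernel. PAGE_SIZE is hard-coded to 4 KiB as an assumption, DIV_ROUND_UP
is redefined locally to match the kernel helper of the same name, and the
two function names are hypothetical, not part of the patch:

```c
/* Assumption for illustration: 4 KiB pages. In the kernel, PAGE_SIZE
 * comes from the architecture headers. */
#define PAGE_SIZE 4096UL

/* Local copy of the kernel's DIV_ROUND_UP() helper. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* What the patch passes today: a partial trailing page is dropped. */
static unsigned long pages_truncated(unsigned long length)
{
	return length / PAGE_SIZE;
}

/* Rounded up: a partial trailing page still counts as one page. */
static unsigned long pages_rounded_up(unsigned long length)
{
	return DIV_ROUND_UP(length, PAGE_SIZE);
}
```

With a length of 4097 bytes the first gives 1 page and the second gives 2.
Neither accounts for an unaligned addr, which could make the range touch
one more page.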