[Mesa-dev] [PATCH] anv/query: Busy-wait for available query entries

Wed Apr 5 08:27:15 UTC 2017

On Tue, Apr 04, 2017 at 07:21:38PM -0700, Jason Ekstrand wrote:
> Before, we were just looking at whether or not the user wanted us to
> wait and waiting on the BO.  This instead makes us busy-loop on each
> query until it's available.  This reduces some of the pipeline bubbles
> we were getting and improves performance of The Talos Principle on
> medium settings (where the GPU isn't overloaded) by around 20% on my
> SkyLake gt4.

Hmm. The kernel also spins, but it limits itself to spinning only on the
active request, and for a max of 2us within your process's timeslice.
The ioctl overhead is ~100ns in this case, cheaper than a call to
clock_gettime()! It looks like the advantage here is that you do not
limit yourself that way. A much simpler loop doing the same would be

	while (true) {
		if (query_is_available())
			return VK_SUCCESS;

		if (!gem_busy())
			return query_is_available() ? VK_SUCCESS : VK_NOT_READY;
	}
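
To make the control flow of that loop concrete, here is a self-contained
sketch that runs it against a simulated GPU. Everything except the loop
shape is an assumption made for illustration: struct fake_gpu, the
sketch_result enum (standing in for VK_SUCCESS / VK_NOT_READY) and the
gem_busy() stub are hypothetical, and the real code would poll the query
slot in the BO and call the i915 busy ioctl instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for VK_SUCCESS / VK_NOT_READY. */
enum sketch_result { SKETCH_SUCCESS, SKETCH_NOT_READY };

/* Hypothetical simulated device, not a real driver structure. */
struct fake_gpu {
    volatile uint64_t slot; /* availability word the GPU would write */
    int busy_polls_left;    /* busy queries left before the BO goes idle */
    bool writes_slot;       /* does the work write the slot before retiring? */
};

/* Mirrors query_is_available(): nonzero slot means the result landed. */
static bool query_is_available(const struct fake_gpu *gpu)
{
    return gpu->slot != 0;
}

/* Stub for the busy ioctl: each call consumes one unit of simulated
 * work; when the last unit retires, the slot write becomes visible. */
static bool gem_busy(struct fake_gpu *gpu)
{
    if (gpu->busy_polls_left > 0 && --gpu->busy_polls_left == 0 &&
        gpu->writes_slot)
        gpu->slot = 1;
    return gpu->busy_polls_left > 0;
}

/* The loop from the mail: spin on availability, and use the busy query
 * only to detect that the result can no longer arrive. */
static enum sketch_result wait_for_available(struct fake_gpu *gpu)
{
    assert(gpu != NULL);

    while (true) {
        if (query_is_available(gpu))
            return SKETCH_SUCCESS;

        /* Once the BO is idle, re-check a final time: the write may
         * have landed between the check above and the busy query. */
        if (!gem_busy(gpu))
            return query_is_available(gpu) ? SKETCH_SUCCESS
                                           : SKETCH_NOT_READY;
    }
}
```

In this sketch there is no clock_gettime() and no timeout bookkeeping;
the busy query is the only exit condition besides availability, which is
the point of the simplification.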

> ---
>  src/intel/vulkan/genX_query.c | 72 +++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 66 insertions(+), 6 deletions(-)
> 
> diff --git a/src/intel/vulkan/genX_query.c b/src/intel/vulkan/genX_query.c
> index 7ea9404..ebf99d2 100644
> --- a/src/intel/vulkan/genX_query.c
> +++ b/src/intel/vulkan/genX_query.c
> @@ -131,6 +131,64 @@ cpu_write_query_result(void *dst_slot, VkQueryResultFlags flags,
>     }
>  }
>  
> +#define NSEC_PER_SEC 1000000000
> +
> +static bool
> +query_is_available(struct anv_device *device, uint64_t *slot)
> +{
> +   if (!device->info.has_llc)
> +      __builtin_ia32_clflush(slot);

Make the target cacheable? Your query write will then do the cacheline
invalidation, but there's obviously a tradeoff depending on the frequency
of snooping.

> +
> +   return slot[0];
> +}
> +
> +static VkResult
> +wait_for_available(struct anv_device *device,
> +                   struct anv_query_pool *pool, uint64_t *slot)
> +{
> +   while (true) {
> +      struct timespec start;
> +      clock_gettime(CLOCK_MONOTONIC, &start);
> +
> +      while (true) {
> +         if (!device->info.has_llc)
> +            __builtin_ia32_clflush(slot);
> +
> +         if (query_is_available(device, slot))
> +            return VK_SUCCESS;
> +
> +         struct timespec current;
> +         clock_gettime(CLOCK_MONOTONIC, &current);
> +
> +         if (current.tv_nsec < start.tv_nsec) {
> +            current.tv_nsec += NSEC_PER_SEC;
> +            current.tv_sec -= 1;
> +         }
> +
> +         /* If we've been looping for more than 1 ms, break out of the busy
> +          * loop and ask the kernel if the buffer is actually busy.
> +          */
> +         if (current.tv_sec > start.tv_sec ||
> +             current.tv_nsec - start.tv_nsec > 1000000)
> +            break;
> +      }
> +
> +      VkResult result = anv_device_wait(device, &pool->bo, 0);

Using the busy-ioctl is even cheaper than wait(0).

> +      switch (result) {
> +      case VK_SUCCESS:
> +         /* The BO is no longer busy.  If we haven't seen availability yet,
> +          * then we never will.
> +          */
> +         return query_is_available(device, slot) ? VK_SUCCESS : VK_NOT_READY;
> +      case VK_TIMEOUT:
> +         /* The BO is still busy, keep waiting. */
> +         continue;
> +      default:
> +         return result;
> +      }
> +   }
> +}

-- 
Chris Wilson, Intel Open Source Technology Centre