[Mesa-dev] [PATCH] anv/query: Busy-wait for available query entries
Eero Tamminen
eero.t.tamminen at intel.com
Wed Apr 5 11:36:54 UTC 2017
Hi,
On 05.04.2017 11:27, Chris Wilson wrote:
> On Tue, Apr 04, 2017 at 07:21:38PM -0700, Jason Ekstrand wrote:
>> Before, we were just looking at whether or not the user wanted us to
>> wait and waiting on the BO. This instead makes us busy-loop on each
>> query until it's available. This reduces some of the pipeline bubbles
>> we were getting and improves performance of The Talos Principle on
>> medium settings (where the GPU isn't overloaded) by around 20% on my
>> SkyLake gt4.
I get similar results. Depending on which part of the benchmark one is
running, and whether one is using the (Ubuntu default) powersave [1] or
the performance governor with Mesa master, the improvement is 10-25%.
Even with everything set to Ultra, there's a small improvement.
Tested-by: Eero Tamminen <eero.t.tamminen at intel.com>
- Eero
[1] With the powersave governor, without this patch the kernel doesn't
even request the GPU to run at full speed. This was with the Ubuntu 4.4 +
4.7 i915 backport kernel, but using the latest drm-tip didn't seem to
change things significantly.
> Hmm. The kernel also spins, but it limits itself to only spinning on the
> active request and for a max of 2us within your process's timeslice.
> The ioctl overhead is ~100ns in this case, cheaper than a call to
> clock_gettime()! Looks like the advantage here is that you do not limit
> yourself. A much simpler loop doing the same would be
>
> while (true) {
>     if (query_is_available())
>         return VK_SUCCESS;
>
>     if (!gem_busy())
>         return query_is_available() ? VK_SUCCESS : VK_NOT_READY;
> }
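For reference, the gem_busy() in Chris' sketch maps onto the i915 GEM_BUSY
ioctl. A minimal, untested sketch of such a helper (the fd/handle plumbing
here is assumed, not taken from the patch):

#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static bool
gem_busy(int fd, uint32_t gem_handle)
{
   /* GEM_BUSY reports whether the GPU still has work in flight on the BO. */
   struct drm_i915_gem_busy busy = { .handle = gem_handle };

   if (ioctl(fd, DRM_IOCTL_I915_GEM_BUSY, &busy) != 0)
      return false; /* on error, let the caller re-check availability */

   return busy.busy != 0;
}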
>
>> ---
>> src/intel/vulkan/genX_query.c | 72 +++++++++++++++++++++++++++++++++++++++----
>> 1 file changed, 66 insertions(+), 6 deletions(-)
>>
>> diff --git a/src/intel/vulkan/genX_query.c b/src/intel/vulkan/genX_query.c
>> index 7ea9404..ebf99d2 100644
>> --- a/src/intel/vulkan/genX_query.c
>> +++ b/src/intel/vulkan/genX_query.c
>> @@ -131,6 +131,64 @@ cpu_write_query_result(void *dst_slot, VkQueryResultFlags flags,
>>     }
>>  }
>>
>> +#define NSEC_PER_SEC 1000000000
>> +
>> +static bool
>> +query_is_available(struct anv_device *device, uint64_t *slot)
>> +{
>> +   if (!device->info.has_llc)
>> +      __builtin_ia32_clflush(slot);
>
> Make the target cacheable? Your query write will then do the cacheline
> invalidation, but there's obviously a tradeoff depending on the frequency
> of snooping.
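(For the cacheable-target idea: on non-LLC parts that would mean setting the
BO to snooped via the SET_CACHING ioctl, roughly as below; bo_handle and fd
are placeholders, and whether the extra snoop traffic wins depends on how
often the GPU writes, as Chris notes.)

struct drm_i915_gem_caching caching = {
   .handle = bo_handle,            /* placeholder: the query pool's GEM handle */
   .caching = I915_CACHING_CACHED, /* snooped/cacheable on non-LLC parts */
};
ioctl(fd, DRM_IOCTL_I915_GEM_SET_CACHING, &caching);

With that, the GPU's availability write invalidates the CPU cacheline by
itself, so the clflush before each read could be dropped.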
>
>> +
>> +   return slot[0];
>> +}
>> +
>> +static VkResult
>> +wait_for_available(struct anv_device *device,
>> +                   struct anv_query_pool *pool, uint64_t *slot)
>> +{
>> +   while (true) {
>> +      struct timespec start;
>> +      clock_gettime(CLOCK_MONOTONIC, &start);
>> +
>> +      while (true) {
>> +         if (!device->info.has_llc)
>> +            __builtin_ia32_clflush(slot);
>> +
>> +         if (query_is_available(device, slot))
>> +            return VK_SUCCESS;
>> +
>> +         struct timespec current;
>> +         clock_gettime(CLOCK_MONOTONIC, &current);
>> +
>> +         if (current.tv_nsec < start.tv_nsec) {
>> +            current.tv_nsec += NSEC_PER_SEC;
>> +            current.tv_sec -= 1;
>> +         }
>> +
>> +         /* If we've been looping for more than 1 ms, break out of the busy
>> +          * loop and ask the kernel if the buffer is actually busy.
>> +          */
>> +         if (current.tv_sec > start.tv_sec ||
>> +             current.tv_nsec - start.tv_nsec > 1000000)
>> +            break;
>> +      }
>> +
>> +      VkResult result = anv_device_wait(device, &pool->bo, 0);
>
> Using the busy-ioctl is even cheaper than wait(0).
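With the busy-ioctl variant, the wait(0) above could become a non-blocking
check reusing the gem_busy() helper sketched earlier (field names such as
device->fd and pool->bo.gem_handle are assumptions, not the exact anv API):

if (!gem_busy(device->fd, pool->bo.gem_handle)) {
   /* The BO is idle; if availability still hasn't landed, it never will. */
   return query_is_available(device, slot) ? VK_SUCCESS : VK_NOT_READY;
}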
>
>> +      switch (result) {
>> +      case VK_SUCCESS:
>> +         /* The BO is no longer busy. If we haven't seen availability yet,
>> +          * then we never will.
>> +          */
>> +         return query_is_available(device, slot) ? VK_SUCCESS : VK_NOT_READY;
>> +      case VK_TIMEOUT:
>> +         /* The BO is still busy, keep waiting. */
>> +         continue;
>> +      default:
>> +         return result;
>> +      }
>> +   }
>> +}
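For context, this helper ends up being called once per query slot from
GetQueryPoolResults when the app asks to wait. Roughly like this (a sketch,
not the actual hunk from the patch; query_slot() is a hypothetical helper
for the slot arithmetic):

for (uint32_t i = 0; i < queryCount; i++) {
   uint64_t *slot = query_slot(pool, firstQuery + i); /* hypothetical */

   if (flags & VK_QUERY_RESULT_WAIT_BIT) {
      VkResult result = wait_for_available(device, pool, slot);
      if (result != VK_SUCCESS)
         return result;
   }
}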
>