[Mesa-dev] [PATCH] anv/query: Busy-wait for available query entries
Eero Tamminen
eero.t.tamminen at intel.com
Wed Apr 5 11:36:54 UTC 2017
Hi,
On 05.04.2017 11:27, Chris Wilson wrote:
> On Tue, Apr 04, 2017 at 07:21:38PM -0700, Jason Ekstrand wrote:
>> Before, we were just looking at whether or not the user wanted us to
>> wait and waiting on the BO. This instead makes us busy-loop on each
>> query until it's available. This reduces some of the pipeline bubbles
>> we were getting and improves performance of The Talos Principle on
>> medium settings (where the GPU isn't overloaded) by around 20% on my
>> SkyLake gt4.
I get similar results. Depending on which part of the benchmark one is
running, and whether one is using the (Ubuntu default) powersave [1] or
the performance governor with Mesa master, the improvement is 10-25%.
Even with everything set to Ultra, there's a small improvement.
Tested-by: Eero Tamminen <eero.t.tamminen at intel.com>
- Eero
[1] With the powersave governor, without this patch the kernel doesn't
even request the GPU to run at full speed. This was with the Ubuntu 4.4 +
4.7 i915 backport kernel, but using the latest drm-tip didn't seem to
change things significantly.
> Hmm. The kernel also spins, but it limits itself to only spinning on the
> active request and for a max of 2us within your process's timeslice.
> The ioctl overhead is ~100ns in this case, cheaper than a call to
> clock_gettime()! Looks like the advantage here is that you do not limit
> yourself. A much simpler loop doing the same would be
>
> while (true) {
>     if (query_is_available())
>         return VK_SUCCESS;
>
>     if (!gem_busy())
>         return query_is_available() ? VK_SUCCESS : VK_NOT_READY;
> }
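For reference, the gem_busy() in Chris' sketch maps onto the i915 GEM_BUSY
ioctl. A minimal, untested sketch of such a helper (the fd/handle plumbing
here is assumed, not taken from the patch):

#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static bool
gem_busy(int fd, uint32_t gem_handle)
{
   /* GEM_BUSY reports whether the GPU still has work in flight on the BO. */
   struct drm_i915_gem_busy busy = { .handle = gem_handle };

   if (ioctl(fd, DRM_IOCTL_I915_GEM_BUSY, &busy) != 0)
      return false; /* on error, let the caller re-check availability */

   return busy.busy != 0;
}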
>
>> ---
>> src/intel/vulkan/genX_query.c | 72 +++++++++++++++++++++++++++++++++++++++----
>> 1 file changed, 66 insertions(+), 6 deletions(-)
>>
>> diff --git a/src/intel/vulkan/genX_query.c b/src/intel/vulkan/genX_query.c
>> index 7ea9404..ebf99d2 100644
>> --- a/src/intel/vulkan/genX_query.c
>> +++ b/src/intel/vulkan/genX_query.c
>> @@ -131,6 +131,64 @@ cpu_write_query_result(void *dst_slot, VkQueryResultFlags flags,
>>     }
>>  }
>>
>> +#define NSEC_PER_SEC 1000000000
>> +
>> +static bool
>> +query_is_available(struct anv_device *device, uint64_t *slot)
>> +{
>> +   if (!device->info.has_llc)
>> +      __builtin_ia32_clflush(slot);
>
> Make the target cacheable? Your query write will then do the cacheline
> invalidation, but there's obviously a tradeoff depending on the frequency
> of snooping.
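(For the cacheable-target idea: on non-LLC parts that would mean setting the
BO to snooped via the SET_CACHING ioctl, roughly as below; bo_handle and fd
are placeholders, and whether the extra snoop traffic wins depends on how
often the GPU writes, as Chris notes.)

struct drm_i915_gem_caching caching = {
   .handle = bo_handle,            /* placeholder: the query pool's GEM handle */
   .caching = I915_CACHING_CACHED, /* snooped/cacheable on non-LLC parts */
};
ioctl(fd, DRM_IOCTL_I915_GEM_SET_CACHING, &caching);

With that, the GPU's availability write invalidates the CPU cacheline by
itself, so the clflush before each read could be dropped.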
>
>> +
>> +   return slot[0];
>> +}
>> +
>> +static VkResult
>> +wait_for_available(struct anv_device *device,
>> +                   struct anv_query_pool *pool, uint64_t *slot)
>> +{
>> +   while (true) {
>> +      struct timespec start;
>> +      clock_gettime(CLOCK_MONOTONIC, &start);
>> +
>> +      while (true) {
>> +         if (!device->info.has_llc)
>> +            __builtin_ia32_clflush(slot);
>> +
>> +         if (query_is_available(device, slot))
>> +            return VK_SUCCESS;
>> +
>> +         struct timespec current;
>> +         clock_gettime(CLOCK_MONOTONIC, &current);
>> +
>> +         if (current.tv_nsec < start.tv_nsec) {
>> +            current.tv_nsec += NSEC_PER_SEC;
>> +            current.tv_sec -= 1;
>> +         }
>> +
>> +         /* If we've been looping for more than 1 ms, break out of the busy
>> +          * loop and ask the kernel if the buffer is actually busy.
>> +          */
>> +         if (current.tv_sec > start.tv_sec ||
>> +             current.tv_nsec - start.tv_nsec > 1000000)
>> +            break;
>> +      }
>> +
>> +      VkResult result = anv_device_wait(device, &pool->bo, 0);
>
> Using the busy-ioctl is even cheaper than wait(0).
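With the busy-ioctl variant, the wait(0) above could become a non-blocking
check reusing the gem_busy() helper sketched earlier (field names such as
device->fd and pool->bo.gem_handle are assumptions, not the exact anv API):

if (!gem_busy(device->fd, pool->bo.gem_handle)) {
   /* The BO is idle; if availability still hasn't landed, it never will. */
   return query_is_available(device, slot) ? VK_SUCCESS : VK_NOT_READY;
}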
>
>> +      switch (result) {
>> +      case VK_SUCCESS:
>> +         /* The BO is no longer busy. If we haven't seen availability yet,
>> +          * then we never will.
>> +          */
>> +         return query_is_available(device, slot) ? VK_SUCCESS : VK_NOT_READY;
>> +      case VK_TIMEOUT:
>> +         /* The BO is still busy, keep waiting. */
>> +         continue;
>> +      default:
>> +         return result;
>> +      }
>> +   }
>> +}
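For context, this helper ends up being called once per query slot from
GetQueryPoolResults when the app asks to wait. Roughly like this (a sketch,
not the actual hunk from the patch; query_slot() is a hypothetical helper
for the slot arithmetic):

for (uint32_t i = 0; i < queryCount; i++) {
   uint64_t *slot = query_slot(pool, firstQuery + i); /* hypothetical */

   if (flags & VK_QUERY_RESULT_WAIT_BIT) {
      VkResult result = wait_for_available(device, pool, slot);
      if (result != VK_SUCCESS)
         return result;
   }
}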
>