<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Apr 5, 2017 at 1:27 AM, Chris Wilson <span dir="ltr"><<a href="mailto:chris@chris-wilson.co.uk" target="_blank">chris@chris-wilson.co.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Tue, Apr 04, 2017 at 07:21:38PM -0700, Jason Ekstrand wrote:<br> > Before, we were just looking at whether or not the user wanted us to<br> > wait and waiting on the BO. This instead makes us busy-loop on each<br> > query until it's available. This reduces some of the pipeline bubbles<br> > we were getting and improves performance of The Talos Principle on<br> > medium settings (where the GPU isn't overloaded) by around 20% on my<br> > SkyLake gt4.<br> <br> </span>Hmm. The kernel also spins, but it limits itself to only spining on the<br> active request and for a max of 2us within your process's timeslice.<br> The ioctl overhead is ~100ns in this case, cheaper than a call to<br> clock_gettime()! Looks like the advantage here is that you do not limit<br> yourself. A much simpler loop doing the same would be<br></blockquote><div><br></div><div>If clock_gettime() isn't gaining me anything, I'm very happy to drop it. It just makes things more complicated as you say. Am I better off just calling gem_busy in my loop?<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> while (true) {<br> if (query_is_available())<br> return VK_SUCCESS;<br> <br> if (!gem_busy())<br> return query_is_available() ? VK_SUCCESS : VK_NOT_READY;<br> <span class=""> }<br> <br> > ---<br> > src/intel/vulkan/genX_query.c | 72 ++++++++++++++++++++++++++++++<wbr>+++++++++----<br> > 1 file changed, 66 insertions(+), 6 deletions(-)<br> ><br> > diff --git a/src/intel/vulkan/genX_query.<wbr>c b/src/intel/vulkan/genX_query.<wbr>c<br> > index 7ea9404..ebf99d2 100644<br> > --- a/src/intel/vulkan/genX_query.<wbr>c<br> > +++ b/src/intel/vulkan/genX_query.<wbr>c<br> > @@ -131,6 +131,64 @@ cpu_write_query_result(void *dst_slot, VkQueryResultFlags flags,<br> > }<br> > }<br> ><br> > +#define NSEC_PER_SEC 1000000000<br> > +<br> > +static bool<br> > +query_is_available(struct anv_device *device, uint64_t *slot)<br> > +{<br> > + if (!device->info.has_llc)<br> > + __builtin_ia32_clflush(slot);<br> <br> </span>Make the target cacheable? Your query write will then do the cacheline<br> invalidation, but there's obviously a tradeoff depending on the frequency<br> of snooping.<br><div><div class="h5"></div></div></blockquote><div><br></div><div>I'd like to do that eventually, yes. I'm planning to make some driver-wide changes to use better maps and caching in the future. For now, this is the way the rest of the driver works.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5"> > +<br> > + return slot[0];<br> > +}<br> > +<br> > +static VkResult<br> > +wait_for_available(struct anv_device *device,<br> > + struct anv_query_pool *pool, uint64_t *slot)<br> > +{<br> > + while (true) {<br> > + struct timespec start;<br> > + clock_gettime(CLOCK_MONOTONIC, &start);<br> > +<br> > + while (true) {<br> > + if (!device->info.has_llc)<br> > + __builtin_ia32_clflush(slot);<br> > +<br> > + if (query_is_available(device, slot))<br> > + return VK_SUCCESS;<br> > +<br> > + struct timespec current;<br> > + clock_gettime(CLOCK_MONOTONIC, &current);<br> > +<br> > + if (current.tv_nsec < start.tv_nsec) {<br> > + current.tv_nsec += NSEC_PER_SEC;<br> > + current.tv_sec -= 1;<br> > + }<br> > +<br> > + /* If we've been looping for more than 1 ms, break out of the busy<br> > + * loop and ask the kernel if the buffer is actually busy.<br> > + */<br> > + if (current.tv_sec > start.tv_sec ||<br> > + current.tv_nsec - start.tv_nsec > 1000000)<br> > + break;<br> > + }<br> > +<br> > + VkResult result = anv_device_wait(device, &pool->bo, 0);<br> <br> </div></div>Using the busy-ioctl is even cheaper than wait(0).<br></blockquote><div><br></div><div>Sure. I can add an anv_device_bo_busy or just make anv_device_wait with a timeout of 0 use the busy ioctl.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"> > + switch (result) {<br> > + case VK_SUCCESS:<br> > + /* The BO is no longer busy. If we haven't seen availability yet,<br> > + * then we never will.<br> > + */<br> > + return query_is_available(device, slot) ? VK_SUCCESS : VK_NOT_READY;<br> > + case VK_TIMEOUT:<br> > + /* The BO is still busy, keep waiting. */<br> > + continue;<br> > + default:<br> > + return result;<br> > + }<br> > + }<br> > +}<br> <br> </div></div><span class="HOEnZb"><font color="#888888">--<br> Chris Wilson, Intel Open Source Technology Centre<br> </font></span></blockquote></div><br></div></div>