<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Nov 3, 2016 at 1:53 AM, Chris Wilson <span dir="ltr"><<a href="mailto:chris@chris-wilson.co.uk" target="_blank">chris@chris-wilson.co.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Wed, Nov 02, 2016 at 01:58:43PM -0700, Jason Ekstrand wrote:<br>
> +/**<br>
> + * This function applies the relocation for a command buffer and writes the<br>
> + * actual addresses into the buffers as per what we were told by the kernel on<br>
> + * the previous execbuf2 call. This should be safe to do because, for each<br>
> + * relocated address, we have two cases:<br>
> + *<br>
> + * 1) The target BO is inactive (as seen by the kernel). In this case, it is<br>
> + * not in use by the GPU so updating the address is 100% ok. It won't be<br>
> + * in-use by the GPU (from our context) again until the next execbuf2<br>
> + * happens. If the kernel decides to move it in the next execbuf2, it<br>
> + * will have to do the relocations itself, but that's ok because it should<br>
> + * have all of the information needed to do so.<br>
> + *<br>
> + * 2) The target BO is active (as seen by the kernel). In this case, it<br>
> + * hasn't moved since the last execbuffer2 call because GTT shuffling<br>
> + * *only* happens inside the execbuffer2 ioctl. Since the target BO<br>
> + * hasn't moved, our anv_bo::offset exactly matches the BO's GTT address<br>
> + * and the relocated value we are writing into the BO will be the same as<br>
> + * the value that is already there.<br>
<br>
</span>This is slightly misleading.<br>
<br>
"In this case, it hasn't moved since the last execbuffer2 call because<br>
GTT shuffling *only* happens when the BO is idle. (From our perspective,<br>
it only happens inside the execbuffer2 ioctl, but the shuffling may be<br>
triggered by another ioctl, with full-ppgtt this is limited to only<br>
execbuffer2 ioctls on the same context, or memory pressure.)"<br></blockquote><div><br></div><div>Right. Without full ppgtt, things may get evicted because of another process but that means that a) the buffer isn't busy at that time and b) the kernel will know it moved something around and do the relocations anyway.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
So now, the slightly more opened ended question...<br>
<br>
Addresses are local to a context. From my understanding the<br>
VkQueueFamily corresponds to the render engine, or media engine, or<br>
blitter (or maybe compute on top of render?)</blockquote><div><br></div><div>Yes, that's correct.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> and the application may<br>
create multiple VkQueues in each family. (Aiui, at the moment this is<br>
limited to 1 queue.)<br></blockquote><div><br></div><div>Yes, we do limit it to 1 queue and will continue to do so until preeption gets sorted out. If we support multiple queues they have to be able to truly run in parallel with VkSemaphore dependencies between them and possibly going both ways. We can't support a blit queue because the blitter isn't capable of handling things like W-tiling which would need to be supported on a DMA queue.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
The reloc tables must be local to each queue (since the addresses<br>
contained within only match the local context). What I am more worried<br>
about is whether state may be shared between contexts, and whether<br>
active state have the same target address in different contexts (i.e.<br>
the requirement above that you only need to relocate idle state). Whilst<br>
you only have one context, it is a moot point - I'd just like to see<br>
some acknowledgement that surfaces may have different addresses in<br>
different contexts and how you want to handle that.<br></blockquote><div><br></div><div>Once we get into the world of multiple queues, I think we'll be using softpin or SVM and relocations will be a thing of the past. Thanks for the heads up though!<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">
> + *<br>
> + * There is also a possibility that the target BO is active but the exact<br>
> + * RENDER_SURFACE_STATE object we are writing the relocation into isn't in<br>
> + * use. In this case, the address currently in the RENDER_SURFACE_STATE<br>
> + * may be stale but it's still safe to write the relocation because that<br>
> + * particular RENDER_SURFACE_STATE object isn't in-use by the GPU and<br>
> + * won't be until the next execbuf2 call.<br>
> + *<br>
> + * By doing relocations on the CPU, we can tell the kernel that it doesn't<br>
> + * need to bother. We want to do this because the surface state buffer is<br>
> + * used by every command buffer so, if the kernel does the relocations, it<br>
> + * will always be busy and the kernel will always stall. This is also<br>
> + * probably the fastest mechanism for doing relocations since the kernel would<br>
> + * have to make a full copy of all the relocations lists.<br>
> + */<br>
> +static bool<br>
> +relocate_cmd_buffer(struct anv_cmd_buffer *cmd_buffer)<br>
> +{<br>
> + static int userspace_relocs = -1;<br>
> + if (userspace_relocs < 0)<br>
> + userspace_relocs = env_var_as_boolean("ANV_<wbr>USERSPACE_RELOCS", true);<br>
> + if (!userspace_relocs)<br>
> + return false;<br>
> +<br>
> + /* First, we have to check to see whether or not we can even do the<br>
> + * relocation. New buffers which have never been submitted to the kernel<br>
> + * don't have a valid offset so we need to let the kernel do relocations so<br>
> + * that we can get offsets for them. On future execbuf2 calls, those<br>
> + * buffers will have offsets and we will be able to skip relocating.<br>
> + * Invalid offsets are indicated by anv_bo::offset == (uint64_t)-1.<br>
> + */<br>
<br>
</div></div>Something to consider is just randomly assigning an address and using<br>
it. The kernel will relocate if it wants to, but in future the kernel<br>
will use your address as a hint and try to bind the object there (if<br>
that space is available). Like softpinning, but without failing with<br>
-EINVAL if the space is not available. Or you may of course do full<br>
client side address allocation under full-ppgtt (the hybrid scheme is<br>
still useful elsewhere). The advantage is avoiding the stall on first<br>
state use. The disadvantage is that until that patch lands, you walk the<br>
relocations for no gain (otoh, since the kernel will take its own sweet<br>
time doing the same, the inefficiency here will hopefully be negligible.)<span class=""><br></span></blockquote><div><br></div><div>Does that really work? My reading of the kernel sources indicates that NO_RELOC means the kernel will just assume that they match what it gave you in the last execbuf2. It never checks that the offsets match unless it had to move something in the GTT. If, on the other hand, you mark the object as PINNED and the offsets don't match, it will get flagged in eb_vma_misplaced, remapped, and the relocations *will* trigger.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">
> + bool offsets_changed = false;<br>
> + for (uint32_t i = 0; i < cmd_buffer->execbuf2.bo_count; i++) {<br>
> + struct anv_bo *bo = cmd_buffer->execbuf2.bos[i];<br>
> +<br>
> + if (bo->offset == (uint64_t)-1)<br>
> + return false;<br>
> +<br>
> + if (cmd_buffer->execbuf2.objects[<wbr>i].offset != bo->offset)<br>
> + offsets_changed = true;<br>
> + }<br>
> +<br>
> + /* Since surface states are shared between command buffers and we don't<br>
> + * know what order they will be submitted to the kernel, we don't know<br>
> + * what address is actually written in the surface state object at any<br>
> + * given time. The only option is to always relocate them.<br>
> + */<br>
> + anv_reloc_list_apply(&cmd_<wbr>buffer->surface_relocs,<br>
> + cmd_buffer->device,<br>
> + &cmd_buffer->device-><a href="http://surface_state_block_pool.bo" rel="noreferrer" target="_blank">surface_<wbr>state_block_pool.bo</a>,<br>
> + true /* always relocate surface states */);<br>
<br>
</span>Worth keeping an atomic_t surface_state_reloc_serial ? (later!)<br></blockquote><div><br></div><div>Possibly... But my gut tells me that the case where we can actually re-use the addresses in the surface state buffer is fairly rare.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">
> + if (offsets_changed) {<br>
> + /* Since we own all of the batch buffers, we know what values are stored<br>
> + * in the relocated addresses and only have to update them if the<br>
> + * offsets have changed.<br>
> + */<br>
> + struct anv_batch_bo **bbo;<br>
> + u_vector_foreach(bbo, &cmd_buffer->seen_bbos) {<br>
> + anv_reloc_list_apply(&(*bbo)-><wbr>relocs,<br>
> + cmd_buffer->device, &(*bbo)->bo, false);<br>
> + }<br>
> + }<br>
> +<br>
> + for (uint32_t i = 0; i < cmd_buffer->execbuf2.bo_count; i++) {<br>
> + struct anv_bo *bo = cmd_buffer->execbuf2.bos[i];<br>
> +<br>
> + cmd_buffer->execbuf2.objects[<wbr>i].offset = bo->offset;<br>
> + }<br>
> +<br>
> + return true;<br>
> +}<br>
> +<br>
> VkResult<br>
> anv_cmd_buffer_execbuf(struct anv_device *device,<br>
> struct anv_cmd_buffer *cmd_buffer)<br>
> {<br>
> - /* Since surface states are shared between command buffers and we don't<br>
> - * know what order they will be submitted to the kernel, we don't know what<br>
> - * address is actually written in the surface state object at any given<br>
> - * time. The only option is to set a bogus presumed offset and let<br>
> - * relocations do their job.<br>
> - */<br>
> - for (size_t i = 0; i < cmd_buffer->surface_relocs.<wbr>num_relocs; i++)<br>
> - cmd_buffer->surface_relocs.<wbr>relocs[i].presumed_offset = -1;<br>
> + /* Make a copy of the execbuffer2 so we can alter the flags field */<br>
> + struct drm_i915_gem_execbuffer2 execbuf = cmd_buffer->execbuf2.execbuf;<br>
> +<br>
> + if (relocate_cmd_buffer(cmd_<wbr>buffer)) {<br>
> + /* If we were able to successfully relocate everything, tell the kernel<br>
> + * that it can skip doing relocations.<br>
> + */<br>
<br>
</div></div>/* The requirement for using NO_RELOC is:<br>
*<br>
* The addresses written in the objects must match the corresponding<br>
* reloc.presumed_offset which in turn must match the corresponding<br>
* execobject.offset.<br>
*<br>
* To avoid stalling, execobject.offset should match the current<br>
* address of that object within the active context.<br></blockquote><div><br></div><div>Actually, from my reading of the i915 sources, the kernel doesn't check this. It's your responsibility to ensure that they match the addresses returned by the previous execbuf2 ioctl.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
*<br>
<span class=""> * If we were able to successfully relocate everything, tell the kernel<br>
</span><span class=""> * that it can skip doing relocations.<br>
</span><div><div class="h5"> */<br></div></div></blockquote><div><br></div><div>I copy+paste your comment (with my correction).<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">
> + execbuf.flags |= I915_EXEC_NO_RELOC;<br>
> + } else {<br>
> + /* In the case where we fall back to doing kernel relocations, we need<br>
> + * to ensure that the relocation list is valid. All relocations on the<br>
> + * batch buffers are already valid and kept up-to-date. Since surface<br>
> + * states are shared between command buffers and we don't know what<br>
> + * order they will be submitted to the kernel, we don't know what<br>
> + * address is actually written in the surface state object at any given<br>
> + * time. The only option is to set a bogus presumed offset and let the<br>
> + * kernel relocate them.<br>
> + */<br>
> + for (size_t i = 0; i < cmd_buffer->surface_relocs.<wbr>num_relocs; i++)<br>
> + cmd_buffer->surface_relocs.<wbr>relocs[i].presumed_offset = -1;<br>
> + }<br>
><br>
> - return anv_device_execbuf(device, &cmd_buffer->execbuf2.execbuf,<br>
> - cmd_buffer->execbuf2.bos);<br>
> + return anv_device_execbuf(device, &execbuf, cmd_buffer->execbuf2.bos);<br>
> }<br>
> diff --git a/src/intel/vulkan/anv_device.<wbr>c b/src/intel/vulkan/anv_device.<wbr>c<br>
> index c40598c..a650212 100644<br>
> --- a/src/intel/vulkan/anv_device.<wbr>c<br>
> +++ b/src/intel/vulkan/anv_device.<wbr>c<br>
> @@ -1097,6 +1097,8 @@ VkResult anv_QueueSubmit(<br>
> struct anv_device *device = queue->device;<br>
> VkResult result = VK_SUCCESS;<br>
<br>
</div></div>/* Take a lock on all shared surface_state in order to perform relocations. */<br>
> + pthread_mutex_lock(&device-><wbr>mutex);<br>
<br>
I think locking inside Vulkan is so rare, that it is worth explaining :)<br></blockquote><div><br></div><div>Fair enough.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class="HOEnZb"><font color="#888888">-Chris<br>
<br>
--<br>
Chris Wilson, Intel Open Source Technology Centre<br>
</font></span></blockquote></div><br></div></div>