[Mesa-dev] [PATCH 09/17] mesa/st: add support for waiting for semaphore objects

Nicolai Hähnle nhaehnle at gmail.com
Fri Nov 3 20:03:16 UTC 2017


Hi Andres,

On 03.11.2017 18:36, Andres Rodriguez wrote:
> On 2017-11-03 05:17 AM, Nicolai Hähnle wrote:
>> On 02.11.2017 04:57, Andres Rodriguez wrote:
>>> Bits to implement ServerWaitSemaphoreObject/ServerSignalSemaphoreObject
>>>
>>> Signed-off-by: Andres Rodriguez <andresx7 at gmail.com>
>>> ---
>>>   src/mesa/state_tracker/st_cb_semaphoreobjects.c | 28 
>>> +++++++++++++++++++++++++
>>>   1 file changed, 28 insertions(+)
>>>
>>> diff --git a/src/mesa/state_tracker/st_cb_semaphoreobjects.c 
>>> b/src/mesa/state_tracker/st_cb_semaphoreobjects.c
>>> index 47ece47..4cff3fd 100644
>>> --- a/src/mesa/state_tracker/st_cb_semaphoreobjects.c
>>> +++ b/src/mesa/state_tracker/st_cb_semaphoreobjects.c
>>> @@ -1,5 +1,7 @@
>>> +
>>>   #include "main/imports.h"
>>>   #include "main/mtypes.h"
>>> +#include "main/context.h"
>>>   #include "main/externalobjects.h"
>>> @@ -47,10 +49,36 @@ st_import_semaphoreobj_fd(struct gl_context *ctx,
>>>   #endif
>>>   }
>>> +static void
>>> +st_server_wait_semaphore(struct gl_context *ctx,
>>> +                         struct gl_semaphore_object *semObj)
>>> +{
>>> +   struct st_semaphore_object *st_obj = st_semaphore_object(semObj);
>>> +   struct st_context *st = st_context(ctx);
>>> +   struct pipe_context *pipe = st->pipe;
>>> +
>>> +   _mesa_flush(ctx);
>>
>> What's the justification of this flush?
> 
> Thanks for all the review feedback :)
> 
> Technically according to the spec we shouldn't need to flush here. But 
> the flush is required due to how amdgpu handles syncobjs.
> 
> For amdgpu, the wait and signal operations happen on a submission 
> boundary. I.e., we can say "wait for this syncobj before executing this 
> submission", or "signal this syncobj after this submission completes". 
> But we can't really introduce a wait/signal in the middle of a 
> commandstream as the spec requires us to do.

This is a good point. There are actually two parts to this, a 
performance part and a correctness part.

The performance part is that we may want to start previously queued work 
already and not force it to wait on an unnecessary dependency.

The correctness part is that *if* the currently pending work is already 
signaling a semaphore, then flushing it may be required to avoid a 
deadlock as in:

   Context 1               Context 2
   ---------               ---------
   Signal semaphore A
                           Wait for semaphore A
                           Signal semaphore B
   Wait for semaphore B

At least for amdgpu (and really for any implementation that cannot 
context switch freely on the GPU) there must be a flush between the two 
operations in Context 1. Two caveats, though:

1. Whether we *have* to add a flush automatically depends on the exact 
semantics of SignalSemaphore, and
2. If we do have to add a flush automatically, it's up to us whether we 
want to do it as part of the signaling or before the wait (or both).

Based on typical GL semantics I kind of suspect that we *don't* have to 
flush automatically, but I'm sure you've read the EXT_external_objects 
spec more closely than I have :)
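To make the deadlock concrete, here is a tiny scheduler simulation (Python pseudocode, purely illustrative, not the Mesa/Gallium API): with boundary-only semantics, Context 1's un-flushed signal and wait collapse into one submission whose wait effectively precedes its own signal.

```python
def run_all(submissions):
    """Repeatedly run any submission whose waits are satisfied; return
    the set of submission indices that managed to execute."""
    signaled, ran, progress = set(), set(), True
    while progress:
        progress = False
        for i, sub in enumerate(submissions):
            if i not in ran and all(s in signaled for s in sub["waits"]):
                signaled |= set(sub["signals"])
                ran.add(i)
                progress = True
    return ran

# No flush in Context 1: its "signal A" and "wait B" land in ONE submission.
# The hardware wait applies at the submission's *start*, the signal at its
# *end*, so the wait moves in front of the signal.
no_flush = [
    {"waits": ["B"], "signals": ["A"]},  # Context 1 (single submission)
    {"waits": ["A"], "signals": ["B"]},  # Context 2
]
assert run_all(no_flush) == set()        # deadlock: nothing executes

# A flush between "signal A" and "wait B" splits Context 1's stream:
with_flush = [
    {"waits": [], "signals": ["A"]},     # Context 1, first submission
    {"waits": ["B"], "signals": []},     # Context 1, second submission
    {"waits": ["A"], "signals": ["B"]},  # Context 2
]
assert run_all(with_flush) == {0, 1, 2}  # all work completes
```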


> This flush (and the one in st_server_signal_semaphore) exists to adjust 
> the submission boundaries. If we align the submission boundary with the 
> semaphore operation boundary, we can accurately emulate the timing of the 
> semaphore operations.
> 
> In this case, when st_server_wait_semaphore() is called, the user 
> expects that any work queued up until this point will not wait for the 
> semaphore. Any work submitted after the wait call returns should wait 
> for the semaphore to signal. Hence, we call flush so that the amdgpu 
> winsys will submit all the current work. Then we queue up the syncobj 
> for future submissions to wait on.
> 
> In the signal case, we flush so that the syncobj will signal when the 
> current work finishes (instead of at "current work + N"). That way we 
> don't have any unnecessary latency.
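A toy model of what you describe (Python pseudocode; `draw`, `server_wait`, and `server_signal` are illustrative names, not the real API): flushing at the wait keeps previously queued work independent of the semaphore, and flushing at the signal makes the syncobj fire after exactly the work queued so far.

```python
class ToyContext:
    """Boundary-only semantics: waits apply before a whole submission,
    signals fire after one, as described for amdgpu."""
    def __init__(self):
        self.pending = []      # commands queued since the last flush
        self.next_waits = []   # syncobjs the next submission must wait on
        self.submitted = []    # (commands, waits, signals) tuples

    def draw(self, cmd):
        self.pending.append(cmd)

    def flush(self, signals=()):
        self.submitted.append((self.pending, self.next_waits, list(signals)))
        self.pending, self.next_waits = [], []

    def server_wait(self, sem):
        self.flush()                 # already-queued work must NOT wait
        self.next_waits.append(sem)  # only future submissions wait

    def server_signal(self, sem):
        self.flush(signals=[sem])    # fire after current work, not "work + N"

ctx = ToyContext()
ctx.draw("A")
ctx.server_wait("sem1")    # "A" submits without waiting on sem1
ctx.draw("B")
ctx.server_signal("sem2")  # sem2 fires once "B" completes
```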

Flushing immediately may not be required for correctness, though. This 
needs to be documented explicitly.


> However, I'm not 100% certain my approach here is the best solution. My 
> main concern is that an amdgpu-specific detail is surfaced in what 
> should be a pipe-agnostic layer. If you have any recommendations on how 
> to move this code into the pipe layer, or achieve the same results with 
> a different strategy, I'd love to take the advice.
> 
> From the performance point of view, I'm not as concerned. Wait/signal 
> tend to be aligned at frame boundaries so flushing there shouldn't be a 
> big deal. But I do think that improving the code structure from my 
> current iteration would have value.

If we flush immediately, there's the possibility that an application 
ends up flushing for the signal, then does some silly minor housekeeping 
jobs, and then flushes again for an end-of-frame flush.


>>> +   pipe->semobj_wait(pipe, st_obj->semaphore);
>>> +}
>>> +
>>> +static void
>>> +st_server_signal_semaphore(struct gl_context *ctx,
>>> +                           struct gl_semaphore_object *semObj)
>>> +{
>>> +   struct st_semaphore_object *st_obj = st_semaphore_object(semObj);
>>> +   struct st_context *st = st_context(ctx);
>>> +   struct pipe_context *pipe = st->pipe;
>>> +
>>
>> You have to flush state-tracker internal state here. This means 
>> FLUSH_VERTICES, st_flush_bitmap, possibly others. I don't think we 
>> have this factored out well yet, but see below.
> 
> Thanks for pointing this out, this actually gives me an idea for getting 
> rid of the full flushes, see below.
> 
>>
>>
>>> +   pipe->semobj_signal(pipe, st_obj->semaphore);
>>> +   _mesa_flush(ctx);
>>
>> What's the justification of this flush? Does EXT_external_objects 
>> contain any language to guarantee that SignalSemaphoreEXT will "finish 
>> in finite time"? If no, the flush should probably not be there. If 
>> yes, please add a spec quote.
> 
> See above for the flush reasoning.
> 
>>
>> Furthermore, this whole code section is a strong hint that merging 
>> fences and semaphores at the Gallium level is indeed a good idea. 
>> Basically, the idea is that you'd end up calling
>>
>>      pipe->flush(pipe, &st_obj->semaphore, flags);
>>
>> where flags is PIPE_FLUSH_DEFERRED | PIPE_FLUSH_REUSE_FENCE or 
>> PIPE_FLUSH_ASYNC | PIPE_FLUSH_REUSE_FENCE, depending on how the 
>> previous paragraph is resolved. In one of my recent series that 
>> includes deferred fences for threaded contexts, I'm already doing the 
>> equivalent of adding a "PIPE_FLUSH_REUSE_FENCE". We could formalize 
>> this as a more broadly applicable feature in the Gallium interface.
>>
> 
> Thanks for pointing out PIPE_FLUSH_REUSE_FENCE, I wasn't aware of this 
> flag.

It actually doesn't exist yet :)

I was thinking of the patch 
https://patchwork.freedesktop.org/patch/184160/, which adds a 
TC_FLUSH_ASYNC only for the purpose of threaded Gallium. The flag 
changes the semantics such that pipe_context::flush is supposed to use a 
newly allocated fence object which is already passed in.

This is not quite the same as I would expect from a 
PIPE_FLUSH_REUSE_FENCE. The differences are:

- PIPE_FLUSH_REUSE_FENCE actually resets a previously used fence
- TC_FLUSH_ASYNC essentially splits the fence signaling into a "top 
half" that happens in the API thread and a "bottom half" that happens in 
the driver thread

But they're similar in many ways.


> My reasoning for keeping semaphore operations separate is that a 
> different pipe implementation might be able to emit a wait/signal packet 
> into the middle of the commandstream, with no flushing required.

At least for signal packets, this kind of already happens (with lots of 
limitations due to how amdgpu works) for PIPE_FLUSH_DEFERRED | 
PIPE_FLUSH_{TOP,BOTTOM}_OF_PIPE, which is added in that same patch series :)


> However, that argument has a giant hole in it. I have a _mesa_flush() 
> which completely negates any benefits of a "pipe agnostic" 
> implementation. So in this case, as you suggested, I think merging 
> semaphores with re-usable fences would be a good idea.
> 
> I do have one alternative proposal that tries to keep the option of 
> mid-commandstream semaphores open. How about something like:
> 
> flush_state_tracker_internal_state() {
>      flush_vertices();

As Marek correctly pointed out, FLUSH_VERTICES should really be called 
from mesa/main.


>      flush_bitmaps();
> }
> 
> st_server_wait_semaphore() {
>      flush_state_tracker_internal_state();

I don't think this is needed.


>      pipe->semobj_wait(pipe, st_obj->semaphore);
> 
>      // The amdgpu semobj_wait() implementation would
>      // then be responsible for adjusting the submission
>      // windows internally, i.e. calling flush()
> }
> 
> 
> st_server_signal_semaphore() {
>      flush_state_tracker_internal_state();
>      pipe->semobj_signal(pipe, st_obj->semaphore);
> 
>      // The amdgpu semobj_signal() implementation would
>      // then be responsible for adjusting the submission
>      // windows internally, i.e. calling flush()
> }
> 
> Let me know what you think of this slightly more agnostic implementation.

I like it! The semobj_wait would translate to a fence_server_sync, and 
the semobj_signal would translate to a pipe->flush. (And if we do the 
merging, we should really consider renaming fences to semaphores in the 
long run.)
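A sketch of how that pipe-agnostic split might look (Python pseudocode with invented names such as `ToyDriver` and `boundary_only`, not real Gallium code): the flush decision lives inside the driver's semobj hooks, so a hypothetical driver that can emit mid-stream wait packets simply skips it.

```python
class ToyDriver:
    def __init__(self, boundary_only):
        self.boundary_only = boundary_only  # True for amdgpu-like hardware
        self.flushes = 0
        self.stream = []

    def flush(self):
        self.flushes += 1
        self.stream.append("--submit--")

    def semobj_wait(self, sem):
        if self.boundary_only:
            self.flush()  # amdgpu-style: the wait must sit on a boundary
        self.stream.append(f"wait {sem}")

    def semobj_signal(self, sem):
        self.stream.append(f"signal {sem}")
        if self.boundary_only:
            self.flush()  # amdgpu-style: the signal fires at submit time

def st_server_wait_semaphore(driver, sem):
    # The state tracker stays pipe-agnostic: no _mesa_flush() here.
    driver.semobj_wait(sem)

amdgpu_like = ToyDriver(boundary_only=True)
st_server_wait_semaphore(amdgpu_like, "A")  # driver flushed internally

midstream = ToyDriver(boundary_only=False)  # hypothetical capable driver
st_server_wait_semaphore(midstream, "A")    # no flush needed at all
```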

Depending on what the required semantics of signaling and/or latency 
considerations are, we'd use either ASYNC or DEFERRED flushes. For 
DEFERRED flushes, the driver would have a choice of flushing immediately 
or delaying based on whatever heuristics it prefers. Similarly for 
waiting, the driver would either have to flush immediately in some 
situations (if there are unflushed deferred fences) or could make a 
choice (e.g. based on a heuristic for how much work is already scheduled).

BTW, about those unflushed deferred fences: if you take a look at 
st_cb_syncobj.c, you'll see that GL sync objects *don't* force a flush 
(they use deferred fences), and radeonsi does not flush in 
fence_server_sync either (although it currently simply does nothing at 
all, which may change). So you can get into the deadlock described above 
with GL sync objects unless you flush explicitly, and AFAIU that's 
consistent with the spec.

See also the discussion on this patch: 
https://patchwork.freedesktop.org/patch/184176/

I've updated the branch at 
https://cgit.freedesktop.org/~nh/mesa/log/?h=fences-threads-ddebug with 
a combination of related patch series including the one I linked to in 
this mail.


> Sorry for the long reply!

There's nothing quite like having a cup of tea and thinking about 
multi-threading :D

Cheers,
Nicolai


> 
> Regards,
> Andres
> 
> 
>> Cheers,
>> Nicolai
>>
>>
>>> +}
>>> +
>>>   void
>>>   st_init_semaphoreobject_functions(struct dd_function_table *functions)
>>>   {
>>>      functions->NewSemaphoreObject = st_semaphoreobj_alloc;
>>>      functions->DeleteSemaphoreObject = st_semaphoreobj_free;
>>>      functions->ImportSemaphoreFd = st_import_semaphoreobj_fd;
>>> +   functions->ServerWaitSemaphoreObject = st_server_wait_semaphore;
>>> +   functions->ServerSignalSemaphoreObject = st_server_signal_semaphore;
>>>   }
>>>
>>
>>


-- 
Learn how the world really is,
but never forget how it should be.

