Threaded submission & semaphore sharing

Lionel Landwerlin lionel.g.landwerlin at intel.com
Fri Aug 2 09:41:08 UTC 2019


Hey David,

On 02/08/2019 12:11, zhoucm1 wrote:
>
> Hi Lionel,
>
> For a binary semaphore, I guess everyone will assume the application 
> guarantees the wait comes after the signal, whether the semaphore is shared 
> or used within a single process.
>
> I think below two options can fix your problem:
>
> a. Can we extend vkWaitForFence so that it is able to wait on 
> fence availability? If the fence is available, then it's safe to do the 
> semaphore wait in vkQueueSubmit.
>

I'm sorry, but I don't understand what vkWaitForFence() has to do with 
this problem.

The test case we're struggling with doesn't use that API.


Can you maybe explain a bit more how it relates?


> b. Make waitBeforeSignal valid for binary semaphores as well; that 
> way, it is reasonable to add wait/signal counting for binary syncobjs.
>

Yeah, essentially the change we're proposing internally makes binary 
semaphores use syncobj timelines.

There is just another u64 associated with them.
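
To make that more concrete, here is a minimal sketch of the idea (not the 
actual Mesa patches; the struct and helper names below are just made up for 
illustration, and it assumes the timeline syncobj uAPI from this series, 
i.e. drmSyncobjTimelineWait()/drmSyncobjTimelineSignal() and the 
WAIT_AVAILABLE flag):

#include <stdint.h>
#include <stdatomic.h>
#include <xf86drm.h>   /* drmSyncobjTimelineWait/Signal() */

struct binary_semaphore {
   uint32_t syncobj;        /* shared across devices via opaque FD */
   _Atomic uint64_t point;  /* the extra u64 payload */
};

/* Application thread, inside vkQueueSubmit(queueA, signal semA):
 * reserve the next point immediately, so the ordering is fixed before
 * the work is handed off to the submission thread. */
static uint64_t semaphore_reserve_point(struct binary_semaphore *sem)
{
   return atomic_fetch_add(&sem->point, 1) + 1;
}

/* queueA's submission thread, once the job has actually been built:
 * attach/signal the reserved point (simplified here to a direct signal). */
static int semaphore_signal_point(int drm_fd, struct binary_semaphore *sem,
                                  uint64_t point)
{
   return drmSyncobjTimelineSignal(drm_fd, &sem->syncobj, &point, 1);
}

/* queueB's submission thread, before submitting a job waiting on semA:
 * block until the reserved point has a fence attached (WAIT_AVAILABLE),
 * which restores the vkQueueSubmit() ordering even though the two
 * devices have independent submission threads. */
static int semaphore_wait_point(int drm_fd, struct binary_semaphore *sem,
                                uint64_t point)
{
   return drmSyncobjTimelineWait(drm_fd, &sem->syncobj, &point, 1,
                                 INT64_MAX,
                                 DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE,
                                 NULL);
}

Since a binary semaphore can be shared across processes as a single FD, the 
u64 couldn't really live in userspace like above; as described in the quoted 
mail below, it would have to live in the DRM syncobj itself, with an ioctl to 
read & increment it. The flow stays the same though.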


-Lionel


>
> -David
>
>
> On 2019-08-02 14:27, Lionel Landwerlin wrote:
>> On 02/08/2019 09:10, Koenig, Christian wrote:
>>>
>>>
>>> On 02.08.2019 07:38, Lionel Landwerlin 
>>> <lionel.g.landwerlin at intel.com> wrote:
>>>
>>>     On 02/08/2019 08:21, Koenig, Christian wrote:
>>>
>>>
>>>
>>>         On 02.08.2019 07:17, Lionel Landwerlin
>>>         <lionel.g.landwerlin at intel.com> wrote:
>>>
>>>             On 02/08/2019 08:08, Koenig, Christian wrote:
>>>
>>>                 Hi Lionel,
>>>
>>>                 Well that looks more like your test case is buggy.
>>>
>>>                 According to the code the ctx1 queue always waits
>>>                 for sem1 and ctx2 queue always waits for sem2.
>>>
>>>
>>>             That's supposed to be the same underlying syncobj
>>>             because it's exported from one VkDevice as opaque FD
>>>             from sem1 and imported into sem2.
>>>
>>>
>>>         Well then that's still buggy and won't synchronize at all.
>>>
>>>         When ctx1 waits for a semaphore and then signals the same
>>>         semaphore there is no guarantee that ctx2 will run in
>>>         between jobs.
>>>
>>>         It's perfectly valid in this case to first run all jobs from
>>>         ctx1 and then all jobs from ctx2.
>>>
>>>
>>>     That's not really how I see the semaphores working.
>>>
>>>     The spec describes VkSemaphore as an interface to an internal
>>>     payload opaque to the application.
>>>
>>>
>>>     When ctx1 waits on the semaphore, it waits on the payload put
>>>     there by the previous iteration.
>>>
>>>
>>> And who says that it's not waiting for its own previous payload?
>>
>>
>> That's what I understood from your previous comment: "there is no 
>> guarantee that ctx2 will run in between jobs"
>>
>>
>>>
>>> See, if the payload is a counter this won't work either. Keep in mind 
>>> that this has the semantic of a semaphore. Whoever grabs the 
>>> semaphore first wins and can run, everybody else has to wait.
>>
>>
>> What performs the "grab" here?
>>
>> I thought that would be vkQueueSubmit().
>>
>> Since that occurs from a single application thread, execution should 
>> then be ordered ctx1, ctx2, ctx1, ...
>>
>>
>> Thanks for your time on this,
>>
>>
>> -Lionel
>>
>>
>>>
>>>     Then it proceeds to signal it by replacing the internal payload.
>>>
>>>
>>> That's an implementation detail of our sync objects, but I don't 
>>> think that this behavior is part of the Vulkan specification.
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>>     ctx2 then waits on that and replaces the payload again with the
>>>     new internal synchronization object.
>>>
>>>
>>>     The internal payload is a dma fence in our case and signaling
>>>     just replaces a dma fence by another or puts one where there was
>>>     none before.
>>>
>>>     So we should have created a dependency link between all the
>>>     submissions, and they should be executed in the order of
>>>     QueueSubmit() calls.
>>>
>>>
>>>     -Lionel
>>>
>>>
>>>
>>>         It only prevents running both at the same time and as far as
>>>         I can see that still works even with threaded submission.
>>>
>>>         You need at least two semaphores for a tandem submission.
>>>
>>>         Regards,
>>>         Christian.
>>>
>>>
>>>
>>>                 This way there can't be any synchronization between
>>>                 the two.
>>>
>>>                 Regards,
>>>                 Christian.
>>>
>>>                 On 02.08.2019 06:55, Lionel Landwerlin
>>>                 <lionel.g.landwerlin at intel.com> wrote:
>>>                 Hey Christian,
>>>
>>>                 The problem boils down to the fact that we don't
>>>                 immediately create dma fences when calling
>>>                 vkQueueSubmit().
>>>                 This is delayed to a thread.
>>>
>>>                 From a single application thread, you can
>>>                 QueueSubmit() to 2 queues from 2 different devices.
>>>                 Each QueueSubmit to one queue has a dependency on
>>>                 the previous QueueSubmit on the other queue through
>>>                 an exported/imported semaphore.
>>>
>>>                 From the API point of view, the state of the
>>>                 semaphore should change after each QueueSubmit().
>>>                 The problem is that it doesn't, because of the thread,
>>>                 and because you might have those 2 submission
>>>                 threads tied to different VkDevice/VkInstance or
>>>                 even different applications (synchronizing
>>>                 themselves outside the Vulkan API).
>>>
>>>                 Hope that makes sense.
>>>                 It's not really easy to explain by mail, the best
>>>                 explanation is probably reading the test :
>>>                 https://gitlab.freedesktop.org/mesa/crucible/blob/master/src/tests/func/sync/semaphore-fd.c#L788
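>>>
>>>                 In rough terms the loop does something like this (a
>>>                 sketch, not the actual crucible code; the queues,
>>>                 command buffers and semaphores are assumed to be
>>>                 created already, with semA exported as an opaque FD
>>>                 and imported as semB on the second device) :
>>>
>>>                 VkPipelineStageFlags stage =
>>>                    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT;
>>>                 for (uint32_t i = 0; i < LOOP_COUNT; i++) {
>>>                    VkSubmitInfo infoA = {
>>>                       .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
>>>                       .waitSemaphoreCount = i == 0 ? 0 : 1,
>>>                       .pWaitSemaphores = &semA,
>>>                       .pWaitDstStageMask = &stage,
>>>                       .commandBufferCount = 1,
>>>                       .pCommandBuffers = &cmdA,
>>>                       .signalSemaphoreCount = 1,
>>>                       .pSignalSemaphores = &semA,
>>>                    };
>>>                    vkQueueSubmit(queueA, 1, &infoA, VK_NULL_HANDLE);
>>>
>>>                    VkSubmitInfo infoB = {
>>>                       .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
>>>                       .waitSemaphoreCount = 1,
>>>                       .pWaitSemaphores = &semB,
>>>                       .pWaitDstStageMask = &stage,
>>>                       .commandBufferCount = 1,
>>>                       .pCommandBuffers = &cmdB,
>>>                       .signalSemaphoreCount = 1,
>>>                       .pSignalSemaphores = &semB,
>>>                    };
>>>                    vkQueueSubmit(queueB, 1, &infoB, VK_NULL_HANDLE);
>>>                 }
>>>
>>>                 Each submit waits on the semaphore signaled by the
>>>                 previous one and re-signals it, so the test expects
>>>                 the jobs to execute in vkQueueSubmit() order.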
>>>
>>>                 Like David mentioned, you're not running into that
>>>                 issue right now, because you only dispatch to the
>>>                 thread under specific conditions.
>>>                 But I could build a case to force that and likely
>>>                 run into the same issue.
>>>
>>>                 -Lionel
>>>
>>>                 On 02/08/2019 07:33, Koenig, Christian wrote:
>>>
>>>                     Hi Lionel,
>>>
>>>                     Well, could you describe once more what the
>>>                     problem is?
>>>
>>>                     Because I don't fully understand why a rather
>>>                     normal tandem submission with two semaphores
>>>                     should fail in any way.
>>>
>>>                     Regards,
>>>                     Christian.
>>>
>>>                     On 02.08.2019 06:28, Lionel Landwerlin
>>>                     <lionel.g.landwerlin at intel.com> wrote:
>>>                     There aren't CTS tests covering the issue I was
>>>                     mentioning.
>>>                     But we could add them.
>>>
>>>                     I don't have all the details regarding your
>>>                     implementation but even with
>>>                     the "semaphore thread", I could see it running
>>>                     into the same issues.
>>>                     What if a mix of binary & timeline semaphores
>>>                     is handed to vkQueueSubmit()?
>>>
>>>                     For example with queueA & queueB from 2
>>>                     different VkDevice :
>>>                     vkQueueSubmit(queueA, signal semA);
>>>                     vkQueueSubmit(queueA, wait on [semA,
>>>                     timelineSemB]); with
>>>                     timelineSemB triggering a wait before signal.
>>>                     vkQueueSubmit(queueB, signal semA);
>>>
>>>
>>>                     -Lionel
>>>
>>>                     On 02/08/2019 06:18, Zhou, David(ChunMing) wrote:
>>>                     > Hi Lionel,
>>>                     >
>>>                     > Because the Queue Thread is a heavy thread that
>>>                     is always resident in the driver while the
>>>                     application is running, our guys don't like that.
>>>                     So we switched to a Semaphore Thread: only when a
>>>                     waitBeforeSignal of a timeline happens do we spawn
>>>                     a thread to handle that wait. So we don't have this
>>>                     issue of yours.
>>>                     > By the way, I already pass all your CTS cases for
>>>                     now. I suggest you switch to a Semaphore Thread
>>>                     instead of a Queue Thread as well. It works very
>>>                     well.
>>>                     >
>>>                     > -David
>>>                     >
>>>                     > -----Original Message-----
>>>                     > From: Lionel Landwerlin
>>>                     <lionel.g.landwerlin at intel.com>
>>>                     > Sent: Friday, August 2, 2019 4:52 AM
>>>                     > To: dri-devel
>>>                     <dri-devel at lists.freedesktop.org>;
>>>                     Koenig, Christian <Christian.Koenig at amd.com>;
>>>                     Zhou, David(ChunMing) <David1.Zhou at amd.com>;
>>>                     Jason Ekstrand <jason at jlekstrand.net>
>>>                     > Subject: Threaded submission & semaphore sharing
>>>                     >
>>>                     > Hi Christian, David,
>>>                     >
>>>                     > Sorry to report this so late in the process,
>>>                     but I think we found an issue not directly
>>>                     related to syncobj timelines themselves but with
>>>                     a side effect of the threaded submissions.
>>>                     >
>>>                     > Essentially we're failing a test in crucible :
>>>                     > func.sync.semaphore-fd.opaque-fd
>>>                     > This test creates a single binary semaphore and
>>>                     shares it between 2 VkDevice/VkQueue.
>>>                     > Then in a loop it proceeds to submit workload
>>>                     alternating between the 2 VkQueue with one
>>>                     submit depending on the other.
>>>                     > It does so by waiting on the VkSemaphore
>>>                     signaled in the previous iteration and
>>>                     resignaling it.
>>>                     >
>>>                     > The problem for us is that once things are
>>>                     dispatched to the submission thread, the
>>>                     ordering of the submissions is lost.
>>>                     > Because we have 2 devices and they both have
>>>                     their own submission thread.
>>>                     >
>>>                     > Jason suggested that we reestablish the
>>>                     ordering by having semaphores/syncobjs carry an
>>>                     additional uint64_t payload.
>>>                     > This 64-bit integer would be an
>>>                     identifier that submission threads will
>>>                     WAIT_FOR_AVAILABLE on.
>>>                     >
>>>                     > The scenario would look like this :
>>>                     >       - vkQueueSubmit(queueA, signal on semA);
>>>                     >           - in the caller thread, this would
>>>                     increment the syncobj's additional u64 payload and
>>>                     return it to userspace.
>>>                     >           - at some point the submission
>>>                     thread of queueA submits the workload and signals
>>>                     the syncobj of semA with the value returned in the
>>>                     caller thread of vkQueueSubmit().
>>>                     >       - vkQueueSubmit(queueB, wait on semA);
>>>                     >           - in the caller thread, this would
>>>                     read the syncobj's additional
>>>                     u64 payload
>>>                     >           - at some point the submission
>>>                     thread of queueB will try to submit the work,
>>>                     but first it will WAIT_FOR_AVAILABLE the u64
>>>                     value returned in the step above
>>>                     >
>>>                     > Because we want the binary semaphores to be
>>>                     shared across processes and would like this to
>>>                     remain a single FD, the simplest location to
>>>                     store this additional u64 payload would be the
>>>                     DRM syncobj.
>>>                     > It would need an additional ioctl to read &
>>>                     increment the value.
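>>>                     >
>>>                     > Purely as an illustration (a hypothetical uAPI
>>>                     sketch, every name below is invented), it could
>>>                     look like :
>>>                     >
>>>                     > /* Atomically read & increment the extra u64
>>>                     >  * payload of a binary syncobj, so the value can
>>>                     >  * be shared across processes through the
>>>                     >  * existing opaque FD. */
>>>                     > struct drm_syncobj_binary_increment {
>>>                     >         __u32 handle;  /* in: syncobj handle */
>>>                     >         __u32 pad;
>>>                     >         __u64 point;   /* out: post-increment value */
>>>                     > };
>>>                     >
>>>                     > The caller thread of vkQueueSubmit() would use it
>>>                     to get the point to signal/wait on, and the
>>>                     submission threads would then WAIT_FOR_AVAILABLE on
>>>                     that point before submitting.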
>>>                     >
>>>                     > What do you think?
>>>                     >
>>>                     > -Lionel
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
