[Mesa-dev] Linux Graphics Next: Userspace submission update
Marek Olšák
maraeo at gmail.com
Thu Jun 17 18:28:06 UTC 2021
The kernel will know who should touch the implicit-sync semaphore next. At
the same time, a copy of every write request to the implicit-sync
semaphore will be forwarded to the kernel for monitoring and bo_wait.
Syncobjs could either use the same monitored access as implicit sync or be
completely unmonitored. We haven't decided yet.
Syncfiles could either use one of the above or wait for a syncobj to go
idle before converting to a syncfile.
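
As a rough sketch of the kernel-side monitoring (not actual kernel or amdgpu
code; every struct and function name below is made up for illustration, and
the rule "a valid write comes from the expected writer and carries the next
sequence number" is my assumption about how the check could work):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one forwarded copy of a write to an implicit-sync semaphore. */
struct sync_write_log_entry {
    uint32_t writer_id; /* who issued the write, e.g. a VMID (assumption) */
    uint32_t value;     /* sequence value that was written */
};

/* Hypothetical: what the kernel tracks per semaphore. */
struct sync_monitor {
    uint32_t last_seq;        /* last sequence value accepted as valid */
    uint32_t expected_writer; /* who should touch the semaphore next */
};

/*
 * Check one logged write. Returns true and advances last_seq if the write
 * came from the expected writer with the next sequence number. Returns
 * false otherwise; e->writer_id is then the one to blame.
 */
bool sync_monitor_check_write(struct sync_monitor *m,
                              const struct sync_write_log_entry *e)
{
    assert(m != NULL && e != NULL);

    /* uint32_t arithmetic, so the sequence number wraps naturally. */
    uint32_t next_seq = (uint32_t)(m->last_seq + 1u);

    if (e->writer_id != m->expected_writer || e->value != next_seq)
        return false;

    m->last_seq = e->value;
    return true;
}

/* Record who is supposed to write the next sequence number. */
void sync_monitor_set_next_writer(struct sync_monitor *m, uint32_t writer_id)
{
    assert(m != NULL);
    m->expected_writer = writer_id;
}

/* If a wait (e.g. bo_wait) times out, whoever was expected next is blamed. */
uint32_t sync_monitor_blame_on_timeout(const struct sync_monitor *m)
{
    assert(m != NULL);
    return m->expected_writer;
}
```

The point is only that the kernel needs two pieces of state per semaphore:
the last good sequence value and the next expected writer. Everything else
can be derived from the forwarded write log.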
Marek
On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter <daniel at ffwll.ch> wrote:
> On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> > As long as we can figure out who touched a certain sync object last, that
> > would indeed work, yes.
>
> Don't you need to know who will touch it next, i.e. who is holding up your
> fence? Or maybe I'm just again totally confused.
> -Daniel
>
> >
> > Christian.
> >
> > On 14.06.21 at 19:10, Marek Olšák wrote:
> > > The call to the hw scheduler has a limitation on the size of all
> > > parameters combined. I think we can only pass a 32-bit sequence number
> > > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > > else.
> > >
> > > The syncobj handle can be an element index in a global (per-GPU) syncobj
> > > table and it's read-only for all processes with the exception of the
> > > signal command. Syncobjs can either have per-VMID write access flags for
> > > the signal command (slow), or any process can write to any syncobj and
> > > only rely on the kernel checking the write log (fast).
> > >
> > > In any case, we can execute the memory write in the queue engine and
> > > only use the hw scheduler for logging, which would be perfect.
> > >
> > > Marek
> > >
> > > On Thu, Jun 10, 2021 at 12:33 PM Christian König
> > > <ckoenig.leichtzumerken at gmail.com> wrote:
> > >
> > > Hi guys,
> > >
> > > maybe soften that a bit. Reading from the shared memory of the
> > > user fence is ok for everybody. What we need to take more care of
> > > is the writing side.
> > >
> > > So my current thinking is that we allow read-only access, but
> > > writing a new sequence value needs to go through the scheduler/kernel.
> > >
> > > So when the CPU wants to signal a timeline fence it needs to call
> > > an IOCTL. When the GPU wants to signal the timeline fence it needs
> > > to hand that off to the hardware scheduler.
> > >
> > > If we lock up, the kernel can check with the hardware who did the
> > > last write and what value was written.
> > >
> > > That, together with an IOCTL to give out sequence numbers for
> > > implicit sync to applications should be sufficient for the kernel
> > > to track who is responsible if something bad happens.
> > >
> > > In other words, when the hardware says that the shader wrote stuff
> > > like 0xdeadbeef, 0x0 or 0xffffffff into memory, we kill the process
> > > that did that.
> > >
> > > If the hardware says that seq - 1 was written fine, but seq is
> > > missing, then the kernel blames whoever was supposed to write seq.
> > >
> > > Just piping the write through a privileged instance should be
> > > fine to make sure that we don't run into issues.
> > >
> > > Christian.
> > >
> > > On 10.06.21 at 17:59, Marek Olšák wrote:
> > > > Hi Daniel,
> > > >
> > > > We just talked about this whole topic internally and we came
> > > > to the conclusion that the hardware needs to understand sync
> > > > object handles and have high-level wait and signal operations in
> > > > the command stream. Sync objects will be backed by memory, but
> > > > they won't be readable or writable by processes directly. The
> > > > hardware will log all accesses to sync objects and will send the
> > > > log to the kernel periodically. The kernel will identify
> > > > malicious behavior.
> > > >
> > > > Example of a hardware command stream:
> > > > ...
> > > > // the sequence number is assigned by the kernel
> > > > ImplicitSyncWait(syncObjHandle, sequenceNumber);
> > > > Draw();
> > > > ImplicitSyncSignalWhenDone(syncObjHandle);
> > > > ...
> > > >
> > > > I'm afraid we have no other choice because of the TLB
> > > > invalidation overhead.
> > > >
> > > > Marek
> > > >
> > > >
> > > > On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter <daniel at ffwll.ch> wrote:
> > > >
> > > > On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> > > > > On 09.06.21 at 15:19, Daniel Vetter wrote:
> > > > > > [SNIP]
> > > > > > > Yeah, we call this the lightweight and the heavyweight tlb flush.
> > > > > > >
> > > > > > > The lightweight can be used when you are sure that you don't have any of the
> > > > > > > PTEs currently in flight in the 3D/DMA engine and you just need to
> > > > > > > invalidate the TLB.
> > > > > > >
> > > > > > > The heavyweight must be used when you need to invalidate the TLB *AND* make
> > > > > > > sure that no concurrent operation moves new stuff into the TLB.
> > > > > > >
> > > > > > > The problem is for this use case we have to use the heavyweight one.
> > > > > > Just for my own curiosity: So the lightweight flush is only for in-between
> > > > > > CS when you know access is idle? Or does that also not work if userspace
> > > > > > has a CS on a dma engine going at the same time because the TLBs aren't
> > > > > > isolated enough between engines?
> > > > >
> > > > > More or less correct, yes.
> > > > >
> > > > > The problem is a lightweight flush only invalidates the TLB, but doesn't
> > > > > take care of entries which have been handed out to the different engines.
> > > > >
> > > > > In other words what can happen is the following:
> > > > >
> > > > > 1. Shader asks TLB to resolve address X.
> > > > > 2. TLB looks into its cache and can't find address X so it asks the walker
> > > > > to resolve.
> > > > > 3. Walker comes back with result for address X and TLB puts that into its
> > > > > cache and gives it to Shader.
> > > > > 4. Shader starts doing some operation using result for address X.
> > > > > 5. You send lightweight TLB invalidate and TLB throws away cached values for
> > > > > address X.
> > > > > 6. Shader happily still uses whatever the TLB gave to it in step 3 to
> > > > > access address X.
> > > > >
> > > > > See it like the shader has its own 1-entry L0 TLB cache which is not
> > > > > affected by the lightweight flush.
> > > > >
> > > > > The heavyweight flush on the other hand sends out a broadcast signal to
> > > > > everybody and only comes back when we are sure that an address is not in use
> > > > > any more.
> > > >
> > > > Ah makes sense. On intel the shaders only operate in VA, everything goes
> > > > around as explicit async messages to IO blocks. So we don't have this, the
> > > > only difference in tlb flushes is between tlb flush in the IB and an mmio
> > > > one which is independent of anything currently being executed on an
> > > > engine.
> > > > -Daniel
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > > >
> > >
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>