<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
Hi guys,<br>
<br>
maybe soften that a bit. Reading from the shared memory of the user
fence is ok for everybody. What we need to take more care of is the
writing side.<br>
<br>
So my current thinking is that we allow read only access, but
writing a new sequence value needs to go through the
scheduler/kernel.<br>
<br>
    So when the CPU wants to signal a timeline fence it needs to call an
    IOCTL. When the GPU wants to signal the timeline fence it needs to
    hand that off to the hardware scheduler.<br>
<br>
    If we lock up, the kernel can check with the hardware who did the last
    write and what value was written.<br>
<br>
    That, together with an IOCTL to give out sequence numbers for implicit
    sync to applications, should be sufficient for the kernel to track
    who is responsible if something bad happens.<br>
<br>
    In other words, when the hardware says that the shader wrote stuff
    like 0xdeadbeef, 0x0 or 0xffffffff into memory, we kill the process
    that did that.<br>
<br>
If the hardware says that seq - 1 was written fine, but seq is
missing then the kernel blames whoever was supposed to write seq.<br>
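    <br>
    As a rough illustration of that bookkeeping, here is a minimal sketch
    in C. It is not existing kernel code: struct timeline,
    timeline_assign(), timeline_signal() and timeline_blame() are
    hypothetical names, and it assumes that sequence numbers are signaled
    strictly in order and that fewer than MAX_PENDING of them are
    outstanding at once.<br>
    <br>

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PENDING 16   /* assumed upper bound on outstanding sequence numbers */
#define NO_OWNER   (-1)  /* returned when nobody is to blame */

/* Hypothetical kernel-side state for one timeline fence. */
struct timeline {
    uint64_t last_assigned;  /* highest sequence number handed out so far */
    uint64_t last_signaled;  /* highest value accepted through the write path */
    int owner[MAX_PENDING];  /* owner[seq % MAX_PENDING]: who must write seq */
};

/* The "give out a sequence number" IOCTL: record who has to signal it. */
static uint64_t timeline_assign(struct timeline *tl, int pid)
{
    /* The owner ring must not wrap over numbers that are still pending. */
    assert(tl->last_assigned - tl->last_signaled < MAX_PENDING);

    uint64_t seq = ++tl->last_assigned;
    tl->owner[seq % MAX_PENDING] = pid;
    return seq;
}

/*
 * The privileged write path (IOCTL or hardware scheduler): accept only the
 * next expected value, and only from its owner. Returns 0 on success and
 * -1 if the write is rejected.
 */
static int timeline_signal(struct timeline *tl, int pid, uint64_t value)
{
    if (value != tl->last_signaled + 1 || value > tl->last_assigned)
        return -1;
    if (tl->owner[value % MAX_PENDING] != pid)
        return -1;
    tl->last_signaled = value;
    return 0;
}

/*
 * After a lockup: the hardware reports who did the last write and what
 * value was written. Returns the pid to blame, or NO_OWNER.
 */
static int timeline_blame(const struct timeline *tl, int hw_writer,
                          uint64_t hw_value)
{
    /* A value that was never handed out (0x0, 0xdeadbeef, ...) is garbage. */
    if (hw_value == 0 || hw_value > tl->last_assigned)
        return hw_writer;
    /* A valid value written by somebody who did not own it. */
    if (tl->owner[hw_value % MAX_PENDING] != hw_writer)
        return hw_writer;
    /* hw_value was written fine but the next one is missing. */
    if (hw_value < tl->last_assigned)
        return tl->owner[(hw_value + 1) % MAX_PENDING];
    return NO_OWNER;
}
```

    <br>
    The point of the sketch is only that the kernel can assign blame from
    two pieces of information: the table of who was handed which sequence
    number, and the last writer and value reported by the hardware.<br>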
<br>
    Just piping the write through a privileged instance should be fine
    to make sure that we don't run into issues. <br>
<br>
Christian.<br>
<br>
    <div class="moz-cite-prefix">On 10.06.21 at 17:59, Marek
      Olšák wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAAxE2A6zwCHPaP5NnRETVe_BOsoVQK1T=h8gqRnUtP4sRFBkrw@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div>Hi Daniel,</div>
<div><br>
</div>
        <div>We just talked about this whole topic internally and we
          came to the conclusion that the hardware needs to
understand sync object handles and have high-level wait and
signal operations in the command stream. Sync objects will be
backed by memory, but they won't be readable or writable by
processes directly. The hardware will log all accesses to sync
objects and will send the log to the kernel periodically. The
kernel will identify malicious behavior.<br>
</div>
<div><br>
</div>
<div>Example of a hardware command stream:</div>
<div>...</div>
<div>ImplicitSyncWait(syncObjHandle, sequenceNumber); // the
sequence number is assigned by the kernel<br>
</div>
<div>Draw();</div>
<div>ImplicitSyncSignalWhenDone(syncObjHandle);</div>
<div>...</div>
<div><br>
</div>
<div>I'm afraid we have no other choice because of the TLB
invalidation overhead.</div>
<div><br>
</div>
<div>Marek<br>
</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Wed, Jun 9, 2021 at 2:31 PM
          Daniel Vetter &lt;<a href="mailto:daniel@ffwll.ch"
            moz-do-not-send="true">daniel@ffwll.ch</a>&gt; wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On
Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:<br>
          > On 09.06.21 at 15:19, Daniel Vetter wrote:<br>
> > [SNIP]<br>
> > > Yeah, we call this the lightweight and the
heavyweight tlb flush.<br>
> > > <br>
          > > > The lightweight can be used when you are sure
that you don't have any of the<br>
> > > PTEs currently in flight in the 3D/DMA engine
and you just need to<br>
> > > invalidate the TLB.<br>
> > > <br>
> > > The heavyweight must be used when you need to
invalidate the TLB *AND* make<br>
          > > > sure that no concurrent operation moves new
stuff into the TLB.<br>
> > > <br>
> > > The problem is for this use case we have to use
the heavyweight one.<br>
> > Just for my own curiosity: So the lightweight flush
is only for in-between<br>
> > CS when you know access is idle? Or does that also
not work if userspace<br>
          > > has a CS on a dma engine going at the same time
          because the TLBs aren't<br>
> > isolated enough between engines?<br>
> <br>
> More or less correct, yes.<br>
> <br>
> The problem is a lightweight flush only invalidates the
TLB, but doesn't<br>
> take care of entries which have been handed out to the
different engines.<br>
> <br>
> In other words what can happen is the following:<br>
> <br>
> 1. Shader asks TLB to resolve address X.<br>
> 2. TLB looks into its cache and can't find address X so
it asks the walker<br>
> to resolve.<br>
> 3. Walker comes back with result for address X and TLB
puts that into its<br>
> cache and gives it to Shader.<br>
> 4. Shader starts doing some operation using result for
address X.<br>
> 5. You send lightweight TLB invalidate and TLB throws
away cached values for<br>
> address X.<br>
> 6. Shader happily still uses whatever the TLB gave to it
in step 3 to<br>
          > access address X.<br>
> <br>
          > See it like the shader has its own one-entry L0 TLB cache
which is not<br>
> affected by the lightweight flush.<br>
> <br>
> The heavyweight flush on the other hand sends out a
broadcast signal to<br>
> everybody and only comes back when we are sure that an
address is not in use<br>
> any more.<br>
<br>
Ah makes sense. On intel the shaders only operate in VA,
everything goes<br>
around as explicit async messages to IO blocks. So we don't
have this, the<br>
only difference in tlb flushes is between tlb flush in the IB
and an mmio<br>
          one which is independent of anything currently being executed
          on an<br>
          engine.<br>
-Daniel<br>
-- <br>
Daniel Vetter<br>
Software Engineer, Intel Corporation<br>
<a href="http://blog.ffwll.ch" rel="noreferrer"
target="_blank" moz-do-not-send="true">http://blog.ffwll.ch</a><br>
</blockquote>
</div>
</blockquote>
<br>
</body>
</html>