[Intel-gfx] [RFC] drm/i915/tgl: Advanced preparser support for GPU relocs

Fri Aug 23 15:28:42 UTC 2019

Quoting Chris Wilson (2019-08-23 16:10:48)
> Quoting Daniele Ceraolo Spurio (2019-08-23 16:05:45)
> > 
> > 
> > On 8/23/19 7:26 AM, Chris Wilson wrote:
> > > Quoting Chris Wilson (2019-08-23 08:27:25)
> > >> Quoting Daniele Ceraolo Spurio (2019-08-23 03:09:09)
> > >>> TGL has an improved CS pre-parser that can now pre-fetch commands across
> > >>> batch boundaries. This improves performance when lots of small batches
> > >>> are used, but has an impact on self-modifying code. If we want to modify
> > >>> the content of a batch from another ring/batch, we need to either
> > >>> guarantee that the memory location is updated before the pre-parser gets
> > >>> to it or we need to turn the pre-parser off around the modification.
> > >>> In i915, we use self-modifying code only for GPU relocations.
> > >>>
> > >>> The pre-parser fetches across memory synchronization commands as well,
> > >>> so the only way to guarantee that the writes land before the parser gets
> > >>> to it is to have more instructions between the sync and the destination
> > >>> than the parser FIFO depth, which is not an optimal solution.
> > >>
> > >> Well, our ABI is that memory is coherent before the breadcrumb of *each*
> > >> batch. That is a fundamental requirement for our signaling to userspace.
> > >> Please tell me that there is a context flag to turn this off, or else
> > >> we need to emit 32x flushes or whatever it takes.
> > > 
> > Are you referring to the specific case where we have a request modifying 
> > an object that is then used as a batch in the next request? Because 
> > coherency of objects that are not executed as batches is not impacted.
> 
> "Fetches across memory sync" sounds like a major ABI break. The batches
> are a hard serialisation barrier, with memory coherency guaranteed prior
> to the signaling at the end of one batch and clear caches guaranteed at
> the start of the next.

We have relocs, oa and sseu all using self-modifying code. I expect we
will have PTE modifications and much more done via the GPU in the near
future. All rely on the CS_STALL doing exactly what it says on the tin.
-Chris
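
The hazard discussed in this thread can be illustrated with a toy model.
The sketch below is not i915 code and does not describe the real TGL
hardware. It assumes a pre-parser that latches copies of up to `depth`
commands ahead of the one being executed, and every name in it (struct
cmd, run_stream, patched_value_seen, the OP_* values) is hypothetical.
Under those assumptions, a store that patches a later command is only
observed if the target lies further ahead than the FIFO depth, or if
prefetching is off (depth 0).

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the i915 driver or TGL hardware. */
enum { OP_NOP = 0, OP_STORE = 1, OP_EMIT = 2 };

struct cmd {
	int op;
	size_t target;	/* OP_STORE: index of the command to patch */
	int value;	/* OP_STORE: new value; OP_EMIT: value reported */
};

#define MAX_CMDS 64

/*
 * Execute n commands. Before executing command pc, the modelled pre-parser
 * has latched copies of every command up to index pc + depth. depth == 0
 * models the pre-parser being disabled. Returns the value carried by the
 * last OP_EMIT executed, or -1 if there was none.
 */
int run_stream(struct cmd *mem, size_t n, size_t depth)
{
	struct cmd fetched[MAX_CMDS];
	size_t head = 0;	/* number of commands latched so far */
	int seen = -1;

	assert(n <= MAX_CMDS);

	for (size_t pc = 0; pc < n; pc++) {
		/* The pre-parser runs ahead of execution. */
		while (head < n && head <= pc + depth) {
			fetched[head] = mem[head];
			head++;
		}

		struct cmd c = fetched[pc];

		switch (c.op) {
		case OP_STORE:
			assert(c.target < n);
			/* The write reaches memory, not the latched copy. */
			mem[c.target].value = c.value;
			break;
		case OP_EMIT:
			seen = c.value;
			break;
		default:
			break;
		}
	}
	return seen;
}

/*
 * Build a stream whose first command patches the command `distance` slots
 * later from 0 to 42, then report which value that command executed with.
 */
int patched_value_seen(size_t distance, size_t depth)
{
	struct cmd mem[MAX_CMDS] = { 0 };

	assert(distance >= 1 && distance < MAX_CMDS);
	mem[0] = (struct cmd){ OP_STORE, distance, 42 };
	mem[distance] = (struct cmd){ OP_EMIT, 0, 0 };
	return run_stream(mem, distance + 1, depth);
}
```

In this model, patched_value_seen(3, 4) returns the stale 0 because the
target was latched before the store executed, while
patched_value_seen(5, 4) and patched_value_seen(1, 0) both return 42.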