[Intel-gfx] [PATCH v2 2/2] drm/i915/bxt: Fix inadvertent CPU snooping due to incorrect MOCS config

Wed Apr 27 18:42:43 UTC 2016

On 27/04/16 15:53, Chris Wilson wrote:
> On Wed, Apr 27, 2016 at 04:25:09PM +0300, Eero Tamminen wrote:
>> Hi,
>>
>> On 26.04.2016 20:25, Frederick, Michael T wrote:
>>> Sorry I'm not tracking all the MOCS discussions.  I just want to explain what coherency means in the SoC for BXT.
>>>
>>> GTI sets the non-inclusive bit on the IDI interface based on how it treats the memory.  In the BXT case, where there is no uncore cache, "non-inclusive" just indicates whether to snoop or not.  BXT has a snoop filter so that the latency of snooping GT from a core is roughly similar to that of snooping another core.
>>>
>>> For BXT:
>>> If GTI sets non-inclusive=0 (i.e. coherent): the transaction looks up in the SF and the SA snoops the cores.  The potential impact is that for high-BW coherent traffic, the SF becomes the BW limiter of the system and caps BW at 33% * 34GBps (about 11GBps).  For writes such as WCILFs, snoops to the cores must be resolved before the SA requests WR data from GT.  For reads, the common case should see no impact because snoop latency is generally much lower than memory data latency.  In general snoop latency for a core is relatively small, but there is also the prospect that a core could be down (e.g. during a ratio change) or loaded with snooping.
>>> If GTI sets non-inclusive=1 (i.e. non-coherent): the transaction takes the SF bypass and the SA does not snoop the cores.  This is best for high BW since it removes the SF bottleneck and doesn't require core interaction.
>>
>> Thanks for the explanation!
>>
>> AFAIK:
>>
>> * With regard to 3D driver operations, the CPU side doesn't modify the
>> buffer contents while the GPU is working on them.  The CPU side sets up
>> the buffers (textures, VBOs, batches etc), and then (after a flush) the
>> GPU is asked to act on them.
>>
>> * For things like texture streaming, the driver either internally
>> synchronizes the data or creates a new copy of it whenever the
>> application indicates that the data has been updated.  There's always
>> some kind of "upload" involved (the GL API needs it because
>> non-integrated GPUs don't share memory with the CPU).
>>
>> While it's possible that there's a case where snooping would be
>> needed, I cannot think of any myself.
>>
>> Daniel, Chris, did you have a concrete example in mind where the 3D
>> driver would require the CPU to snoop the GPU?
>
> Not mesa, but X can do concurrent rendering to a Pixmap whilst also
> rendering from other parts of that Pixmap into a GPU side buffer and
> presentation/compositing thereof. X uses snooping both ways (from client
> memory to GPU and from GPU to client memory) as well as mixed rendering.
>
> Mesa should be using snooping for both SubTexImage and GetTexImage. On
> the SubTexImage path you can use the sampler to do format conversions,
> which, even including the sync overhead needed for correctness when
> using client memory, avoids the awful format conversion code in mesa.
> Using the GPU to write into client memory and avoiding WC reads is
> approximately an order of magnitude (8x) faster than the current code
> mesa uses.
> -Chris

Presumably it's useful for the CPU to snoop the h/w status page(s), and
maybe the ring-context part of a context image (so that TAIL updates are 
coherent), but OTOH snooping the rest of the context image might add 
overhead, and AFAIK we don't normally read (or write) any of that after 
setup. So maybe we don't want vmap-whole-object after all?

.Dave.