[RFC PATCH] drm/ttm: force cached mappings for system RAM on ARM

Will Deacon will.deacon at arm.com
Mon Jan 14 19:35:48 UTC 2019


[+ BenH and MPE]

On Mon, Jan 14, 2019 at 07:21:08PM +0000, Koenig, Christian wrote:
> Am 14.01.19 um 20:13 schrieb Will Deacon:
> > On Mon, Jan 14, 2019 at 07:07:54PM +0000, Koenig, Christian wrote:
> >> Am 14.01.19 um 18:32 schrieb Ard Biesheuvel:
> >>              - The reason remapping the CPU side as cacheable does work (which I
> >>              did test) is because the GPU's uncacheable accesses (which I assume
> >>              are made using the NoSnoop PCIe transaction attribute) are actually
> >>              emitted as cacheable in some cases.
> >>                 . On my AMD Seattle, with or without SMMU (which is stage 2 only), I
> >>              must use cacheable accesses from the CPU side or things are broken.
> >>              This might be a h/w flaw, though.
> >>                 . On systems with stage 1+2 SMMUs, the driver uses stage 1
> >>              translations which always override the memory attributes to cacheable
> >>              for DMA coherent devices. This is what is affecting the Cavium
> >>              ThunderX2 (although it appears the attributes emitted by the RC may be
> >>              incorrect as well.)
> >>
> >>              The latter issue is a shortcoming in the SMMU driver that we have to
> >>              fix, i.e., it should take care not to modify the incoming attributes
> >>              of DMA coherent PCIe devices for NoSnoop to be able to work.
> >>
> >>              So in summary, the mismatch appears to be between the CPU accessing
> >>              the vmap region with non-cacheable attributes and the GPU accessing
> >>              the same memory with cacheable attributes, resulting in a loss of
> >>              coherency and lots of visible corruption.
> >>
> >>          Actually it is the other way around. The CPU thinks some data is in the
> >>          cache and the GPU only updates the system memory version because the
> >>          snoop flag is not set.
> >>
> >>
> >>      That doesn't seem to be what is happening. As far as we can tell from
> >>      our experiments, all inbound transactions are always cacheable, and so
> >>      the only way to make things work is to ensure that the CPU uses the
> >>      same attributes.
> >>
> >>
> >> Ok that doesn't make any sense. If inbound transactions are cacheable or not is
> >> irrelevant when the CPU always uses uncached accesses.
> >>
> >> See on the PCIe side you have the snoop bit in the read/write transactions
> >> which tells the root hub if the device wants to snoop caches or not.
> >>
> >> When the CPU accesses some memory as cached then devices need to snoop the
> >> cache for coherent accesses.
> >>
> >> When the CPU accesses some memory as uncached then devices can disable snooping
> >> to improve performance, but when they don't do this it is mandated by the spec
> >> that this still works.
> > Which spec?
> 
> The PCIe spec. The snoop bit (or rather the NoSnoop) in the transaction 
> is perfectly optional IIRC.

Thanks for the clarification. I suspect the devil is in the details, so I'll
try to dig up the spec.

> > The Arm architecture (and others including Power afaiu) doesn't
> > guarantee coherency when memory is accessed using mismatched cacheability
> > attributes.
> 
> Well what exactly goes wrong on ARM?

Coherency (and any ordering guarantees) can be lost, so the device may see a
stale copy of the memory it is accessing. The architecture requires cache
maintenance to restore coherency between the mismatched aliases.

> As far as I know Power doesn't really supports un-cached memory at all, 
> except for a very very old and odd configuration with AGP.

Hopefully Michael/Ben can elaborate here, but I was under the (possibly
mistaken) impression that mismatched attributes could cause a machine-check
on Power.

> I mean in theory I agree that devices should use matching cacheability 
> attributes, but in practice I know of quite a bunch of devices/engines 
> which fails to do this correctly.

Given that the experiences of Ard and I so far has been that the system
ends up making everything cacheable after the RC, perhaps that's an attempt
by system designers to correct for these devices. Unfortunately, it doesn't
help if the CPU carefully goes ahead and establishes a non-cacheable mapping
for itself!

Will


More information about the dri-devel mailing list