[RFC PATCH] drm: disable WC optimization for cache coherent devices on non-x86

Tue Jan 22 21:07:07 UTC 2019

On Tue, 22 Jan 2019 at 21:56, Alex Deucher <alexdeucher at gmail.com> wrote:
>
> On Tue, Jan 22, 2019 at 4:19 AM Ard Biesheuvel
> <ard.biesheuvel at linaro.org> wrote:
> >
> > On Mon, 21 Jan 2019 at 20:04, Michel Dänzer <michel at daenzer.net> wrote:
> > >
> > > On 2019-01-21 7:28 p.m., Ard Biesheuvel wrote:
> > > > On Mon, 21 Jan 2019 at 19:24, Michel Dänzer <michel at daenzer.net> wrote:
> > > >> On 2019-01-21 7:20 p.m., Ard Biesheuvel wrote:
> > > >>> On Mon, 21 Jan 2019 at 19:04, Michel Dänzer <michel at daenzer.net> wrote:
> > > >>>> On 2019-01-21 6:59 p.m., Ard Biesheuvel wrote:
> > > >>>>> On Mon, 21 Jan 2019 at 18:55, Michel Dänzer <michel at daenzer.net> wrote:
> > > >>>>>> On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
> > > >>>>>>> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig <hch at infradead.org> wrote:
> > > >>>>>>>
> > > >>>>>>>> Until that happens we should just change the driver ifdefs to default
> > > >>>>>>>> the hacks to off and only enable them on setups where we 100%
> > > >>>>>>>> positively know that they actually work.  And document that fact
> > > >>>>>>>> in big fat comments.
> > > >>>>>>>
> > > >>>>>>> Well, as I mentioned in my commit log as well, if we default to off
> > > >>>>>>> unless CONFIG_X86, we may break working setups on MIPS and Power where
> > > >>>>>>> the device is in fact non-cache coherent, and relies on this
> > > >>>>>>> 'optimization' to get things working.
> > > >>>>>>
> > > >>>>>> FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
> > > >>>>>> correct basic operation (the scenario Christian brought up is a very
> > > >>>>>> specialized use-case), so that shouldn't be an issue.
> > > >>>>>
> > > >>>>> The point is that this is only true for x86.
> > > >>>>>
> > > >>>>> On other architectures, the use of non-cached mappings on the CPU side
> > > >>>>> means that you /do/ rely on non-snooped transfers, since if those
> > > >>>>> transfers turn out not to snoop inadvertently, the accesses are
> > > >>>>> incoherent with the CPU's view of memory.
> > > >>>>
> > > >>>> The driver generally only uses non-cached mappings if
> > > >>>> drm_arch/device_can_wc_memory returns true.
> > > >>>
> > > >>> Indeed. And so we should take care to only return 'true' from that
> > > >>> function if it is guaranteed that non-cached CPU mappings are coherent
> > > >>> with the mappings used by the GPU, either because that is always the
> > > >>> case (like on x86), or because we know that the platform in question
> > > >>> implements NoSnoop correctly throughout the interconnect.
> > > >>>
> > > >>> What seems to be complicating matters is that in some cases, the
> > > >>> device is non-cache coherent to begin with, so regardless of whether
> > > >>> the NoSnoop attribute is used or not, those accesses will not snoop in
> > > >>> the caches and be coherent with the non-cached mappings used by the
> > > >>> CPU. So if we restrict this optimization [on non-X86] to platforms
> > > >>> that are known to implement NoSnoop correctly, we may break platforms
> > > >>> that are implicitly NoSnoop all the time.
> > > >>
> > > >> Since the driver generally doesn't rely on non-snooped accesses for
> > > >> correctness, that couldn't "break" anything that hasn't always been broken.
> > > >
> > > > Again, that is only true on x86.
> > > >
> > > > On other architectures, DMA writes from the device may allocate in the
> > > > caches, and be invisible to the CPU when it uses non-cached mappings.
> > >
> > > Let me try one last time:
> > >
> > > If drm_arch_can_wc_memory returns false, the driver falls back to the
> > > normal mode of operation, using a cacheable CPU mapping and snooped GPU
> > > transfers, even if userspace asks (as a performance optimization) for a
> > > write-combined CPU mapping and non-snooped GPU transfers via
> > > AMDGPU_GEM_CREATE_CPU_GTT_USWC.
> >
> > Another question: when userspace requests for such a mapping to be
> > created, does this involve pages that are mapped cacheable into the
> > userland process?
>
> AMDGPU_GEM_CREATE_CPU_GTT_USWC means the buffer should be uncached and
> write combined from the CPU's perspective (hence the 'CPU' in the flag
> name).  On the GPU side if that flag is set, we do an non-snooped GPU
> mapping for better performance if the buffer ends up getting mapped
> into the GPU's address space for GPU access.
>

Yes, so much was clear. And the reason this breaks on some arm64
systems is because
a) non-snooped PCIe TLP attributes may be ignored, and
b) non-x86 CPUs do not snoop the caches when accessing uncached mappings.

I don't think there is actually any disagreement on this part. And I
think my patch is reasonable, only Christoph is objecting to it on the
grounds that drivers should not go around the DMA API and create
vmap()s of DMA pages with self chosen attributes.

What I am trying to figure out is why the coherency problem exists:
- it could be that the device is reading using cached mappings and
sees stale clean cachelines that shadow data written by the CPU using
uncached mappings
- or the device is writing with cached mappings, resulting in data to
be allocated there, but the CPU does not see it because it reads using
uncached mappings.

In the former case, I would expect cache invalidation right after the
vmap() to be sufficient, but in the latter case, it is a bit more
difficult probably.

So my question was whether there are cacheable aliases for those
uncached vmap()s being used?