[Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

Mon May 26 18:07:04 PDT 2014

On Mon, May 26, 2014 at 5:02 PM, Alexandre Courbot <gnurou at gmail.com> wrote:
> On Mon, May 26, 2014 at 6:21 PM, Lucas Stach <l.stach at pengutronix.de> wrote:
>> Am Montag, den 26.05.2014, 09:45 +0300 schrieb Terje Bergström:
>>> On 23.05.2014 17:40, Alex Courbot wrote:
>>> > On 05/23/2014 06:59 PM, Lucas Stach wrote:
>>> > So after checking with more knowledgeable people, it turns out this is
>>> > the expected behavior on ARM and BAR regions should be mapped uncached
>>> > on GK20A. All the more reasons to avoid using the BAR at all.
>>>
>>> This is actually specific to Tegra.
>>>
>>> >> You may want to make yourself aware of all the quirks required for
>>> >> sharing memory between the GPU and CPU on an ARM host. I think there are
>>> >> far more involved than what you see now and writing an replacement for
>>> >> TTM will not be an easy task.
>>> >>
>>> >> Doing away with the concept of two memory areas will not get you to a
>>> >> single unified address space. You would have to deal with things like
>>> >> not being able to change the caching state of pages in the systems
>>> >> lowmem yourself. You will still have to deal with remapping pages that
>>> >> aren't currently visible to the CPU (ok this is not an issue on Jetson
>>> >> right now as it only has 2GB of RAM), because it's in systems highmem,
>>> >> or even in a different LPAE area.
>>> >>
>>> >> You really want to be sure you are aware of all the consequences of
>>> >> this, before considering this task.
>>> >
>>> > Yep, that's why I am seeking advice here. My first hope is that with a
>>> > few tweaks we will be able to keep using TTM and the current nouveau_bo
>>> > implementation. But unless I missed something this is not going to be easy.
>>> >
>>> > We can also use something like the patch I originally sent to make it
>>> > work, although not with good performance, on GK20A. Not very graceful,
>>> > but it will allow applications to run.
>>> >
>>> > In the long run though, we will want to achieve better performance, and
>>> > it seems like a BO implementation targeted at UMA devices would also be
>>> > beneficial to quite a few desktop GPUs. So as tricky as it may be I'm
>>> > interested in gathering thoughts and why not giving it a first try with
>>> > GK20A, even if it imposes some limitations like having buffers in lowmem
>>> > in a first time (we can probably live with this one for a short while,
>>> > and 64 bits will also be coming to the rescue :))
>>>
>>> I don't think lowmem or LPAE is any problem, if the memory manager is
>>> designed with that in mind. Vast majority of the buffers kernel
>>> allocates do not need to be touched in kernel space.
>>>
>>> Actually I can't think of any buffers that we allocate on behalf of user
>>> space that would need to be permanently mapped also to kernel. In case
>>> or relocs only push buffer needs to be temporarily mapped to kernel.
>>>
>>> Ultimately even relocs are not necessary if we expose GPU virtual
>>> addresses directly to user space. But that's another topic.
>>>
>> Nouveau already exposes constant virtual addresses to userspace and
>> skips the pushbuf patching when the presumed offset from userspace is
>> the same as what the kernel thinks it should be.
>>
>> The problem with lowmem on ARM is that you can't unmap those pages from
>> the kernel cached mapping. So if you alloc a page, give it to userspace
>> and userspace decides to map the page WC you just produced a conflicting
>> mapping, which may yield undefined results on ARMv7. You may think this
>> is not a problem as you are not touching the kernel cached mapping, but
>> in fact it is. The CPUs prefetcher can still access this mapping.
>
> Why would this memory be mapped into the kernel?

On ARM the kernel keeps a linear mapping of lowmem using sections
(ARM's version of huge pages). This is always cached, and because the
sections are not 4k, it's a pain to remove parts of it. See
arch/arm/mm/mmu.c

That said, I don't think this issue exists on A15 (which is what those
GPUs are paired with), so it's a purely theoretical problem.

Stéphane