[Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

Mon May 26 22:18:40 PDT 2014

On Mon, May 26, 2014 at 7:42 PM, Alexandre Courbot <gnurou at gmail.com> wrote:
> On Tue, May 27, 2014 at 10:07 AM, Stéphane Marchesin
> <stephane.marchesin at gmail.com> wrote:
>> On Mon, May 26, 2014 at 5:02 PM, Alexandre Courbot <gnurou at gmail.com> wrote:
>>> On Mon, May 26, 2014 at 6:21 PM, Lucas Stach <l.stach at pengutronix.de> wrote:
>>>> On Monday, 26.05.2014, at 09:45 +0300, Terje Bergström wrote:
>>>>> On 23.05.2014 17:40, Alex Courbot wrote:
>>>>> > On 05/23/2014 06:59 PM, Lucas Stach wrote:
>>>>> > So after checking with more knowledgeable people, it turns out this is
>>>>> > the expected behavior on ARM and BAR regions should be mapped uncached
>>>>> > on GK20A. All the more reason to avoid using the BAR at all.
>>>>>
>>>>> This is actually specific to Tegra.
>>>>>
>>>>> >> You may want to make yourself aware of all the quirks required for
>>>>> >> sharing memory between the GPU and CPU on an ARM host. I think they are
>>>>> >> far more involved than what you see now, and writing a replacement for
>>>>> >> TTM will not be an easy task.
>>>>> >>
>>>>> >> Doing away with the concept of two memory areas will not get you to a
>>>>> >> single unified address space. You would have to deal with things like
>>>>> >> not being able to change the caching state of pages in the systems
>>>>> >> lowmem yourself. You will still have to deal with remapping pages that
>>>>> >> aren't currently visible to the CPU (ok this is not an issue on Jetson
>>>>> >> right now as it only has 2GB of RAM), because it's in systems highmem,
>>>>> >> or even in a different LPAE area.
>>>>> >>
>>>>> >> You really want to be sure you are aware of all the consequences of
>>>>> >> this, before considering this task.
>>>>> >
>>>>> > Yep, that's why I am seeking advice here. My first hope is that with a
>>>>> > few tweaks we will be able to keep using TTM and the current nouveau_bo
>>>>> > implementation. But unless I missed something this is not going to be easy.
>>>>> >
>>>>> > We can also use something like the patch I originally sent to make it
>>>>> > work, although not with good performance, on GK20A. Not very graceful,
>>>>> > but it will allow applications to run.
>>>>> >
>>>>> > In the long run though, we will want to achieve better performance, and
>>>>> > it seems like a BO implementation targeted at UMA devices would also be
>>>>> > beneficial to quite a few desktop GPUs. So as tricky as it may be I'm
>>>>> > interested in gathering thoughts and, why not, giving it a first try with
>>>>> > GK20A, even if it imposes some limitations at first, like having buffers
>>>>> > in lowmem (we can probably live with this one for a short while,
>>>>> > and 64 bits will also be coming to the rescue :))
>>>>>
>>>>> I don't think lowmem or LPAE is any problem, if the memory manager is
>>>>> designed with that in mind. The vast majority of the buffers the kernel
>>>>> allocates do not need to be touched in kernel space.
>>>>>
>>>>> Actually I can't think of any buffers that we allocate on behalf of user
>>>>> space that would need to be permanently mapped into the kernel as well. In
>>>>> the case of relocs, only the push buffer needs to be temporarily mapped
>>>>> into the kernel.
>>>>>
>>>>> Ultimately even relocs are not necessary if we expose GPU virtual
>>>>> addresses directly to user space. But that's another topic.
>>>>>
>>>> Nouveau already exposes constant virtual addresses to userspace and
>>>> skips the pushbuf patching when the presumed offset from userspace is
>>>> the same as what the kernel thinks it should be.
>>>>
>>>> The problem with lowmem on ARM is that you can't unmap those pages from
>>>> the kernel cached mapping. So if you alloc a page, give it to userspace
>>>> and userspace decides to map the page WC you just produced a conflicting
>>>> mapping, which may yield undefined results on ARMv7. You may think this
>>>> is not a problem as you are not touching the kernel cached mapping, but
>>>> in fact it is. The CPU's prefetcher can still access this mapping.
>>>
>>> Why would this memory be mapped into the kernel?
>>
>> On ARM the kernel keeps a linear mapping of lowmem using sections
>> (ARM's version of huge pages). This is always cached, and because the
>> sections are not 4k, it's a pain to remove parts of it. See
>> arch/arm/mm/mmu.c
>
> Ah, are we talking about the directly-mapped low memory region
> starting at PAGE_OFFSET? Ok, it makes sense now, thanks.
>
> But it seems to me that such conflicting mappings can also happen in
> many other scenarios, can't they? How is the issue handled in
> these cases?

It depends. A lot of cache controllers actually implement a solution
for that in hardware. For example, I think Tegra2 is one of those
platforms. And then a lot of platforms just ignore the issue completely
because it has a very low probability of occurring.

Stéphane
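
For readers following the lowmem discussion above, here is a minimal sketch
of why a page handed to userspace still has a cached kernel alias. It is not
taken from arch/arm/mm/mmu.c. The EX_* constants and the helper names are
hypothetical, and the values assume a 3G/1G split, RAM starting at
0x80000000 and 1 MiB sections (classic ARMv7 short descriptors; LPAE uses
larger block mappings).

```c
#include <stdint.h>

/* Illustrative values only; real ones depend on kernel config and board. */
#define EX_PAGE_OFFSET  0xC0000000u /* assumed start of the kernel linear map */
#define EX_PHYS_OFFSET  0x80000000u /* assumed physical base of RAM */
#define EX_PAGE_SIZE    0x1000u     /* 4 KiB page */
#define EX_SECTION_SIZE 0x100000u   /* 1 MiB section (short-descriptor format) */

/* Address at which a lowmem physical page appears in the linear map. */
static inline uint32_t ex_lowmem_phys_to_virt(uint32_t phys)
{
    return phys - EX_PHYS_OFFSET + EX_PAGE_OFFSET;
}

/* Base of the section whose single descriptor covers this address. */
static inline uint32_t ex_section_base(uint32_t virt)
{
    return virt & ~(EX_SECTION_SIZE - 1u);
}

/* Number of 4 KiB pages that share one section's cache attributes. */
static inline uint32_t ex_pages_per_section(void)
{
    return EX_SECTION_SIZE / EX_PAGE_SIZE;
}
```

Under these assumptions every lowmem page has a fixed cached alias in the
linear map, and changing the attributes for one 4 KiB page would mean
splitting the section that also covers its neighbours. That is the "pain to
remove parts of it" mentioned above.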