stupid PAT :)

Thu Aug 28 21:01:12 UTC 2025

On 26.08.25 16:27, Thomas Hellström wrote:
> Hi, Christian,
> 
> On Tue, 2025-08-26 at 11:56 +0200, Christian König wrote:
>> On 26.08.25 11:17, David Hildenbrand wrote:
>>> On 26.08.25 11:00, Christian König wrote:
>>>> On 26.08.25 10:46, David Hildenbrand wrote:
>>>>>>> So my assumption would be that that is missing for the
>>>>>>> drivers here?
>>>>>>
>>>>>> Well yes and no.
>>>>>>
>>>>>> See the PAT is optimized for applying specific caching
>>>>>> attributes to ranges [A..B] (e.g. it uses an R/B tree). But
>>>>>> what drivers do here is that they have single pages (usually
>>>>>> for get_free_page or similar) and want to apply a certain
>>>>>> caching attribute to it.
>>>>>>
>>>>>> So what would happen is that we completely clutter the R/B
>>>>>> tree used by the PAT with thousands if not millions of
>>>>>> entries.
>>>>>>
>>>>>
>>>>> Hm, above you're saying that there is no direct map, but now
>>>>> you are saying that the pages were obtained through
>>>>> get_free_page()?
>>>>
>>>> The problem only happens with highmem pages on 32bit kernels.
>>>> Those pages are not in the linear mapping.
>>>
>>> Right, in the common case there is a direct map.
>>>
>>>>
>>>>> I agree that what you describe here sounds suboptimal. But if
>>>>> the pages where obtained from the buddy, there surely is a
>>>>> direct map -- unless we explicitly remove it :(
>>>>>
>>>>> If we're talking about individual pages without a directmap, I
>>>>> would wonder if they are actually part of a bigger memory
>>>>> region that can just be reserved in one go (similar to how
>>>>> remap_pfn_range()) would handle it.
>>>>>
>>>>> Can you briefly describe how your use case obtains these PFNs,
>>>>> and how scattered tehy + their caching attributes might be?
>>>>
>>>> What drivers do is to call get_free_page() or alloc_pages_node()
>>>> with the GFP_HIGHUSER flag set.
>>>>
>>>> For non highmem pages drivers then calls set_pages_wc/uc() which
>>>> changes the caching of the linear mapping, but for highmem pages
>>>> there is no linear mapping so set_pages_wc() or set_pages_uc()
>>>> doesn't work and drivers avoid calling it.
>>>>
>>>> Those are basically just random system memory pages. So they are
>>>> potentially scattered over the whole memory address space.
>>>
>>> Thanks, that's valuable information.
>>>
>>> So essentially these drivers maintain their own consistency and PAT
>>> is not aware of that.
>>>
>>> And the real problem is ordinary system RAM.
>>>
>>> There are various ways forward.
>>>
>>> 1) We use another interface that consumes pages instead of PFNs,
>>> like a
>>>     vm_insert_pages_pgprot() we would be adding.
>>>
>>>     Is there any strong requirement for inserting non-refcounted
>>> PFNs?
>>
>> Yes, there is a strong requirement to insert non-refcounted PFNs.
>>
>> We had a lot of trouble with KVM people trying to grab a reference to
>> those pages even if the VMA had the VM_PFNMAP flag set.
>>
>>> 2) We add another interface that consumes PFNs, but explicitly
>>> states
>>>     that it is only for ordinary system RAM, and that the user is
>>>     required for updating the direct map.
>>>
>>>     We could sanity-check the direct map in debug kernels.
>>
>> I would rather like to see vmf_insert_pfn_prot() fixed instead.
>>
>> That function was explicitly added to insert the PFN with the given
>> attributes and as far as I can see all users of that function expect
>> exactly that.
>>
>>>
>>> 3) We teach PAT code in pfnmap_setup_cachemode_pfn() about treating
>>> this
>>>     system RAM differently.
>>>
>>>
>>> There is also the option for a mixture between 1 and 2, where we
>>> get pages, but we map them non-refcounted in a VM_PFNMAP.
>>>
>>> In general, having pages makes it easier to assert that they are
>>> likely ordinary system ram pages, and that the interface is not
>>> getting abused for something else.
>>
>> Well, exactly that's the use case here and that is not abusive at all
>> as far as I can see.
>>
>> What drivers want is to insert a PFN with a certain set of caching
>> attributes regardless if it's system memory or iomem. That's why
>> vmf_insert_pfn_prot() was created in the first place.
>>
>> That drivers need to call set_pages_wc/uc() for the linear mapping on
>> x86 manually is correct and checking that is clearly a good idea for
>> debug kernels.
> 
> So where is this trending? Is the current suggestion to continue
> disallowing aliased mappings with conflicting caching modes and enforce
> checks in debug kernels?

Not sure, it's a mess. The big question is to find out when it is really 
ok to bypass PAT and when to better let it have a saying.

-- 
Cheers

David / dhildenb