TTM huge page-faults WAS: Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit

Wed Sep 11 15:08:36 UTC 2019

On 9/11/19 4:06 PM, Koenig, Christian wrote:
> Am 11.09.19 um 12:10 schrieb Thomas Hellström (VMware):
> [SNIP]
>>>> The problem seen in TTM is that we want to be able to change the
>>>> vm_page_prot from the fault handler, but it's problematic since we
>>>> have the mmap_sem typically only in read mode. Hence the fake vma
>>>> hack. From what I can tell it's reasonably well-behaved, since
>>>> pte_modify() skips the bits TTM updates, so mprotect() and mremap()
>>>> works OK. I think split_huge_pmd may run into trouble, but we don't
>>>> support it (yet) with TTM.
>>> Ah! I actually ran into this while implementing huge page support for
>>> TTM and never figured out why that doesn't work. Dropped CPU huge page
>>> support because of this.
>> By incident, I got slightly sidetracked the other day and started
>> looking at this as well. Got to the point where I figured out all the
>> hairy alignment issues and actually got huge_fault() calls, but never
>> implemented the handler. I think that's definitely something worth
>> having. Not sure it will work for IO memory, though, (split_huge_pmd
>> will just skip non-page-backed memory) but if we only support
>> VM_SHARED (non COW) vmas there's no reason to split the huge pmds
>> anyway. Definitely something we should have IMO.
> Well our primary use case would be IO memory, cause system memory is
> only optionally allocate as huge page but we nearly always allocate VRAM
> in chunks of at least 2MB because we otherwise get a huge performance
> penalty.

But that system memory option is on by default, right? In any case, a 
request for a huge_fault
would probably need to check that there is actually an underlying 
huge_page and otherwise fallback to ordinary faults.

Another requirement would be for IO memory allocations to be 
PMD_PAGE_SIZE aligned in the mappable aperture, to avoid fallbacks to 
ordinary faults. Probably increasing fragmentation somewhat. (Seems like 
pmd entries can only point to PMD_PAGE_SIZE aligned physical addresses) 
Would that work for you?

>>>> We could probably get away with a WRITE_ONCE() update of the
>>>> vm_page_prot before taking the page table lock since
>>>>
>>>> a) We're locking out all other writers.
>>>> b) We can't race with another fault to the same vma since we hold an
>>>> address space lock ("buffer object reservation")
>>>> c) When we need to update there are no valid page table entries in the
>>>> vma, since it only happens directly after mmap(), or after an
>>>> unmap_mapping_range() with the same address space lock. When another
>>>> reader (for example split_huge_pmd()) sees a valid page table entry,
>>>> it also sees the new page protection and things are fine.
>>> Yeah, that's exactly why I always wondered why we need this hack with
>>> the vma copy on the stack.
>>>
>>>> But that would really be a special case. To solve this properly we'd
>>>> probably need an additional lock to protect the vm_flags and
>>>> vm_page_prot, taken after mmap_sem and i_mmap_lock.
>>> Well we already have a special lock for this: The reservation object. So
>>> memory barriers etc should be in place and I also think we can just
>>> update the vm_page_prot on the fly.
>> I agree. This is needed for huge pages. We should make this change,
>> and perhaps add the justification above as a comment.
> Alternatively we could introduce a new VM_* flag telling users of
> vm_page_prot to just let the pages table entries be filled by faults again

An interesting idea, although we'd lose things like dirty-tracking bits.

/Thomas

>
> Christian.