[RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

Fri Mar 26 12:33:29 UTC 2021

On 3/26/21 12:46 PM, Jason Gunthorpe wrote:
> On Fri, Mar 26, 2021 at 10:08:09AM +0100, Thomas Hellström (Intel) wrote:
>> On 3/25/21 7:24 PM, Jason Gunthorpe wrote:
>>> On Thu, Mar 25, 2021 at 07:13:33PM +0100, Thomas Hellström (Intel) wrote:
>>>> On 3/25/21 6:55 PM, Jason Gunthorpe wrote:
>>>>> On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote:
>>>>>> On 3/24/21 9:25 PM, Dave Hansen wrote:
>>>>>>> On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:
>>>>>>>>> We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
>>>>>>>>> used.  It's quite possible we can encode another use even in the
>>>>>>>>> existing bits.
>>>>>>>>>
>>>>>>>>> Personally, I'd just try:
>>>>>>>>>
>>>>>>>>> #define _PAGE_BIT_SOFTW5        57      /* available for programmer */
>>>>>>>>>
>>>>>>>> OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
>>>>>>>> used in a selftest, but only for PTEs AFAICT.
>>>>>>>>
>>>>>>>> Oh, and we don't care about 32-bit much anymore?
>>>>>>> On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
>>>>>>> enabled.  IOW, we can handle the majority of 32-bit CPUs out there.
>>>>>>>
>>>>>>> But, yeah, we don't care about 32-bit. :)
>>>>>> Hmm,
>>>>>>
>>>>>> Actually it makes some sense to use SW1, to make it end up in the same dword
>>>>>> as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit
>>>>>> PAE is not atomic, so in theory a huge pmd could be modified while reading
>>>>>> the pmd_t making the dwords inconsistent.... How does that work with fast
>>>>>> gup anyway?
>>>>> It loops to get an atomic 64 bit value if the arch can't provide an
>>>>> atomic 64 bit load
>>>> Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting pmd
>>>> is dereferenced either in try_grab_compound_head() or __gup_device_huge(),
>>>> before the pmd is compared to the value the pointer is currently pointing
>>>> to. Couldn't those dereferences be on invalid pointers?
>>> Uhhhhh.. That does look questionable, yes. Unless there is some tricky
>>> reason why a 64 bit pmd entry on a 32 bit arch either can't exist or
>>> has a stable upper 32 bits..
>>>
>>> The pte does it with ptep_get_lockless(), we probably need the same
>>> for the other levels too instead of open coding a READ_ONCE?
>>>
>>> Jason
>> TBH, ptep_get_lockless() also looks a bit fishy. it says
>> "it will not switch to a completely different present page without a TLB
>> flush in between".
>>
>> What if the following happens:
>>
>> processor 1: Reads lower dword of PTE.
>> processor 2: Zaps PTE. Gets stuck waiting to do TLB flush
>> processor 1: Reads upper dword of PTE, which is now zero.
>> processor 3: Hits a TLB miss, reads an unpopulated PTE and faults in a new
>> PTE value which happens to be the same as the original one before the zap.
>> processor 1: Reads the newly faulted in lower dword, compares to the old
>> one, gives an OK and returns a bogus PTE.
> So you are saying that while the zap will wait for the TLB flush to
> globally finish once it gets started any other processor can still
> write to the pte?
>
> I can't think of any serialization that would cause fault to wait for
> the zap/TLB flush, especially if the zap comes from the address_space
> and doesn't hold the mmap lock.

I might of course be completely wrong, but It seems there is an 
assumption made that all potentially affected processors would have a 
valid TLB entry for the PTE. Then the fault would not happen (well 
unless of course the TLB flush completes on some processors before 
getting stuck on the local_irq_disable() on processor 1).

+CC: Nick Piggin

Seems like Nick Piggin is the original author of the comment. Perhaps he 
can can clarify a bit.

/Thomas

>
> Seems worth bringing up in a bigger thread, maybe someone else knows?
>
> Jason