[PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk

Thu Jan 30 09:47:29 UTC 2025

On 30.01.25 10:40, Simona Vetter wrote:
> On Thu, Jan 30, 2025 at 05:11:49PM +1100, Alistair Popple wrote:
>> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
>>> We require a writable PTE and only support anonymous folio: we can only
>>> have exactly one PTE pointing at that page, which we can just lookup
>>> using a folio walk, avoiding the rmap walk and the anon VMA lock.
>>>
>>> So let's stop doing an rmap walk and perform a folio walk instead, so we
>>> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
>>>
>>> We now effectively work on a single PTE instead of multiple PTEs of
>>> a large folio, allowing for conversion of individual PTEs from
>>> non-exclusive to device-exclusive -- note that the other way always
>>> worked on single PTEs.
>>>
>>> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
>>> that is not required: GUP will already take care of the
>>> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
>>> entry) when not finding a present PTE and having to trigger a fault and
>>> ending up in remove_device_exclusive_entry().
>>
>> I will have to look at this a bit more closely tomorrow but this doesn't seem
>> right to me. We may be transitioning from a present PTE (ie. a writable
>> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
>> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
>> update their copies of the PTE. So I think the notifier call is needed.
> 
> I guess this is a question of semantics we want, for multiple gpus do we
> require that device-exclusive also excludes other gpus or not. I'm leaning
> towards agreeing with you here.

See my reply, it's also relevant for non-device, such as KVM. So it's 
the right thing to do.

> 
>>> Note that the PTE is
>>> always writable, and we can always create a writable-device-exclusive
>>> entry.
>>>
>>> With this change, device-exclusive is fully compatible with THPs /
>>> large folios. We still require PMD-sized THPs to get PTE-mapped, and
>>> supporting PMD-mapped THP (without the PTE-remapping) is a different
>>> endeavour that might not be worth it at this point.
> 
> I'm not sure we actually want hugepages for device exclusive, since it has
> an impact on what's allowed and what not. If we only ever do 4k entries
> then userspace can assume that as long atomics are separated by a 4k page
> there's no issue when both the gpu and cpu hammer on them. If we try to
> keep thp entries then suddenly a workload that worked before will result
> in endless ping-pong between gpu and cpu because the separate atomic
> counters (or whatever) now all sit in the same 2m page.

Agreed. And the conversion + mapping into the device gets trickier.

> 
> So going with thp might result in userspace having to spread out atomics
> even more, which is just wasting memory and not saving any tlb entries
> since often you don't need that many.
> 
> tldr; I think not supporting thp entries for device exclusive is a
> feature, not a bug.

So, you agree with my "different endeavour that might not be worth it" 
statement?

-- 
Cheers,

David / dhildenb