[Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice?

Tue Jan 28 20:24:33 UTC 2025

On 28.01.25 21:14, Simona Vetter wrote:
> On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote:
>> On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote:
>>>>>> On integrated the gpu is tied into the coherency
>>>>>> fabric, so there it's not needed.
>>>>>>
>>>>>> I think the more fundamental question with both this function here and
>>>>>> with forced migration to device memory is that there's no guarantee it
>>>>>> will work out.
>>>>>
>>>>> Yes, in particular with device-exclusive, it doesn't really work with THP
>>>>> and is only limited to anonymous memory. I have patches to at least make it
>>>>> work reliably with THP.
>>>>
>>>> I should have crawled through the implementation first before replying.
>>>> Since it only looks at folio_mapcount() make_device_exclusive() should at
>>>> least in theory work reliably on anon memory, and not be impacted by
>>>> elevated refcounts due to migration/ksm/thp/whatever.
>>>
>>> Yes, there is -- in theory -- nothing blocking the conversion except the
>>> folio lock. That's different than page migration.
>>
>> Indeed - this was the entire motivation for make_device_exclusive() - that we
>> needed a way to reliably exclude CPU access that couldn't be blocked in the same
>> way page migration can (otherwise we could have just migrated to a device page,
>> even if that may have added unwanted overhead).
> 
> The folio_trylock worries me a bit. I guess this is to avoid deadlocks
> when locking multiple folios, but I think at least on the first one we
> need an unconditional folio_lock to guarantee forward progress.

At least on the hmm path I was able to trigger the EBUSY a couple of 
times due to concurrent swapout. But the hmm-tests selftest fails 
immediately instead of retrying.

> Since
> atomics can't cross 4k boundaries (or the hw is just really broken) this
> should be enough to avoid being stuck in a livelock. I'm also not seeing
> any other reason why a folio_lock shouldn't work here, but then my
> understanding of mm/ stuff is really just scratching the surface.
> 
> I did crawl through all the other code and it looks like everything else
> is unconditional locks. So looks all good and I didn't spot anything else
> that seemed problematic.
> 
> Somewhat aside, I do wonder whether we really want to require callers to
> hold the mmap lock, or whether with all the work towards lockless fastpath
> that shouldn't instead just be an implementation detail.

We might be able to use the VMA lock in the future, but that will 
require GUP support and a bunch more. Until then, the mm_lock in read 
mode is required.

I was not able to convince myself that we'll really need the folio lock, 
but that's also a separate discussion.

> At least for the
> gpu hmm code I've seen I've tried to push hard towards a world were the
> gpu side does not rely on mmap_read_lock being held at all, to future
> proof this all. And currently we only have one caller of
> make_device_exclusive_range() so would be simple to do.

We could likely move the mmap_lock into that function, but avoiding it 
is more effort.

In any case, I'll send something out probably tomorrow to fix page 
migration/swapout of pages with device-exclusive entries and a bunch of 
other things (THP, interaction with hugetlb, ...).

-- 
Cheers,

David / dhildenb