[PATCH 1/2] drm/ttm: rework ttm_tt page limit v2

Thu Dec 17 15:35:55 UTC 2020

Am 17.12.20 um 16:26 schrieb Daniel Vetter:
> On Thu, Dec 17, 2020 at 4:10 PM Christian König
> <christian.koenig at amd.com> wrote:
>> Am 17.12.20 um 15:36 schrieb Daniel Vetter:
>>> On Thu, Dec 17, 2020 at 2:46 PM Christian König
>>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>>> Am 16.12.20 um 16:09 schrieb Daniel Vetter:
>>>>> On Wed, Dec 16, 2020 at 03:04:26PM +0100, Christian König wrote:
>>>>> [SNIP]
>>>>>> +
>>>>>> +/* As long as pages are available make sure to release at least one */
>>>>>> +static unsigned long ttm_tt_shrinker_scan(struct shrinker *shrink,
>>>>>> +                                      struct shrink_control *sc)
>>>>>> +{
>>>>>> +    struct ttm_operation_ctx ctx = {
>>>>>> +            .no_wait_gpu = true
>>>>> Iirc there's an eventual shrinker limit where it gets desperate. I think
>>>>> once we hit that, we should allow gpu waits. But it's not passed to
>>>>> shrinkers for reasons, so maybe we should have a second round that tries
>>>>> to more actively shrink objects if we fell substantially short of what
>>>>> reclaim expected us to do?
>>>> I think we should try to avoid waiting for the GPU in the shrinker callback.
>>>>
>>>> When we get HMM we will have cases where the shrinker is called from
>>>> there and we can't wait for the GPU then without causing deadlocks.
>>> Uh that doesn't work. Also, the current rules are that you are allowed
>>> to call dma_fence_wait from shrinker callbacks, so that shipped sailed
>>> already. This is because shrinkers are a less restrictive context than
>>> mmu notifier invalidation, and we wait in there too.
>>>
>>> So if you can't wait in shrinkers, you also can't wait in mmu
>>> notifiers (and also not in HMM, wĥich is the same thing). Why do you
>>> need this?
>> The core concept of HMM is that pages are faulted in on demand and it is
>> perfectly valid for one of those pages to be on disk.
>>
>> So when a page fault happens we might need to be able to allocate memory
>> and fetch something from disk to handle that.
>>
>> When this memory allocation then in turn waits for the GPU which is
>> running the HMM process we are pretty much busted.
> Yeah you can't do that. That's the entire infinite fences discussions.

Yes, exactly.

> For HMM to work, we need to stop using dma_fence for userspace sync,

I was considering of separating that into a dma_fence and a hmm_fence. 
Or something like this.

> and you can only use the amdkfd style preempt fences. And preempting
> while the pagefault is pending is I thought something we require.

Yeah, problem is that most hardware can't do that :)

Getting page faults to work is hard enough, preempting while waiting for 
a fault to return is not something which was anticipated :)

> Iow, the HMM page fault handler must not be a dma-fence critical
> section, i.e. it's not allowed to hold up any dma_fence, ever.

What do you mean with that?

> One consequence of this is that you can use HMM for compute, but until
> we've revamped all the linux winsys layers, not for gl/vk. Or at least
> I'm not seeing how.
>
> Also like I said, dma_fence_wait is already allowed in mmu notifiers,
> so we've already locked down these semantics even more. Due to the
> nesting of gfp allocation contexts allowing dma_fence_wait in mmu
> notifiers (i.e. __GFP_ALLOW_RECLAIM or whatever the flag is exactly)
> implies it's allowed in shrinkers. And only if you forbid it from from
> all allocations contexts (which makes all buffer object managed gpu
> memory essentially pinned, exactly what you're trying to lift here) do
> you get what you want.
>
> The other option is to make HMM and dma-buf completely disjoint worlds
> with no overlap, and gang scheduling on the gpu (to guarantee that
> there's never any dma_fence in pending state while an HMM task might
> cause a fault).
>
>> [SNIP]
>>> So where do you want to recurse here?
>> I wasn't aware that without __GFP_FS shrinkers are not called.
> Maybe double check, but that's at least my understanding. GFP flags
> are flags, but in reality it's a strictly nesting hierarchy:
> GFP_KERNEL > GFP_NOFS > GFP_NOIO > GFP_RELCAIM > GFP_ATOMIC (ok atomic
> is special, since it's allowed to dip into emergency reserve).

Going to read myself into that over the holidays.

Christian.