[PATCH v3 05/12] drm/ttm: Expose ttm_tt_unpopulate for driver use

Christian König christian.koenig at amd.com
Thu Dec 17 20:10:07 UTC 2020


[SNIP]
>>>> By eliminating such users, and replacing them with local maps which
>>>> are strictly bound in how long they can exist (and hence we can
>>>> serialize against them finishing in our hotunplug code).
>>> Not sure I see how serializing against BO map/unmap helps - our 
>>> problem, as you described, is that once the
>>> device is extracted and then something else quickly takes its place 
>>> in the
>>> PCI topology
>>> and gets assigned the same physical IO ranges, then our driver will 
>>> start accessing this
>>> new device because our 'zombie' BOs are still pointing to those ranges.
>> Until your driver's remove callback is finished the ranges stay 
>> reserved.
>
>
> The ranges stay reserved until unmapped, which happens in bo->destroy

I'm not sure of that. Why do you think that?

> which for most internally allocated buffers is during sw_fini, when 
> the last drm_put
> is called.
>
>
>> If that's not the case, then hotunplug would be fundamentally impossible
>> to handle correctly.
>>
>> Of course all the mmio actions will time out, so it might take some time
>> to get through it all.
>
>
> I found that the PCI code provides a pci_device_is_present() function
> we can use to avoid timeouts - it reads the device vendor ID and checks 
> whether all 1s are returned
> or not. We can call it from within register accessors before trying to 
> read/write
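
For reference, the proposal amounts to something like this (an illustrative
sketch only, not the actual amdgpu accessor code):

#include <linux/io.h>
#include <linux/pci.h>

struct example_device {
	struct pci_dev *pdev;
	void __iomem *mmio;
};

/* Illustrative register read: pci_device_is_present() does a config-space
 * vendor-ID read and checks for all 1s, so placing it in front of every
 * MMIO access adds one config read per register access. */
static u32 example_mm_rreg(struct example_device *edev, u32 reg)
{
	if (!pci_device_is_present(edev->pdev))
		return ~0U;	/* device is gone, return all ones */

	return readl(edev->mmio + reg * 4);
}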

That's way too much overhead! We need to keep that much lower or it will 
result in quite a performance drop.

I suggest rather thinking about adding drm_dev_enter/exit guards.
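
A minimal sketch of such a guard, assuming the standard
drm_dev_enter()/drm_dev_exit() helpers (the accessor itself is made up
for illustration):

#include <linux/io.h>
#include <drm/drm_drv.h>

/* Illustrative register read guarded by drm_dev_enter()/drm_dev_exit():
 * once drm_dev_unplug() has been called the enter fails and the MMIO
 * access is skipped entirely; the per-access cost is only an SRCU
 * read-side section instead of a config-space read. */
static u32 example_guarded_rreg(struct drm_device *drm,
				void __iomem *mmio, u32 reg)
{
	u32 val = ~0U;	/* returned once the device is unplugged */
	int idx;

	if (drm_dev_enter(drm, &idx)) {
		val = readl(mmio + reg * 4);
		drm_dev_exit(idx);
	}

	return val;
}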

Christian.

>
>>> Another point regarding serializing - the problem is that some of those 
>>> BOs are
>>> very long lived, take for example the HW command
>>> ring buffer Christian mentioned before -
>>> (amdgpu_ring_init->amdgpu_bo_create_kernel), its life span
>>> is basically the entire time the device exists, it's destroyed 
>>> only in
>>> the SW fini stage (when the last drm_dev
>>> reference is dropped), and so should I grab its dma_resv_lock from
>>> amdgpu_pci_remove code and wait
>>> for it to be unmapped before proceeding with the PCI remove code? 
>>> This can
>>> take unbounded time, and that's why I don't understand
>>> how serializing will help.
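
For context, that BO is created and freed roughly like this (simplified
from amdgpu_ring_init()/amdgpu_ring_fini(); the exact size and arguments
are illustrative):

#include "amdgpu.h"
#include "amdgpu_ring.h"

static int example_ring_buffer_init(struct amdgpu_device *adev,
				    struct amdgpu_ring *ring)
{
	/* Pinned, CPU-mapped kernel BO in GTT backing the HW command ring. */
	return amdgpu_bo_create_kernel(adev, ring->ring_size, PAGE_SIZE,
				       AMDGPU_GEM_DOMAIN_GTT,
				       &ring->ring_obj,
				       &ring->gpu_addr,
				       (void **)&ring->ring);
}

static void example_ring_buffer_fini(struct amdgpu_ring *ring)
{
	/* Only called from the SW fini path, i.e. the CPU mapping lives
	 * essentially as long as the device does. */
	amdgpu_bo_free_kernel(&ring->ring_obj, &ring->gpu_addr,
			      (void **)&ring->ring);
}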
>> Uh you need to untangle that. After hw cleanup is done no one is allowed
>> to touch that ringbuffer bo anymore from the kernel.
>
>
> I would assume we are not allowed to touch it once we've identified that 
> the device is
> gone, in order to minimize the chance of accidental writes to some 
> other device which might now
> occupy those IO ranges?
>
>
>>   That's what
>> drm_dev_enter/exit guards are for. Like you say we can't wait for all sw
>> references to disappear.
>
>
> Yes, it didn't make sense to me why we would use vmap_local for internally
> allocated buffers. I think we should also guard register reads/writes 
> for the
> same reason as above.
>
>
>>
>> The vmap_local is for mappings done by other drivers, through the 
>> dma-buf
>> interface (where "other drivers" can include fbdev/fbcon, if you use the
>> generic helpers).
>> -Daniel
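
For illustration, an importer-side use of that interface might look
roughly like this (function and struct names are taken from the
vmap_local series under discussion and may still change):

#include <linux/dma-buf.h>
#include <linux/dma-buf-map.h>
#include <linux/dma-resv.h>

/* Hypothetical importer: the local mapping may only exist while the
 * exporter's reservation lock is held, so the exporter can serialize
 * hotunplug against it instead of waiting for an unbounded vmap. */
static int example_peek_dmabuf(struct dma_buf *dmabuf)
{
	struct dma_buf_map map;
	int ret;

	dma_resv_lock(dmabuf->resv, NULL);

	ret = dma_buf_vmap_local(dmabuf, &map);
	if (!ret) {
		/* short, strictly bounded access via map.vaddr /
		 * map.vaddr_iomem goes here */
		dma_buf_vunmap_local(dmabuf, &map);
	}

	dma_resv_unlock(dmabuf->resv);
	return ret;
}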
>
>
> Ok, so I assumed that with vmap_local you were trying to solve the 
> problem of quick reinsertion
> of another device into the same MMIO range that my driver still points to, 
> but actually are you trying to solve
> the issue of exported dma-bufs outliving the device? For this we 
> have the drm_device refcount in the GEM layer,
> I think.
>
> Andrey
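
For reference, the refcounting the PRIME helpers do, in simplified form
(see drm_gem_dmabuf_export()/drm_gem_dmabuf_release() in drm_prime.c):

#include <drm/drm_drv.h>
#include <drm/drm_gem.h>

/* Simplified view of what the PRIME export/release paths do: every
 * exported dma-buf holds a reference on the drm_device and on the GEM
 * object, so the SW structures stay valid until the last importer
 * closes the dma-buf, even after the physical device is gone. */
static void example_export_take_refs(struct drm_gem_object *obj)
{
	drm_dev_get(obj->dev);
	drm_gem_object_get(obj);
}

static void example_release_drop_refs(struct drm_gem_object *obj)
{
	drm_gem_object_put(obj);
	drm_dev_put(obj->dev);
}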
>
>
>>
>>> Andrey
>>>
>>>
>>>> It doesn't
>>>> solve all your problems, but it's a tool to get there.
>>>> -Daniel
>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>> - handle fbcon somehow. I think shutting it all down should work 
>>>>>> out.
>>>>>> - worst case keep the system backing storage around for shared 
>>>>>> dma-buf
>>>>>> until the other non-dynamic driver releases it. for vram we require
>>>>>> dynamic importers (and maybe it wasn't such a bright idea to allow
>>>>>> pinning of importer buffers, might need to revisit that).
>>>>>>
>>>>>> Cheers, Daniel
>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>> -Daniel
>>>>>>>>>
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>> I loaded the driver with vm_update_mode=3,
>>>>>>>>>>> meaning all VM updates are done using the CPU, and haven't seen any 
>>>>>>>>>>> OOPSes after
>>>>>>>>>>> removing the device. I guess I can test it more by 
>>>>>>>>>>> allocating GTT and
>>>>>>>>>>> VRAM BOs
>>>>>>>>>>> and trying to read/write to them after the device is removed.
>>>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>>> Andrey
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>
>>>>>>>>>>>
>>>>


