<html><head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </head> <body> <div class="moz-cite-prefix">On 2021-04-21 11:29 a.m., Felix Kuehling wrote: </div> <blockquote type="cite" cite="mid:9ea89549-afb4-ed48-4ff2-0ae740251d59@amd.com">On 2021-04-21 3:55 a.m., Christian König wrote: <blockquote type="cite">On 21.04.21 at 03:20, Felix Kuehling wrote: <blockquote type="cite">On 2021-04-20 at 4:21 p.m., Philip Yang wrote: <blockquote type="cite">Add an interface to remove an address from the fault filter ring by resetting the timestamp of the fault ring entry for that address to 0, so that a future VM fault on the address will be processed for recovery. Checking the fault ring for an address, adding an address to the fault ring, and removing an address from the fault ring are all serialized in the same interrupt deferred work, so there is no race condition. Signed-off-by: Philip Yang <a class="moz-txt-link-rfc2396E" href="mailto:Philip.Yang@amd.com">&lt;Philip.Yang@amd.com&gt;</a> --- drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24 ++++++++++++++++++++++++ drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 2 ++ 2 files changed, 26 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c index c39ed9eb0987..338e45fa66cb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c @@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t addr, return false; } +/** + * amdgpu_gmc_filter_faults_remove - remove address from VM faults filter + * + * @adev: amdgpu device structure + * @addr: address of the VM fault + * @pasid: PASID of the process causing the fault + * + * Remove the address from fault filter, then future vm fault on this address + * will pass to retry fault handler to recover. 
+ */ +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr, + uint16_t pasid) +{ + struct amdgpu_gmc *gmc = &adev->gmc; + + uint64_t key = addr << 4 | pasid; + struct amdgpu_gmc_fault *fault; + uint32_t hash; + + hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER); + fault = &gmc->fault_ring[gmc->fault_hash[hash].idx]; </blockquote> You need to loop over the fault ring to find a fault with the matching key since there may be hash collisions. You also need to make sure you don't break the singly linked list of keys with the same hash when you remove an entry. I think the easier way to remove an entry without breaking this ring+closed hashing structure is to reset the fault->key rather than the fault->timestamp. Finally, you need to add locking to the fault ring structure. Currently it's not protected by any locks because only one thread (the interrupt handler) accesses it. Now you have another thread that can remove entries, so you need to protect it with a lock. If you are handling retry faults, you know that the interrupt handler is really a worker thread, so you can use a mutex or a spin-lock, but it doesn't need to be interrupt-safe. </blockquote> I don't think you need a lock at all. Just using cmpxchg() to update the key should do it. Something like this: hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER); fault = &gmc->fault_ring[gmc->fault_hash[hash].idx]; while (fault->timestamp >= stamp) { uint64_t tmp; cmpxchg(&fault->key, key, 0); </blockquote> Good idea. Then we should probably use READ_ONCE and WRITE_ONCE to access fault->key in other places. </blockquote> I will use a spinlock to protect access to the fault ring and the fault hash table, because atomic_cmpxchg cannot work for the 52-bit key, and VG20 needs to access the fault ring from both the interrupt handler and the deferred work. 
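For illustration only, here is a minimal userspace sketch of that spinlock approach, not the driver code. The names fault_filter, filter_fault and filter_remove, the ring and hash sizes, the timeout, and the hash function are all assumptions made for this example, and a C11 atomic_flag stands in for a kernel spinlock. Removal resets the key rather than the timestamp, as suggested in the review above, so the chain of entries sharing a hash stays walkable.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sizes and timeout for illustration, not the driver's values. */
#define RING_ORDER    4
#define RING_SIZE     (1u << RING_ORDER)
#define HASH_ORDER    3
#define HASH_SIZE     (1u << HASH_ORDER)
#define FAULT_TIMEOUT 5000ULL

struct fault_entry {
	uint64_t key;       /* addr << 4 | pasid; 0 marks a removed entry */
	uint64_t timestamp;
	uint32_t next;      /* ring index of the previous head of this bucket */
};

struct fault_filter {
	atomic_flag lock;   /* userspace stand-in for a kernel spinlock */
	struct fault_entry ring[RING_SIZE];
	uint32_t hash[HASH_SIZE];
	uint32_t last;      /* next ring slot to overwrite */
};

static void filter_lock(struct fault_filter *f)
{
	while (atomic_flag_test_and_set_explicit(&f->lock, memory_order_acquire))
		; /* spin until the flag was clear */
}

static void filter_unlock(struct fault_filter *f)
{
	atomic_flag_clear_explicit(&f->lock, memory_order_release);
}

/* Multiplicative hash; an assumed stand-in for the kernel's hash_64(). */
static uint32_t hash_key(uint64_t key)
{
	return (uint32_t)((key * 0x61C8864680B583EBULL) >> (64 - HASH_ORDER));
}

/* Returns true if the fault is already in the filter, else records it. */
static bool filter_fault(struct fault_filter *f, uint64_t addr,
			 uint16_t pasid, uint64_t timestamp)
{
	uint64_t key = addr << 4 | pasid;
	uint64_t stamp = (timestamp > FAULT_TIMEOUT ? timestamp
			  : FAULT_TIMEOUT + 1) - FAULT_TIMEOUT;
	uint32_t h = hash_key(key);
	struct fault_entry *e;
	bool found = false;

	filter_lock(f);
	e = &f->ring[f->hash[h]];
	while (e->timestamp >= stamp) {
		uint64_t tmp = e->timestamp;

		if (e->key == key) {
			found = true;
			break;
		}
		e = &f->ring[e->next];
		if (e->timestamp >= tmp)  /* entry was reused, chain ends */
			break;
	}
	if (!found) {
		e = &f->ring[f->last];
		e->key = key;
		e->timestamp = timestamp;
		e->next = f->hash[h];
		f->hash[h] = f->last;
		f->last = (f->last + 1) % RING_SIZE;
	}
	filter_unlock(f);
	return found;
}

/* Clears the key of every chained entry that matches; links stay intact. */
static void filter_remove(struct fault_filter *f, uint64_t addr, uint16_t pasid)
{
	uint64_t key = addr << 4 | pasid;
	struct fault_entry *e;
	uint32_t i;

	filter_lock(f);
	e = &f->ring[f->hash[hash_key(key)]];
	for (i = 0; i < RING_SIZE; i++) {  /* bounded walk over the chain */
		uint64_t tmp = e->timestamp;

		if (e->key == key)
			e->key = 0;
		e = &f->ring[e->next];
		if (e->timestamp >= tmp)  /* entry was reused, chain ends */
			break;
	}
	filter_unlock(f);
}
```

In this sketch, a fault that is recorded, removed with filter_remove(), and then raised again with a later timestamp is reported as new, while other addresses stay filtered. An address of 0 with PASID 0 would collide with the "removed" marker, which a real implementation would have to rule out.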
Regards, Philip <blockquote type="cite" cite="mid:9ea89549-afb4-ed48-4ff2-0ae740251d59@amd.com"> Regards, Felix <blockquote type="cite"> tmp = fault->timestamp; fault = &gmc->fault_ring[fault->next]; /* Check if the entry was reused */ if (fault->timestamp >= tmp) break; } Regards, Christian. <blockquote type="cite"> Regards, Felix <blockquote type="cite">+ fault->timestamp = 0; +} + int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev) { int r; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h index 9d11c02a3938..498a7a0d5a9e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h @@ -318,6 +318,8 @@ void amdgpu_gmc_agp_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc); bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t addr, uint16_t pasid, uint64_t timestamp); +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr, + uint16_t pasid); int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev); void amdgpu_gmc_ras_fini(struct amdgpu_device *adev); int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device *adev); </blockquote> </blockquote> </blockquote> </blockquote> </body> </html>