[PATCH 2/2] drm/amdgpu: Wait for pte updates before uq_resume

Mon Mar 17 14:40:36 UTC 2025

Hi Christian,

On 3/17/2025 7:33 PM, Christian König wrote:
> Am 16.03.25 um 16:36 schrieb Sathishkumar S:
>> Wait for vm page table updates to finish before resuming user queues.
>> Resume must follow after completion of pte updates to avoid page faults.
>>
>> amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:10 pasid:32771)
>> amdgpu:  in process  pid 0 thread  pid 0)
>> amdgpu:   in page starting at address 0x0000800105405000 from client 10
>> amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00A41051
>> amdgpu:      Faulty UTCL2 client ID: TCP (0x8)
>> amdgpu:      MORE_FAULTS: 0x1
>> amdgpu:      WALKER_ERROR: 0x0
>> amdgpu:      PERMISSION_FAULTS: 0x5
>> amdgpu:      MAPPING_ERROR: 0x0
>> amdgpu:      RW: 0x1
>> amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:10 pasid:32771)
>> amdgpu:  in process  pid 0 thread  pid 0)
>> amdgpu:   in page starting at address 0x0000800105404000 from client 10
>>
>> Signed-off-by: Sathishkumar S <sathishkumar.sundararaju at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
>> index f1d4e29772a5..4c3edd988a05 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
>> @@ -541,10 +541,23 @@ amdgpu_userqueue_validate_bos(struct amdgpu_userq_mgr *uq_mgr)
>>   static void amdgpu_userqueue_resume_worker(struct work_struct *work)
>>   {
>>   	struct amdgpu_userq_mgr *uq_mgr = work_to_uq_mgr(work, resume_work.work);
>> +	struct amdgpu_fpriv *fpriv = uq_mgr_to_fpriv(uq_mgr);
>> +	struct amdgpu_eviction_fence_mgr *evf_mgr = &fpriv->evf_mgr;
>> +	struct amdgpu_eviction_fence *ev_fence;
>>   	int ret;
>>   
>>   	mutex_lock(&uq_mgr->userq_mutex);
>>   
>> +	spin_lock(&evf_mgr->ev_fence_lock);
>> +	ev_fence = evf_mgr->ev_fence;
>> +	spin_unlock(&evf_mgr->ev_fence_lock);
>> +	if (ev_fence && dma_fence_is_signaled(&ev_fence->base)) {
>> +	/* Wait for the prior vm updates to complete before proceeding with resume */
>> +		dma_resv_wait_timeout(fpriv->vm.root.bo->tbo.base.resv,
>> +				      DMA_RESV_USAGE_BOOKKEEP,
>> +				      true,
>> +				      msecs_to_jiffies(AMDGPU_FENCE_JIFFIES_TIMEOUT));
>> +	}
> In general I agree that we need for PTE updates before resuming userqueues, but this here is just nonsense.

Okay, but the reason for adding this is, I was able to verify that 
corresponding vm sdma job fence is signaled few microseconds after the 
resume queue call in the timeline in tracing. I verified this by adding 
resume trace point after amdgpu_userqueue_validate_bos has completed 
(hoping this should do the waiting part), but observed that resume 
timestamp is before the sdma job fence signal timestamp.
Whereas after adding the dma_resv_wait on root.bo, the resume happens 
after sdma job fence is signalled, and was also able to see that page 
faults stopped after adding this check. Regards, Sathish
>
>>   	ret = amdgpu_userqueue_validate_bos(uq_mgr);
> This call here is validating the BOs, updating the PTEs *and* making sure that we wait for this update to finish.
>
> Waiting before that just doesn't make any sense as far as I can see.
>
> Regards,
> Christian.
>
>>   	if (ret) {
>>   		DRM_ERROR("Failed to validate BOs to restore\n");