[PATCH 1/3] drm/amdkfd: support concurrent userptr update for HMM

Thu Mar 7 00:54:19 UTC 2019

Hi Felix,

Thanks, there are other corner cases, call untrack at end of restore 
userptr worker is better place to cleanup. I will submit v2 patch, to 
fix this issue completely.

Philip

On 2019-03-06 3:01 p.m., Kuehling, Felix wrote:
> Hmm, I'm not sure. This change probably fixes this issue, but there may
> be other similar corner cases in other situations where the restore
> worker fails and needs to retry. The better place to call untrack in
> amdgpu_amdkfd_restore_userptr_worker would be at the very end. Anything
> that's left in the userptr_inval_list at that point needs to be untracked.
> 
> For now as a quick fix for an urgent bug, this change is Reviewed-by:
> Felix Kuehling <Felix.Kuehling at amd.com>. But please revisit this and
> check if there are similar corner cases as I explained above.
> 
> Regards,
>     Felix
> 
> On 3/5/2019 1:09 PM, Yang, Philip wrote:
>> Userptr restore may have concurrent userptr invalidation after
>> hmm_vma_fault adds the range to the hmm->ranges list, needs call
>> hmm_vma_range_done to remove the range from hmm->ranges list first,
>> then reschedule the restore worker. Otherwise hmm_vma_fault will add
>> same range to the list, this will cause loop in the list because
>> range->next point to range itself.
>>
>> Add function untrack_invalid_user_pages to reduce code duplication.
>>
>> Change-Id: I31407739dc10554f8e418c7a0e0415d3d95552f1
>> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
>> ---
>>    .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 28 ++++++++++++++-----
>>    1 file changed, 21 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 314c048fcac6..783d760ccfe3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -1731,6 +1731,23 @@ static int update_invalid_user_pages(struct amdkfd_process_info *process_info,
>>    	return 0;
>>    }
>>    
>> +/* Remove invalid userptr BOs from hmm track list
>> + *
>> + * Stop HMM track the userptr update
>> + */
>> +static void untrack_invalid_user_pages(struct amdkfd_process_info *process_info)
>> +{
>> +	struct kgd_mem *mem, *tmp_mem;
>> +	struct amdgpu_bo *bo;
>> +
>> +	list_for_each_entry_safe(mem, tmp_mem,
>> +				 &process_info->userptr_inval_list,
>> +				 validate_list.head) {
>> +		bo = mem->bo;
>> +		amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
>> +	}
>> +}
>> +
>>    /* Validate invalid userptr BOs
>>     *
>>     * Validates BOs on the userptr_inval_list, and moves them back to the
>> @@ -1848,12 +1865,7 @@ static int validate_invalid_user_pages(struct amdkfd_process_info *process_info)
>>    out_free:
>>    	kfree(pd_bo_list_entries);
>>    out_no_mem:
>> -	list_for_each_entry_safe(mem, tmp_mem,
>> -				 &process_info->userptr_inval_list,
>> -				 validate_list.head) {
>> -		bo = mem->bo;
>> -		amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
>> -	}
>> +	untrack_invalid_user_pages(process_info);
>>    
>>    	return ret;
>>    }
>> @@ -1897,8 +1909,10 @@ static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
>>    	 * and we can just restart the queues.
>>    	 */
>>    	if (!list_empty(&process_info->userptr_inval_list)) {
>> -		if (atomic_read(&process_info->evicted_bos) != evicted_bos)
>> +		if (atomic_read(&process_info->evicted_bos) != evicted_bos) {
>> +			untrack_invalid_user_pages(process_info);
>>    			goto unlock_out; /* Concurrent eviction, try again */
>> +		}
>>    
>>    		if (validate_invalid_user_pages(process_info))
>>    			goto unlock_out;