[PATCH v2] drm/amdkfd: kfd open return failed if device is locked

Kuehling, Felix Felix.Kuehling at amd.com
Mon Oct 21 16:46:05 UTC 2019


On 2019-10-18 6:54 p.m., Zeng, Oak wrote:
> In current implementation, even dqm is stopped, user can still create (and start) new queue. This is not correct. We should forbid user create/start new queue if dqm is stopped - stop means stopping the current executing queues and stop receiving new creating request.

Queues being stopped should be no reason not to create new queues. 
Creating a new queue is just allocating a new MQD and populating it. If 
the process is suspended (which it is during reset and suspend and 
evictions), there is no need to touch hardware registers or send an 
updated runlist to the HWS.

When the process is resumed at the end of the reset/suspend/eviction, 
that's when any newly created queues would get mapped to the hardware.

Regards,
   Felix


>
> Regards,
> Oak
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Kuehling, Felix
> Sent: Friday, October 18, 2019 3:08 PM
> To: amd-gfx at lists.freedesktop.org; Yang, Philip <Philip.Yang at amd.com>
> Subject: Re: [PATCH v2] drm/amdkfd: kfd open return failed if device is locked
>
> On 2019-10-18 1:36 p.m., Yang, Philip wrote:
>> If device is locked for suspend and resume, kfd open should return
>> failed -EAGAIN without creating process, otherwise the application
>> exit to release the process will hang to wait for resume is done if
>> the suspend and resume is stuck somewhere. This is backtrace:
>>
>> v2: fix processes that were created before suspend/resume got stuck
>>
>> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
>> than 120 seconds.
>> [Thu Oct 17 16:43:37 2019]       Not tainted
>> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1 [Thu Oct 17 16:43:37
>> 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>> [Thu Oct 17 16:43:37 2019] rocminfo        D    0  3024   2947
>> 0x80000000
>> [Thu Oct 17 16:43:37 2019] Call Trace:
>> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0 [Thu Oct 17
>> 16:43:37 2019]  schedule+0x32/0x70 [Thu Oct 17 16:43:37 2019]
>> schedule_preempt_disabled+0xa/0x10
>> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0 [Thu Oct
>> 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0 [Thu Oct 17 16:43:37
>> 2019]  ? process_termination_cpsch+0x24/0x2f0
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]
>> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu] [Thu Oct 17
>> 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0 [Thu Oct
>> 17 16:43:37 2019]  exit_mmap+0x160/0x1a0 [Thu Oct 17 16:43:37 2019]  ?
>> __handle_mm_fault+0xba3/0x1200 [Thu Oct 17 16:43:37 2019]  ?
>> exit_robust_list+0x5a/0x110 [Thu Oct 17 16:43:37 2019]
>> mmput+0x4a/0x120 [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20 [Thu
>> Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200 [Thu Oct 17
>> 16:43:37 2019]  do_group_exit+0x3a/0xa0 [Thu Oct 17 16:43:37 2019]
>> __x64_sys_exit_group+0x14/0x20 [Thu Oct 17 16:43:37 2019]
>> do_syscall_64+0x4f/0x100 [Thu Oct 17 16:43:37 2019]
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdkfd/kfd_chardev.c               | 6 +++---
>>    drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 6 ++++++
>>    2 files changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> index d9e36dbf13d5..40d75c39f08e 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file *filep)
>>    		return -EPERM;
>>    	}
>>    
>> +	if (kfd_is_locked())
>> +		return -EAGAIN;
>> +
>>    	process = kfd_create_process(filep);
>>    	if (IS_ERR(process))
>>    		return PTR_ERR(process);
>>    
>> -	if (kfd_is_locked())
>> -		return -EAGAIN;
>> -
>>    	dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
>>    		process->pasid, process->is_32bit_user_mode);
>>    
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>> index 8509814a6ff0..3784013b92a0 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>> @@ -128,6 +128,12 @@ void kfd_process_dequeue_from_all_devices(struct kfd_process *p)
>>    {
>>    	struct kfd_process_device *pdd;
>>    
>> +	/* If suspend/resume got stuck, dqm_lock is hold,
>> +	 * skip process_termination_cpsch to avoid deadlock
>> +	 */
>> +	if (kfd_is_locked())
>> +		return;
>> +
> Holding the DQM lock during reset has caused other problems (lock dependency issues and deadlocks) and I was thinking about getting rid of that completely. The intention of holding the DQM lock during reset was to prevent the device queue manager from accessing the CP hardware while a reset was in progress. However, I think there are smarter ways to achieve that. We already get a pre-reset callback (kgd2kfd_pre_reset) that executes the kgd2kfd_suspend, which suspends processes and stops DQM through kfd->dqm->ops.stop(kfd->dqm). This should take care of most of the problem. If there are any places in DQM that try to access the devices, they should add conditions to not access HW while DQM is stopped. Then we could avoid holding a lock indefinitely while a reset is in progress.
>
> The DQM lock is particularly problematic in terms of lock dependencies because it can be taken in MMU notifiers. We want to avoid taking any other locks while holding the DQM lock. Holding the DQM lock for a long time during reset is counterproductive to that objective.
>
> Regards,
>     Felix
>
>
>>    	list_for_each_entry(pdd, &p->per_device_data, per_device_list)
>>    		kfd_process_dequeue_from_device(pdd);
>>    }
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


More information about the amd-gfx mailing list