[PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

Thu Mar 1 06:01:39 UTC 2018

> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.

I don't understand how fill_buffer() could run when ring->ring_obj is not even allocated. Where does such a case occur?

/Monk

-----Original Message-----
From: Koenig, Christian 
Sent: February 28, 2018 20:46
To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

Good point, but in this case we need some other handling.

Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
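
As a rough sketch of how the two tests differ (a user-space mock, not the driver code: struct mock_ring and both helper names are made up for illustration, and only the ring_obj and ready fields come from this thread), assuming a ring under GPU reset stays allocated but is temporarily not ready:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two struct amdgpu_ring fields discussed
 * in this thread; the real structure has many more members. */
struct mock_ring {
	void *ring_obj;	/* non-NULL once the ring buffer is allocated */
	bool ready;	/* assumed false while a GPU reset is in progress */
};

/* Existing test: bail out whenever the ring is not ready. */
static inline int precheck_ready(const struct mock_ring *ring)
{
	if (!ring->ready)
		return -EINVAL;
	return 0;
}

/* Suggested test: bail out only if the ring was never allocated. */
static inline int precheck_allocated(const struct mock_ring *ring)
{
	if (!ring->ring_obj)
		return -EINVAL;
	return 0;
}
```

Under that assumption, a ring in the middle of a reset passes precheck_allocated() but is rejected by precheck_ready(), while a ring that was never allocated is rejected by both.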

And you also need to fix a couple of other places in amdgpu_ttm.c.

Regards,
Christian.

On 28.02.2018 at 13:34, Liu, Monk wrote:
> Because when SDMA is hung by, say, process A, another process B may
> already be running inside fill_buffer(). So just let process B
> continue; don't block it, otherwise process B would fail for a purely
> software reason.
>
> Let it run: process B's job will eventually fail, and GPU recovery will
> run it again (since it is a kernel job).
>
> Without this solution, other processes are greatly harmed by one
> black sheep that triggers GPU recovery.
>
> /Monk
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
> Sent: February 28, 2018 20:24
> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
> for fill_buffer
>
> On 28.02.2018 at 08:21, Monk Liu wrote:
>> Because SDMA may be under GPU reset at this time, its ring->ready may
>> not be true. Keep going; the GPU scheduler will reschedule this job if
>> it fails.
>>
>> Give a warning in copy_buffer when going through direct_submit while
>> ring->ready is false.
> NAK, that test has already saved us quite a bunch of trouble with the fb layer.
>
> Why exactly are you running into issues with that?
>
> Christian.
>
>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>    1 file changed, 1 insertion(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index e38e6db..7b75ac9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>>    	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>    	WARN_ON(job->ibs[0].length_dw > num_dw);
>>    	if (direct_submit) {
>> +		WARN_ON(!ring->ready);
>>    		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>    				       NULL, fence);
>>    		job->fence = dma_fence_get(*fence);
>> @@ -1692,11 +1693,6 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>    	struct amdgpu_job *job;
>>    	int r;
>>    
>> -	if (!ring->ready) {
>> -		DRM_ERROR("Trying to clear memory with ring turned off.\n");
>> -		return -EINVAL;
>> -	}
>> -
>>    	if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>    		r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>    		if (r)