[PATCH] drm/amdgpu: Fix S3 resume failure.
Andrey Grodzovsky
Andrey.Grodzovsky at amd.com
Thu Jul 19 22:45:55 UTC 2018
On 07/19/2018 04:17 PM, Andrey Grodzovsky wrote:
>
>
> On 07/19/2018 03:37 PM, Harry Wentland wrote:
>> On 2018-07-19 02:30 PM, Alex Deucher wrote:
>>> On Thu, Jul 19, 2018 at 1:07 PM, Andrey Grodzovsky
>>> <Andrey.Grodzovsky at amd.com> wrote:
>>>>
>>>> On 07/19/2018 12:59 PM, Michel Dänzer wrote:
>>>>> On 2018-07-19 06:53 PM, Andrey Grodzovsky wrote:
>>>>>>
>>>>>> On 07/19/2018 12:47 PM, Michel Dänzer wrote:
>>>>>>> On 2018-07-19 06:33 PM, Andrey Grodzovsky wrote:
>>>>>>>> On 07/19/2018 11:39 AM, Ville Syrjälä wrote:
>>>>>>>>> On Thu, Jul 19, 2018 at 11:19:56AM -0400, Andrey Grodzovsky
>>>>>>>>> wrote:
>>>>>>>>>> Problem:
>>>>>>>>>> The FB is still pinned during the first run of
>>>>>>>>>> amdgpu_bo_evict_vram and so it is left for the second run,
>>>>>>>>>> but by then the SDMA engine used for moving buffers around
>>>>>>>>>> is already disabled and the move has to be done with the
>>>>>>>>>> CPU. Since the FB is not in CPU-visible VRAM, the eviction
>>>>>>>>>> fails, which later leads to the resume failure.
>>>>>>>>>>
>>>>>>>>>> Fix:
>>>>>>>>>> When DAL is in use, get the pointer to the FB from
>>>>>>>>>> crtc->primary->state rather than from crtc->primary, which
>>>>>>>>>> is no longer set for DAL since it supports atomic KMS.
>>>>>>>>>>
>>>>>>>>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107065
>>>>>>>>>> Fixes: e00fb85 ("drm: Stop updating plane->crtc/fb/old_fb on
>>>>>>>>>> atomic drivers")
>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
>>>>>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> index 709e4a3..dd9ebf7 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> @@ -2642,7 +2642,8 @@ int amdgpu_device_suspend(struct
>>>>>>>>>> drm_device
>>>>>>>>>> *dev, bool suspend, bool fbcon)
>>>>>>>>>> /* unpin the front buffers and cursors */
>>>>>>>>>> list_for_each_entry(crtc,
>>>>>>>>>> &dev->mode_config.crtc_list, head)
>>>>>>>>>> {
>>>>>>>>>> struct amdgpu_crtc *amdgpu_crtc =
>>>>>>>>>> to_amdgpu_crtc(crtc);
>>>>>>>>>> - struct drm_framebuffer *fb = crtc->primary->fb;
>>>>>>>>>> + struct drm_framebuffer *fb =
>>>>>>>>>> amdgpu_device_has_dc_support(adev) ?
>>>>>>>>>> + crtc->primary->state->fb : crtc->primary->fb;
>>>>>>>>> So apparently you haven't yet turned off the planes here. If I'm
>>>>>>>>> reading things right amdgpu_device_ip_suspend() should end up
>>>>>>>>> doing
>>>>>>>>> that through drm_atomic_helper_suspend(). So it looks like now
>>>>>>>>> you'll end up unpinning the same bos twice. Doesn't that mess up
>>>>>>>>> some kind of refcount or something?
>>>>>>>> amdgpu_bo_unpin has a guard against that, amdgpu_bo_unreserve
>>>>>>>> is less
>>>>>>>> clear.
>>>>>>> BO reservation shouldn't be an issue here, BOs are only reserved for a
>>>>>>> short time around (un)pinning them.
>>>>>>>
>>>>>>>
>>>>>>>>> To me it would seem better to suspend the display before trying
>>>>>>>>> to evict the bos.
>>>>>>>> Yeah, I was aware of that, and indeed DAL shouldn't rely on the
>>>>>>>> code in amdgpu_device_suspend to unpin the front buffer and
>>>>>>>> cursor, since the atomic helper should do it. The problem is
>>>>>>>> that during amdgpu_device_ip_suspend the SDMA engine gets
>>>>>>>> suspended too, so you have to embed another eviction in
>>>>>>>> between, after the display is suspended but before SDMA. This
>>>>>>>> forces an ordering between them which is kind of already in
>>>>>>>> place (amd_ip_block_type), but it's still an extra constraint.
>>>>>>> Ville's point (which I basically agree with) is that the display
>>>>>>> hardware should be turned off before evicting VRAM the first
>>>>>>> time, in
>>>>>>> which case no second eviction should be necessary (for this
>>>>>>> purpose).
>>>>>> Display HW is turned off as part of all IPs in a loop inside
>>>>>> amdgpu_device_ip_suspend.
>>>>>> Are you suggesting to extract the display HW turn off from inside
>>>>>> amdgpu_device_ip_suspend and place it
>>>>>> before the first call to amdgpu_bo_evict_vram ?
>>>>> In a nutshell, yes.
>>>>>
>>>>> Or maybe it would be easier to move the amdgpu_bo_evict_vram call
>>>>> down
>>>>> to somewhere called from amdgpu_device_ip_suspend?
>>>>
>>>> I can move the BEFORE and AFTER calls to amdgpu_bo_evict_vram inside
>>>> amdgpu_device_ip_suspend
>>>> such that the first one is called AFTER display is shut off, while the
>>>> second in the very end of the function.
>>>> I am just not sure what the side effects of evicting will be after a
>>>> bunch of blocks (such as GMC) are already disabled.
>>> How about something like the attached patches? Basically split the ip
>>> suspend sequence in two like we do for resume.
>>>
>> Patches are
>> Acked-by: Harry Wentland <harry.wentland at amd.com>
>>
>> Harry
>>
>>> Alex
>
> Patches look good indeed, but on the second S3 in a row I get a warning
> in dma_fence_is_later about fence contexts not being equal. I will have
> to take a look at why that is.
>
> Andrey
Seems like amdgpu_ttm_set_buffer_funcs_status destroys adev->mman.entity
on suspend without releasing the adev->mman.bdev.man[TTM_PL_VRAM].move
fence, so on resume the new drm_sched_entity.fence_context causes the
warning against the old fence context, which is different.
BTW, this also happens with my original change, I just hadn't noticed.
Andrey
>
>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>