[PATCH 3/4] drm/radeon: consolidate uvd/vce initialization, resume and suspend.

Wed Mar 16 18:51:57 UTC 2016

Am 16.03.2016 um 18:43 schrieb Jerome Glisse:
> On Wed, Mar 16, 2016 at 06:06:36PM +0100, Christian König wrote:
>> [SNIP]
>> And as I said that is exactly what you should NOT be doing here. Once the
>> firmware is loaded the block should be kept in that state.
>>
>> Freeing the memory allocated for the firmware is also not a good idea at all
>> because we don't know who exactly is accessing it.
> And what i am saying is that block is not responding so it does not matter
> what memory endup there as it only try to read from the firmware afaict.
> So nothing bad will come from that, block is already in non functional
> state what worse can happen ?

That the block is not responding to CPU register accesses just means 
that the SRBM read side interface doesn't work for some reason.

Depending on what the cause is the micro engine usually still happily 
read and write memory when that happens.

When the whole engine is hung you usually don't get working power 
management as well.

>>> [SNIP]
>>>
>>> Again lot simpler to follow control flow than to jump through various level
>>> of if/else. Again uvd and vce tied together (and again i can untie them if
>>> you think it is better to untie them).
>>>
>>> But this time the extra thing is that i properly disable ring if any error
>>> happens while existing code does not.
>> And again that is exactly what we should NOT do.
>>
>> When initialization fails we don't know in which state the ring buffer and
>> micro engines are, so freeing them and giving the space back to be reused in
>> clearly not a good idea.
>>
>> All we should do is clearing the ready flag when something fails to prevent
>> userspace from making command submissions to the failed engine.
> This block only read from that memory ! Never write (unless it modify the fw
> in place which would surprise me). So when block is in bad states it does not
> matter what endup there, block is not processing anything.

No, the memory is quite heavily written as well. Only BAR0 is code, BAR1 
is the heap and BAR2 the stack of the micro engine.

Those are usually heavily accessed by the interrupt handlers even when 
the engine is completely stuck otherwise.

The only way I know to prevent that is to stall the both the register as 
well as the memory bus, wait for a moment and then put the micro engine 
into soft reset. See uvd_v1_0_stop() how that is done.

But as I said after that you can usually forget power management as well.

>>> I am not pasting the init path but it the same logic, tying uvd and vce
>>> together and simplifying error code path.
>>>
>>>
>>>
>>>> Please also note that VCE/UVD has dependencies on power management, so that
>>>> when they are once initialized they should NOT be turned off again.
>>>>
>>>> I only briefly skimmed over your patch, but it actually looks like to me
>>>> that you broken that by trying to cleanup the initialization routine.
>>> I have seen that but assuming Heisenbergs does not get involve, then given
>>> that the block is not responding to register write it is unlikely that thing
>>> will we worse if we try to disable the block. And from my testing it does
>>> not impact power management. My guess is that the block keep reporting it is
>>> busy and that power gating and clock gating are inhibited by that.
>> It's more complicated than that just a simple busy signal. The engines
>> actively communicate with the power management controller to tell them their
>> needs and limits for the clocks based on the workload they have.
>>
>> Once initialized the power management controller expects the UVD and VCE
>> micro-controllers to answer such requests.
>>
>> Failing to do so can get you stuck at a specific power level.
> Yes and what i am saying is that block is in bad states so either it is
> answering always busy and thus stop power level change or it is not answering
> and power level can change. So trying to disable the block leads to the
> exact same situation, either the block shutdown and power level can change
> or it does not and thing are exactly as if we did nothing !

I've tested that multiple times and as long as you don't turn of the 
VCPU completely it's usually still answering the interrupt handler even 
when it is stuck in an endless loop otherwise and not doing anything else.

That's also the reason why it is so dangerous to just put the block into 
reset in this state or touch it otherwise.

You can't stall it because the SRBM interface is borked and you don't 
have access to the registers any more and you can't properly reset it 
either because when you do that in the middle of a memory transaction 
the whole system goes down.

[SNIP]
>> As far as I can see you're actually messing the error handling up quite a
>> bit here instead of improving it.
> I am not ! I AM MAKING IT CLEAR WHAT HAPPENS. YES ON ERROR I TRY TO DISABLE
> THING BUT IF BLOCK IS NOT ANSWERING THIS CAN NOT MAKE THING WORSE.

Calm down, I'm just trying to explain here how the hardware works and 
why your approach can cause trouble as well.

I'm perfectly fine with the cleanups on the control flow and the coding 
style as long as we can properly test them on at least some hardware 
generations.

>> So please describe in detail what the problems you are seeing and why
>> disabling both UVD and VCE helps with them.
> I said it above already, on second part when we fails uvd/vce we do not
> disable ring and things goes from bad to worse.
>
>> A kernel log from a failed suspend/resume cycle would help quite a bit here.
> [   62.111042] [drm] ring test on 5 succeeded in 2 usecs
> [   62.111046] [drm] UVD initialized successfully.
> [   62.111080] [drm] ib test on ring 0 succeeded in 0 usecs
> [   62.111110] [drm] ib test on ring 1 succeeded in 0 usecs
> [   62.111137] [drm] ib test on ring 2 succeeded in 0 usecs
> [   62.111198] [drm] ib test on ring 3 succeeded in 0 usecs
> [   62.111257] [drm] ib test on ring 4 succeeded in 0 usecs
> [   62.309312] usb 1-7: reset high-speed USB device number 3 using xhci_hcd
> [   62.362819] ACPI: \_SB_.PCI0.LPCB.SIO_.COM1: ACPI_NOTIFY_BUS_CHECK event
> [   62.362824] ACPI: \_SB_.PCI0.RP05: ACPI_NOTIFY_BUS_CHECK event
> [   62.362825] ACPI: \_SB_.PCI0.EHC1: ACPI_NOTIFY_BUS_CHECK event
> [   62.362827] ACPI: \_SB_.PCI0.EHC2: ACPI_NOTIFY_BUS_CHECK event
> [   62.626546] usb 1-12: reset full-speed USB device number 4 using xhci_hcd
> [   62.793373] parport_pc 00:08: activated
> [   62.858604] tpm_tis 00:0c: TPM is disabled/deactivated (0x7)
> [   63.625310] psmouse serio2: alps: Unknown ALPS touchpad: E7=10 00 64, EC=10 00 64
> [   64.826916] psmouse serio3: synaptics: queried max coordinates: x [..5660], y [..4730]
> [   64.858652] psmouse serio3: synaptics: queried min coordinates: x [1324..], y [1248..]
> [   72.288487] radeon 0000:01:00.0: ring 5 stalled for more than 10020msec
> [   72.288489] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000002 last fence id 0x0000000000000004 on ring 5)
> [   72.288577] [drm:uvd_v1_0_ib_test [radeon]] *ERROR* radeon: fence wait failed (-35).
> [   72.288594] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on ring 5 (-35).
> [   72.473034] [drm:si_dpm_set_power_state [radeon]] *ERROR* si_set_sw_state failed
> [   73.452173] [drm:radeon_dp_link_train [radeon]] *ERROR* displayport link status failed
> [   73.452197] [drm:radeon_dp_link_train [radeon]] *ERROR* clock recovery failed
> [   74.296728] [drm:radeon_dp_link_train [radeon]] *ERROR* displayport link status failed
> [   74.296744] [drm:radeon_dp_link_train [radeon]] *ERROR* clock recovery failed
>
>
> So first par of uvd succeed but vce_v1_0_resume() fails and after that uvd fails too
> (vce too). I doubt you can fix that until you can fix the firmware signature check
> issue. What happens is that on resume from hibernation the first kernel load radeon
> and initialize the GPU (this work properly and VCE block is initialized properly).

I'm not sure if that is actually an problem with the firmware signature 
check, but let's focus on the issue at hand first.

> But then it resume the hibernated kernel (kexec) and the radeon resume code path by
> reprogramming the memory configuration of the GPU make the fw unaccessible or at
> different address compare to first boot kernel and thing from that point the vce
> block is in bad state (impacting the uvd block afaict). This trickle down to other
> block like displayport and you endup with no display.

Think about it, that's exactly as I explained above.

The VCPU (UVD) and ECPU (VCE) keeps running after the first boot cycle, 
even when they are not accessible from the CPU any more for some reason.

Because of this the power management can still switch clocks and turn 
blocks on and off. etc... and that's why the memory clock can be raised 
to match your resolution on the screen and DP link training still works.

Now on the second cycle we somehow totally crash them and because of 
that the power management can't raise the clocks any more:
> [   72.473034] [drm:si_dpm_set_power_state [radeon]] *ERROR* si_set_sw_state failed

That results in not enough memory bandwidth or a PLL being powered down 
and so DP link training fails as well.

As a band aid I suggest to just set has_uvd to false when either of the 
vce resume or the uvd resume fails.

Also feel free to cleanup the control flow on coding style as long as 
you don't try to free the memory for the micro engines or try to touch 
the hardware blocks when something fails.

Regards,
Christian.