[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Wed Aug 14 17:30:55 UTC 2019


https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #100 from Tom B <tom at r.je> ---
I've been trying to work backwards to find the place where screens get
initialised and eventually call vega20_pre_display_configuration_changed_task.

vega20_pre_display_configuration_changed_task is exported as
pp_hwmgr_func::display_config_changed, which is called from
hardwaremanager.c:phm_pre_display_configuration_changed.

phm_pre_display_configuration_changed is called from
hwmgr.c:hwmgr_handle_task:

        switch (task_id) {
        case AMD_PP_TASK_DISPLAY_CONFIG_CHANGE:
                ret = phm_pre_display_configuration_changed(hwmgr);


pp_dpm_dispatch_tasks is exported as amd_pm_funcs::dispatch_tasks, which is
called from amdgpu_dpm_dispatch_task, which in turn is called in amdgpu_pm.c:


void amdgpu_pm_compute_clocks(struct amdgpu_device *adev)
{
        int i = 0;

        if (!adev->pm.dpm_enabled)
                return;

        if (adev->mode_info.num_crtc)
                amdgpu_display_bandwidth_update(adev);

        for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
                struct amdgpu_ring *ring = adev->rings[i];
                if (ring && ring->sched.ready)
                        amdgpu_fence_wait_empty(ring);
        }

        if (is_support_sw_smu(adev)) {
                struct smu_context *smu = &adev->smu;
                struct smu_dpm_context *smu_dpm = &adev->smu.smu_dpm;
                mutex_lock(&(smu->mutex));
                smu_handle_task(&adev->smu,
                                smu_dpm->dpm_level,
                                AMD_PP_TASK_DISPLAY_CONFIG_CHANGE);
                mutex_unlock(&(smu->mutex));
        } else {
                if (adev->powerplay.pp_funcs->dispatch_tasks) {
                        if (!amdgpu_device_has_dc_support(adev)) {
                                mutex_lock(&adev->pm.mutex);
                                amdgpu_dpm_get_active_displays(adev);
                                adev->pm.pm_display_cfg.num_display =
                                        adev->pm.dpm.new_active_crtc_count;
                                adev->pm.pm_display_cfg.vrefresh =
                                        amdgpu_dpm_get_vrefresh(adev);
                                adev->pm.pm_display_cfg.min_vblank_time =
                                        amdgpu_dpm_get_vblank_time(adev);
                                /* we have issues with mclk switching with
                                 * refresh rates over 120 hz on the non-DC
                                 * code. */
                                if (adev->pm.pm_display_cfg.vrefresh > 120)
                                        adev->pm.pm_display_cfg.min_vblank_time = 0;
                                if (adev->powerplay.pp_funcs->display_configuration_change)
                                        adev->powerplay.pp_funcs->display_configuration_change(
                                                adev->powerplay.pp_handle,
                                                &adev->pm.pm_display_cfg);
                                mutex_unlock(&adev->pm.mutex);
                        }
                        amdgpu_dpm_dispatch_task(adev,
                                        AMD_PP_TASK_DISPLAY_CONFIG_CHANGE, NULL);
                } else {
                        mutex_lock(&adev->pm.mutex);
                        amdgpu_dpm_get_active_displays(adev);
                        amdgpu_dpm_change_power_state_locked(adev);
                        mutex_unlock(&adev->pm.mutex);
                }
        }
}


This is the only place I can see AMD_PP_TASK_DISPLAY_CONFIG_CHANGE being
dispatched from, and it is eventually how
vega20_pre_display_configuration_changed_task gets called.

Presumably the code:

        for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
                struct amdgpu_ring *ring = adev->rings[i];
                if (ring && ring->sched.ready)
                        amdgpu_fence_wait_empty(ring);
        }



is what generates the following in dmesg:

[    3.683718] amdgpu 0000:44:00.0: ring gfx uses VM inv eng 0 on hub 0
[    3.683719] amdgpu 0000:44:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    3.683720] amdgpu 0000:44:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    3.683720] amdgpu 0000:44:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    3.683721] amdgpu 0000:44:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    3.683722] amdgpu 0000:44:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    3.683722] amdgpu 0000:44:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    3.683723] amdgpu 0000:44:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    3.683724] amdgpu 0000:44:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    3.683724] amdgpu 0000:44:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    3.683725] amdgpu 0000:44:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[    3.683726] amdgpu 0000:44:00.0: ring page0 uses VM inv eng 1 on hub 1
[    3.683726] amdgpu 0000:44:00.0: ring sdma1 uses VM inv eng 4 on hub 1
[    3.683727] amdgpu 0000:44:00.0: ring page1 uses VM inv eng 5 on hub 1
[    3.683728] amdgpu 0000:44:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
[    3.683728] amdgpu 0000:44:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[    3.683729] amdgpu 0000:44:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[    3.683730] amdgpu 0000:44:00.0: ring uvd_1 uses VM inv eng 9 on hub 1
[    3.683730] amdgpu 0000:44:00.0: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
[    3.683731] amdgpu 0000:44:00.0: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
[    3.683731] amdgpu 0000:44:00.0: ring vce0 uses VM inv eng 12 on hub 1
[    3.683732] amdgpu 0000:44:00.0: ring vce1 uses VM inv eng 13 on hub 1
[    3.683733] amdgpu 0000:44:00.0: ring vce2 uses VM inv eng 14 on hub 1

I'll add a pr_err() to verify this. If so, it means our issue is introduced
somewhere between that for loop and amdgpu_dpm_dispatch_task in this
function.


amdgpu_pm_compute_clocks is called from
amdgpu_dm_pp_smu.c:dm_pp_apply_display_requirements, which is called in
dce_clk_mgr.c in two places: dce_pplib_apply_display_requirements and
dce11_pplib_apply_display_requirements. I don't know which is used for the
Radeon VII; I'll add some logging to verify.

But here's something that may be relevant to this bug. In
dce11_pplib_apply_display_requirements there's a check for the number of
displays:


        /* TODO: is this still applicable?*/
        if (pp_display_cfg->display_count == 1) {
                const struct dc_crtc_timing *timing =
                        &context->streams[0]->timing;

                pp_display_cfg->crtc_index =
                        pp_display_cfg->disp_configs[0].pipe_idx;
                pp_display_cfg->line_time_in_us = timing->h_total * 10000 /
                                timing->pix_clk_100hz;
        }


So there's something that is different when more than one display is connected.
That's as far as I got walking backwards through the code. I'll note that this
was also present in 5.0.1, but it could be that something is relying on
crtc_index or line_time_in_us, which weren't checked previously, as these
values only appear to be set if there is a single display.

-- 
You are receiving this mail because:
You are the assignee for the bug.