[Freedreno] drm/msm: clock issue

Mon Oct 30 15:55:14 UTC 2017

I hit on a issue with the db820c while testing the 4.15 next push on Friday and
I wanted to write down what I've see so far.

While running a series of tests, I was getting a AHB timeout that resulted in a
reboot (which is typically a unclocked access). A bit of experimentation seemed
to point the finger at pm_runtime. Our current code has a
pm_runtime_get_sync() / pm_runtime_put_sync() wrapping iommu_unmap calls. This
was to ensure that the iommu was clocked for during unmap so it can do a TLB
invalidate. Removing these calls eliminated the problem.

Next, I tried removing the code that disables the GPU clock/regulator disable
from the pm_runtime callbacks. This still resulted in a AHB timeout but then I
discovered that returning error from the PM runtime call *does* work (doesn't
hang).

I then turned on tracepoints for pm runtime and inserted a few trace_printk()
calls in our code. This is the pm_runtime_put_sync() called from
msm_iommu_unmap(). You will note that it is linked to the iommu driver through
a device_link:

    drm-api-test-5032  [003] ....   142.560420: msm_iommu_unmap: msm_iommu_unmap: pm_runtime_put_sync
    drm-api-test-5032  [003] d..1   142.560422: rpm_idle: b00000.gpu flags-4 cnt-0  dep-0  auto-1 p-0 irq-0 child-0
    drm-api-test-5032  [003] d..1   142.560423: rpm_return_int: rpm_idle+0x3c/0x380:b00000.gpu ret=0
    drm-api-test-5032  [003] d..1   142.560425: rpm_suspend: b00000.gpu flags-c cnt-0  dep-0  auto-1 p-0 irq-0 child-0
    drm-api-test-5032  [003] d..1   142.567942: rpm_idle: b40000.arm,smmu flags-5 cnt-0  dep-0  auto-1 p-0 irq-0 child-0
    drm-api-test-5032  [003] dn.1   142.567949: rpm_return_int: rpm_idle+0x214/0x380:b40000.arm,smmu ret=0
     kworker/3:0-26    [003] d..1   142.567965: rpm_idle: b40000.arm,smmu flags-2 cnt-0  dep-0  auto-1 p-0 irq-0 child-0
     kworker/3:0-26    [003] d..1   142.567968: rpm_return_int: rpm_idle+0x3c/0x380:b40000.arm,smmu ret=0
     kworker/3:0-26    [003] d..1   142.567969: rpm_suspend: b40000.arm,smmu flags-a cnt-0  dep-0  auto-1 p-0 irq-0 child-0
     kworker/3:0-26    [003] d..1   142.567978: rpm_idle: soc flags-1 cnt-0  dep-1  auto-1 p-0 irq-0 child-7
     kworker/3:0-26    [003] d..1   142.567979: rpm_return_int: rpm_idle+0x3c/0x380:soc ret=-13
     kworker/3:0-26    [003] d..1   142.567980: rpm_return_int: rpm_suspend+0x158/0x670:b40000.arm,smmu ret=0
    drm-api-test-5032  [003] d..1   142.567986: rpm_idle: soc flags-1 cnt-0  dep-1  auto-1 p-0 irq-0 child-6
    drm-api-test-5032  [003] d..1   142.567987: rpm_return_int: rpm_idle+0x3c/0x380:soc ret=-13
    drm-api-test-5032  [003] d..1   142.567988: rpm_return_int: rpm_suspend+0x158/0x670:b00000.gpu ret=0

And then the test tries to do a submission about 3 milliseconds later:

    drm-api-test-5032  [003] ....   142.570752: msm_gpu_submit: msm_gpu_submit: pm_runtime_get_sync
    drm-api-test-5032  [003] d..1   142.570760: rpm_resume: b00000.gpu flags-4 cnt-1  dep-0  auto-1 p-0 irq-0 child-0
    drm-api-test-5032  [003] d..1   142.570778: rpm_resume: b40000.arm,smmu flags-4 cnt-1  dep-0  auto-1 p-0 irq-0 child-0
    drm-api-test-5032  [003] d..1   142.571135: rpm_idle: b40000.arm,smmu flags-1 cnt-1  dep-0  auto-1 p-0 irq-0 child-0
    drm-api-test-5032  [003] d..1   142.571138: rpm_return_int: rpm_idle+0x3c/0x380:b40000.arm,smmu ret=-11
    drm-api-test-5032  [003] d..1   142.571146: rpm_return_int: rpm_resume+0x110/0x6b8:b40000.arm,smmu ret=0
    drm-api-test-5032  [003] ....   142.571920: msm_gpu_pm_resume: enable
    drm-api-test-5032  [003] d..1   142.575587: rpm_idle: b00000.gpu flags-1 cnt-1  dep-0  auto-1 p-0 irq-0 child-0
    drm-api-test-5032  [003] d..1   142.575590: rpm_return_int: rpm_idle+0x3c/0x380:b00000.gpu ret=-11
    drm-api-test-5032  [003] d..1   142.575600: rpm_idle: soc flags-5 cnt-0  dep-1  auto-1 p-0 irq-0 child-8
    drm-api-test-5032  [003] d..1   142.575601: rpm_return_int: rpm_idle+0x3c/0x380:soc ret=-13
    drm-api-test-5032  [003] d..1   142.575602: rpm_return_int: rpm_resume+0x110/0x6b8:b00000.gpu ret=0

Immediately following resume, the code goes into the GPU hardware init which
starts writing to the hardware. The kernel log from the same run as above shows
that we go into HW init immediately after resume returns.

[  142.575607] START HW_INIT

Rob and I ran tests that show we through several hardware writes in the init
before the AHB timeout.

My best guess from all of this is that the pm_runtime_put_sync() is triggering a
timeout somewhere up the clock tree and that timeout is not being canceled by
the subsequent pm_runtime_get_sync(). Despite the fact that the clocks are
enabled by resume (and also, in this particular test run, I don't ever disable
clocks) some other clock is being turned off at some point between the
rpm_suspend at 142.567988 and several microseconds into the hardware init at
some pointer after 142.575607.

I'm going to continue investigating but this isn't my area of expertise so if
any of the clock folks out there think they might know whats going on please
speak up here or in IRC.

Thanks,
Jordan

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project