[Freedreno] stable-rc 4.14: arm64: Internal error: Oops: clk_reparent __clk_set_parent_before on db410c

Arnd Bergmann arnd at arndb.de
Tue Jul 21 09:51:32 UTC 2020


On Tue, Jul 21, 2020 at 10:32 AM Naresh Kamboju
<naresh.kamboju at linaro.org> wrote:
>
> Kernel Internal oops while booting stable-rc 4.14 kernel on qcom db410c device
> this problem happened only once on this specific platform.
> and  rcu_preempt detected stalls on CPUs/tasks detected after this and
> board hung.
>
> metadata:
>   git branch: linux-4.14.y
>   git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
>   git commit: 5b1e982af0f810358664827a6333affb4f5d8eb5
>   git describe: v4.14.188-126-g5b1e982af0f8
>   make_kernelversion: 4.14.189-rc1
>   kernel-config:
> https://builds.tuxbuild.com/12PM71zBW-5EAp5ztC_yxg/kernel.config

I looked at the object file at
https://builds.tuxbuild.com/12PM71zBW-5EAp5ztC_yxg/vmlinux.xz

> Crash dump:
>
> [    5.424958] Unable to handle kernel paging request at virtual
> address 00001008
> [    5.435485] Mem abort info:
> [    5.442509]   Exception class = DABT (current EL), IL = 32 bits
> [    5.445203]   SET = 0, FnV = 0
> [[    5.451101]   EA = 0, S1PTW =[    5.454226] Data abort info:
> [    5.457264]   ISV = 0, ISS = 0x00000044
> [    5.460390]   CM = 0, WnR = 1
> [    5.463951] user pgtable: 4k pages, 48-bit VAs, pgd = ffff80003d66d000
> [    5.467078] [0000000000001008] *pgd=0000000000000000
> [    5.473503] Internal error: Oops: 96000044 [#1] PREEMPT SMP
> [    5.479838] Modules linked in: adv7511 msm mdt_loader msm_rng
> drm_kms_helper rng_core drm fuse
> [    5.485405] Process kworker/2:0 (pid: 21, stack limit = 0xffff000009450000)
> [    5.494090] CPU: 2 PID: 21 Comm: kworker/2:0 Not tainted 4.14.189-rc1 #1
> [    5.501036] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
> [    5.507996] Workqueue: events deferred_probe_work_func
> [    5.514935] task: ffff80003d454380 task.stack: ffff000009450000
> [    5.520155] pc : clk_reparent+0x60/0xe8
> [    5.526058] lr : __clk_set_parent_before+0x40/0x88
> [    5.529882] sp : ffff000009453640 pstate : 800001c5
> [    5.534748] x29: ffff000009453640 x28: ffff0000090b7000
> [    5.539615] x27: ffff80003fe7c478 x26: ffff0000094537a8
> [    5.545175] x25: 0000000000000001 x24: ffff000009239038
> [    5.550736] x23: ffff80003b6be688 x22: 0000000000000000
> [    5.556297] x21: 0000000000000000 x20: ffff80003c9d8c00
> [    5.561858] x19: ffff80003d798900 x18: 00000000fffffffe
> [    5.567419] x17: 0000ffff7fdbb6a0 x16: ffff00000821ad98
> [    5.572980] x15: 0000000000000001 x14: ffffffffffffffff
> [    5.578540] x13: ffff0000094537c8 x12: 0000000000000010
> [    5.584102] x11: 0000000000000010 x10: 0101010101010101
> [    5.589663] x9 : 0000000000000000 x8 : 7f7f7f7f7f7f7f7f
> [    5.595223] x7 : fefefefefeff6e77 x6 : 0000000000000140
> [    5.600784] x5 : 0000000000000001 x4 : ffff80003c9d8c00
> [    5.606344] x3 : ffff80003d798900 x2 : 0000000000000004
> [    5.611905] x1 : ffff80003d7989a8 x0 : 0000000000001000
> [    5.617467] Call trace:
> [    5.623030]  clk_reparent+0x60/0xe8

We are here:

static void clk_reparent(struct clk_core *core, struct clk_core *new_parent)
{
        bool was_orphan = core->orphan;

        hlist_del(&core->child_node);

        if (new_parent) {
                bool becomes_orphan = new_parent->orphan;

                /* avoid duplicate POST_RATE_CHANGE notifications */
                if (new_parent->new_child == core)
                        new_parent->new_child = NULL;

                hlist_add_head(&core->child_node, &new_parent->children);


and "new_parent->children.first" points to 0x1000, which is not
a valid pointer, so the "first->pprev = &n->next" in hlist_add_head()
causes a crash.

> [    5.625465]  __clk_set_parent_before+0x40/0x88
> [    5.628943]  clk_register+0x330/0x618

(speculation) the most likely call chain here would be __clk_core_init(),
implying that there is some random other orphan clock that has a
broken parent:

        /*
         * walk the list of orphan clocks and reparent any that newly finds a
         * parent.
         */
        hlist_for_each_entry_safe(orphan, tmp2, &clk_orphan_list, child_node) {
                struct clk_core *parent = __clk_init_parent(orphan);

                /*
                 * We need to use __clk_set_parent_before() and _after() to
                 * to properly migrate any prepare/enable count of the orphan
                 * clock. This is important for CLK_IS_CRITICAL clocks, which
                 * are enabled during init but might not have a parent yet.
                 */
                if (parent) {
                        /* update the clk tree topology */
                        __clk_set_parent_before(orphan, parent);

None of the above have changed in stable kernels.

> [    5.633668]  pll_28nm_register+0xa4/0x340 [msm]
> [    5.637492]  msm_dsi_pll_28nm_init+0xc8/0x1d8 [msm]
> [    5.642007]  msm_dsi_pll_init+0x34/0xe0 [msm]
> [    5.646870]  dsi_phy_driver_probe+0x1cc/0x310 [msm]

The only changes to the dsi driver in v4.14-stable were:

89e30bb46074 drm/msm/dsi: save pll state before dsi host is powered off
892afde0f4a1 drm: msm: Fix return type of dsi_mgr_connector_mode_valid for kCFI
35ff594b0da2 drm/msm/dsi: Implement reset correctly
5151a0c8d730 drm/msm/dsi: use correct enum in dsi_get_cmd_fmt
e6bc3a4b0c23 clk: divider: fix incorrect usage of container_of

None of these look suspicious to me.

      Arnd

> [    5.651196]  platform_drv_probe+0x58/0xb8
> [    5.656060]  driver_probe_device+0x228/0x2d0
> [    5.660231]  __device_attach_driver+0xb8/0xe8
> [    5.664750]  bus_for_each_drv+0x64/0xa0
> [    5.669269]  __device_attach+0xcc/0x138
> [    5.673093]  device_initial_probe+0x10/0x18
> [    5.676918]  bus_probe_device+0x90/0x98
> [    5.681088]  device_add+0x3c4/0x5a8
> [    5.684915]  of_device_add+0x40/0x58
> [    5.688392]  of_platform_device_create_pdata+0x80/0xe8
> [    5.692219]  of_platform_bus_create+0xd4/0x308
> [    5.697432]  of_platform_populate+0x48/0xb8
> [    5.702143]  msm_pdev_probe+0x3c/0x328 [msm]
> [    5.706125]  platform_drv_probe+0x58/0xb8
> [    5.710642]  driver_probe_device+0x228/0x2d0
> [    5.714814]  __device_attach_driver+0xb8/0xe8
> [    5.719334]  bus_for_each_drv+0x64/0xa0
> [    5.723852]  __device_attach+0xcc/0x138
> [    5.727677]  device_initial_probe+0x10/0x18
> [    5.731502]  bus_probe_device+0x90/0x98
> [    5.735675]  deferred_probe_work_func+0xa4/0x140

It's worth noting that this is a deferred probe, so maybe
there is a race in the clk registration that gets hit if two drivers
register a clock at the same time?

> [    5.739502]  process_one_work+0x19c/0x300
> [    5.744366]  worker_thread+0x4c/0x420
> [    5.748539]  kthread+0x100/0x130
> [    5.752362]  ret_from_fork+0x10/0x1c
> [    5.755842] Code: 54000260 f9405080 f9005460 b4000040 (f9000401)
> [    5.759669] ---[ end trace 6d70d7dd8a236384 ]---
> [    5.765922] note: kworker/2:0[21] exited with preempt_count 1
> [   26.777168] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [   26.777204] 0-...: (1 GPs behind) idle=2fa/140000000000001/0
> softirq=1679/1723 fqs=2625
> [   26.783112] 1-...: (1 GPs behind) idle=53a/140000000000000/0
> softirq=1946/1958 fqs=2625
> [   26.791444] (detected by 3, t=5252 jiffies, g=58, c=57, q=2362)
> [   26.799781] Task dump for CPU 0:
> [   26.806033] systemd-udevd   R  running task        0  2533      1 0x00000202
> [   26.809515] Call trace:
> [   26.816814]  __switch_to+0xe8/0x148
> [   26.819248]  __wake_up_common+0x80/0x170
> [   26.822724]  __wake_up_common_lock+0x7c/0xa8
> [   26.826897]  __wake_up_sync_key+0x1c/0x28
> [   26.831416]  sock_def_readable+0x40/0x88
> [   26.835584] Task dump for CPU 1:
> [   26.839755] systemd         R  running task        0     1      0 0x0000000a
> [   26.843237] Call trace:
> [   26.850531]  __switch_to+0xe8/0x148
> [   26.852968]  0xffff800009fc8000
> [   26.856467] INFO: rcu_sched detected stalls on CPUs/tasks:
> [   26.859584] 0-...: (1 GPs behind) idle=2fa/140000000000001/0
> softirq=1722/1723 fqs=2625
> [   26.865145] 1-...: (1 GPs behind) idle=53a/140000000000000/0
> softirq=1957/1958 fqs=2625
> [   26.873477] (detected by 3, t=5270 jiffies, g=-150, c=-151, q=3)
> [   26.881815] Task dump for CPU 0:
> [   26.888067] systemd-udevd   R  running task        0  2533      1 0x00000202
> [   26.891549] Call trace:
> [   26.898845]  __switch_to+0xe8/0x148
> [   26.901281]  __wake_up_common+0x80/0x170
> [   26.904758]  __wake_up_common_lock+0x7c/0xa8
> [   26.908931]  __wake_up_sync_key+0x1c/0x28
> [   26.913449]  sock_def_readable+0x40/0x88
> [   26.917618] Task dump for CPU 1:
> [   26.921790] systemd         R  running task        0     1      0 0x0000000a
> [   26.925271] Call trace:
> [   26.932565]  __switch_to+0xe8/0x148
> [   26.935002]  0xffff800009fc8000
> [   41.449201] random: crng init done
> [   41.449221] random: 7 urandom warning(s) missed due to ratelimiting
> [   89.797164] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [   89.797195] 0-...: (1 GPs behind) idle=2fa/140000000000001/0
> softirq=1679/1723 fqs=10482
> [   89.803103] 1-...: (1 GPs behind) idle=53a/140000000000000/0
> softirq=1946/1958 fqs=10482
> [   89.811437] (detected by 3, t=21007 jiffies, g=58, c=57, q=2578)
> [   89.819773] Task dump for CPU 0:
> [   89.826027] systemd-udevd   R  running task        0  2533      1 0x00000202
> [   89.829508] Call trace:
> [   89.836807]  __switch_to+0xe8/0x148
> [   89.839241]  __wake_up_common+0x80/0x170
> [   89.842717]  __wake_up_common_lock+0x7c/0xa8
> [   89.846891]  __wake_up_sync_key+0x1c/0x28
> [   89.851410]  sock_def_readable+0x40/0x88
> <System Hung>
>
> Full test log,
> https://qa-reports.linaro.org/lkft/linux-stable-rc-4.14-oe/build/v4.14.188-126-g5b1e982af0f8/testrun/2969667/suite/linux-log-parser/test/check-kernel-oops-1592908/log
>
>
> --
> Linaro LKFT
> https://lkft.linaro.org


More information about the Freedreno mailing list