[Intel-gfx] 4.10-rc2 oops in DRM connector code

Daniel Vetter daniel at ffwll.ch
Mon Jan 9 16:59:15 UTC 2017

On Mon, Jan 9, 2017 at 5:50 PM, Dave Hansen <dave.hansen at intel.com> wrote:
> On 01/09/2017 08:41 AM, Daniel Vetter wrote:
>> On Mon, Jan 9, 2017 at 2:40 PM, Dave Hansen <dave.hansen at intel.com> wrote:
>>> Well, now I found where the -2 comes from.
>>> intel_dp_register_mst_connector() calls drm_connector_register(), which
>>> fails to add the kobject (warning below).  But, it does zero error
>>> checking on the drm_connector_register() call and leaves the
>>> partially-constructed connector in place.
>>> The next time some poor, hapless code goes and tries to do anything with
>>> that kdev, they oops.  I'm perplexed by this, though.  The
>>> drm_dp_mst_topology_cbs->register_connector just returns void.  It seems
>>> a bit goofy that it can't even _return_ failure.
>>> Is there some stable code to go back to here?  Or, is there something
>>> about my configuration that's unique?  I really wonder why nobody else
>>> is running into this.
>>> There's probably some other race going on here.  This warning doesn't
>>> happen on every boot.
>> This smells more like the root-cause: Something goes wrong on boot
>> that prevents connectors from properly registering, then we fall over
>> later on. And the register callback is intentionally void, assuming
>> that any prep work has been done earlier and that therefore the
>> register step can't fail. Can you pls check whether the oops later on
>> only happens together with this warning at boot, or whether they're
>> not correlated?
> Looking through my logs, I can't find any instance of the oops without
> the warning at boot.  So I do think the later oops is entirely caused by
> the issue warned about in early boot.

Hm, I guess then we'd need to fix that boot-up warning. Can you try to
figure out why it's unhappy? My hunch is that we call
drm_connector_register from the mst probe worker before the main
driver load thread has reached the drm_dev_register call. A few printks
to decide whether that's the case (plus a few boot-up tests to gather
the statistics, sorry about that) would be really great.
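
To illustrate the kind of ordering check I mean, here is a rough userspace
sketch, not kernel code. Everything in it (struct model_dev,
model_dev_register, model_connector_register) is a made-up stand-in for
illustration, not the actual drm/i915 structures or functions. In the real
driver the equivalent would be printks around the drm_dev_register and
drm_connector_register calls.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the device; NOT the real struct drm_device. */
struct model_dev {
	bool registered;   /* set once the load path has registered the device */
	int early_calls;   /* connector registrations seen before that point */
	int late_calls;    /* connector registrations seen after that point */
};

/* Models the point where the main driver load thread registers the device. */
static void model_dev_register(struct model_dev *dev)
{
	dev->registered = true;
	fprintf(stderr, "load: device registered\n");
}

/*
 * Models a connector registration attempt, e.g. from a probe worker.
 * Logs whether the device was already registered and returns true when
 * the call came too early.
 */
static bool model_connector_register(struct model_dev *dev, const char *who)
{
	bool early = !dev->registered;

	fprintf(stderr, "%s: connector register, device %s\n", who,
		early ? "NOT yet registered" : "already registered");

	if (early)
		dev->early_calls++;
	else
		dev->late_calls++;

	return early;
}
```

Counting early vs. late calls over a handful of boots would show whether
the warning only shows up on boots where the worker got there first.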

If that's inconclusive I'm again a bit low on ideas ...

> My distro kernel (4.4.0-57-generic) is also unstable, but I haven't
> managed to capture a good oops there.  It's hitting this, which I assume
> is unrelated:
>         WARNING: CPU: 0 PID: 41 at /build/linux-lts-xenial-FdAdUy/linux-lts-xenial-4.4.0/ubuntu/i915/intel_pm.c:3675
>         skl_update_other_pipe_wm+0x191/0x1a0 [i915_bpo]()

Those are watermark (wm) programming issues, which will kill your box.
You need a newer kernel to fix both the wm programming issues themselves
and the fact that wm programming issues lead to system death.
