[Intel-gfx] [REGRESSION] Re: i915 driver crashes on T540p if docking station attached

Thu Jul 30 04:16:17 PDT 2015

On 30 July 2015 at 15:18, Linus Torvalds <torvalds at linux-foundation.org> wrote:
> On Wed, Jul 29, 2015 at 6:39 PM, Theodore Ts'o <tytso at mit.edu> wrote:
>>
>> It's here:  https://goo.gl/photos/xHjn2Z97JQEw6k2C9
>
> You didn't catch enough of the code line to decode the code, but it's
> early enough in drm_crtc_index() (just five bytes in) that it's almost
> certainly the very first dereference, so it's almost guaranteed to be
> that
>
>    crtc->dev
>
> access as part of list_for_each_entry(), with crtc being NULL. And
> yes, "->dev" is the very first field, so the offset is zero too (while
> the "->mode_config" list access would not be at offset zero).
>
> And it looks like it is called from drm_atomic_helper_check_modeset():
> the reason it has a question mark in the backtrace is because the
> fault happens before the stack frame has even been set up.
>
> There are multiple calls to "drm_crtc_index()" from that function, I
> can't tell which one it is. Looking at the code generation I get, I
> think it's because update_connector_routing() gets inlined, and that
> one does several calls. Most of them look like this:
>
>                 if (connector->state->crtc) {
>                         idx = drm_crtc_index(connector->state->crtc);
>
> ie they check that the crtc is non-NULL, but that last one does not:
>
>         connector_state->best_encoder = new_encoder;
>         idx = drm_crtc_index(connector_state->crtc);
>
>         crtc_state = state->crtc_states[idx];
>         crtc_state->mode_changed = true;
>
> and I suspect the fix might be something like the attached. Totally
> untested. Ted?
>
> This whole "atomic modeset" series has been one royal fuck-up, guys.
> We've had too many of these kinds of crap issues.

It hasn't been that bad, on a scale of 1 to MD eats my raid array, I'd
say we are barely at a 5.

There have been a lot of small and seemingly easily fixed teething
problems, essentially rewriting the DRM API to provide a new userspace
API and internal interface, porting some drivers partly to the new
interface, while trying to maintain the old ABI/API on top seamlessly
was always going to be an impossible task. It was never going to
magically all just work in -next and land in your tree fully formed
smelling of lavender and elderberries. This is a massive undertaking,
and doing it over a few kernels was the only possible way it could
ever land.

I think the biggest problem we've had is the QA team at Intel got
reorganised or something right when they really needed to be doing
testing on this stuff, so what was sitting in -next never got as much
testing as it had previously, and you can see that in the types of
cases that are getting through. I think the other thing we can learn
is that when Android forks the kernel we should just say this shit is
too hard, let Google go and create a new API and a complete set of
graphics drivers and deal with it in 10 years, because that was
seriously the only other option.

So yes it's a pity other kernel developers are seeing our fallout, but
I've experienced lots of other kernel developers fall out over the
years, and generally the idea is to get this stuff fixed to a
reasonable state before you release a final kernel.

Note I'm not personally involved in the development for atomic
modesetting at all, I'm running the kernels with it where and when I
can, and I trust the developers who work on it are doing as much as
they can to make it work.

That said hopefully Daniel can find a bag of fucks to debug and write
a proper patch, instead of rage quitting the universe, and just git
reset --hard v4.0 drivers/gpu/drm/i915..

Dave.