[Nouveau] [REGRESSION] nouveau: Crash in gk104_fifo_intr_runlist()

Tue Aug 11 21:19:48 PDT 2015

Right, that 0xbad0da00 is indicative of something being offline that
should not be at that time. I have sent the revert patch. Thanks Eric
for reporting this!
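
For anyone following along, here is a rough userspace sketch of why a
read of 0xbad0da00 ends up as engine 9, plus one possible defensive
check. It assumes the handler takes the lowest set bit of 'mask' as the
engine index, which is consistent with the value Eric reported but is
not quoted from the driver. SKETCH_NUM_ENGINES and all sketch_* names
are hypothetical, and the "top 12 bits are 0xbad" test is only a
heuristic based on the 0xbad* remark in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical engine count, for illustration only; the driver's real
 * limit is not stated in this thread. */
#define SKETCH_NUM_ENGINES 8

/* Heuristic (assumption): treat a value whose top 12 bits are 0xbad as
 * the result of a failed MMIO read. */
static bool sketch_is_bad_mmio(uint32_t val)
{
	return (val & 0xfff00000u) == 0xbad00000u;
}

/* Index of the lowest set bit. The caller must pass a non-zero mask. */
static unsigned int sketch_lowest_bit(uint32_t mask)
{
	unsigned int bit = 0;

	while (!(mask & 1u)) {
		mask >>= 1;
		bit++;
	}
	return bit;
}

/* Returns the engine index to service, or -1 if the mask is empty,
 * looks like a failed read, or names an engine out of range. */
static int sketch_next_engine(uint32_t mask)
{
	unsigned int engn;

	if (mask == 0 || sketch_is_bad_mmio(mask))
		return -1;

	engn = sketch_lowest_bit(mask);
	return engn < SKETCH_NUM_ENGINES ? (int)engn : -1;
}
```

The low byte of 0xbad0da00 is 0x00 and the next byte is 0xda (binary
11011010), so the lowest set bit is bit 9. With the checks above,
sketch_next_engine(0xbad0da00) returns -1 instead of an index that
would reach an uninitialized waitqueue.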

On Wed, Aug 12, 2015 at 1:00 PM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
> I'm guessing that Optimus is the operative difference, not the
> specific chip. Basically something that can be put to sleep via
> ACPI...
>
> On Tue, Aug 11, 2015 at 11:53 PM, Alexandre Courbot <gnurou at gmail.com> wrote:
>> Sending the revert patch to Dave after receiving his green light for
>> this, and will investigate the issue on my side. I should be able to find a
>> gk107 somewhere...
>>
>> On Wed, Aug 12, 2015 at 12:35 PM, Alexandre Courbot <gnurou at gmail.com> wrote:
>>> Mmm in that case it is probably best to revert that commit for the
>>> time being. It was targeting GM20B (and maybe other Maxwells too) so
>>> reverting it should not hurt anyone at the moment. I think Ben is on
>>> holiday for now; is there anyone else who can send a pull request to
>>> Dave Airlie for this? We don't want 4.2 to ship with a crash every
>>> other reboot...
>>>
>>> On Wed, Aug 12, 2015 at 10:01 AM, Eric Biggers <ebiggers3 at gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I think I've done about 10 reboots with the commit reverted and I never
>>>> experienced the crash.  But with 4.2.0-rc6 I get the crash on about every
>>>> other reboot.
>>>>
>>>> Probably relevant: the computer on which the crash occurs has two GPUs (one
>>>> Intel and one Nvidia).  The Intel one is actually being used, whereas I
>>>> presume the Nvidia one is being automatically disabled shortly after boot,
>>>> perhaps when the crash occurs...
>>>>
>>>> Eric
>>>>
>>>> On Mon, Aug 10, 2015 at 11:28 PM, Alexandre Courbot <gnurou at gmail.com> wrote:
>>>>>
>>>>> Indeed, and I am actually surprised to see one here. I will
>>>>> double-check that patch.
>>>>>
>>>>> Eric, would you be able to give an estimate of the repro rate for this
>>>>> issue? More testing with and without the patch would be welcome; it'd
>>>>> be good to know whether it is actually the culprit or not.
>>>>>
>>>>> On Mon, Aug 10, 2015 at 2:28 AM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
>>>>> > Alexandre, could you take a look? 0xbad* generally comes from bad mmio
>>>>> > reads.
>>>>> >
>>>>> > On Aug 9, 2015 1:08 PM, "Eric Biggers" <ebiggers3 at gmail.com> wrote:
>>>>> >>
>>>>> >> Hi,
>>>>> >>
>>>>> >> I am testing Linux v4.2-rc5 and I am sporadically getting crashes
>>>>> >> shortly after startup in gk104_fifo_intr_runlist().  What I've found
>>>>> >> is that the 'mask' value read from offset 0x2a00 comes back as
>>>>> >> '0xbad0da00'.  This causes the 'engn' variable to be assigned the
>>>>> >> value 9, which is invalid; then wake_up() is called on an
>>>>> >> uninitialized waitqueue, which causes the crash.
>>>>> >>
>>>>> >> Reverting commit 1addc12648521d ("drm/nouveau/fifo/gk104: kick channels
>>>>> >> when deactivating them") seemed to make the problem go away, although
>>>>> >> I can't be 100% sure because the problem is sporadic.
>>>>> >>
>>>>> >> Attached is an example of the kernel log up to the crash.
>>>>> >>
>>>>> >> Eric