[Nouveau] nouveau TRAP_M2MF still there on G98

Wed Apr 4 23:03:39 UTC 2018

On Wed, Apr 4, 2018 at 6:58 PM, Adam Borowski <kilobyte at angband.pl> wrote:
> On Wed, Apr 04, 2018 at 03:48:39PM +0300, Māris Nartišs wrote:
>> 2018-04-03 23:00 GMT+03:00, Adam Borowski <kilobyte at angband.pl>:
>> > In commit da5e45e619b3f101420c38b3006a9ae4f3ad19b0
>> >
>> > yet it is still reproducible for me on 4.16-rc7 and 4.16.0, which already
>> > have your fix.  I don't know about earlier versions -- my newer card went
>> > into flames just a few days ago, and I replaced it with a brand new 8400GS (G98)
>> > I happened to have in a dusty closet.  Obviously, I can bisect if that
>> > would be helpful, but the error looks the same thus I'm reporting first.
>>
>> Unfortunately I will not be able to help you, as the patch fixed the issue on
>> my system and thus I have no means to test anything more. My card is
>> G98M [Quadro NVS 160M]. Besides – I'm a geographer not a programmer
>> ;-)
>
> And I'm, it seems, servant of a particular cat, all else being secondary. :p
>
>> Still, your report makes me question the original commit I was fixing
>> (mmu: swap out round for ALIGN). Could you test whether going back to
>> rounddown fixes the problem on your side?
>>
>> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
>> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
>> @@ -1354,7 +1354,7 @@ nvkm_vmm_get_locked(struct nvkm_vmm *vmm, bool getref, bool mapref, bool sparse,
>>
>>                 tail = this->addr + this->size;
>>                 if (vmm->func->page_block && next && next->page != p)
>> -                       tail = ALIGN_DOWN(tail, vmm->func->page_block);
>> +                       tail = rounddown(tail, vmm->func->page_block);
>>
>>                 if (addr <= tail && tail - addr >= size) {
>>                         rb_erase(&this->tree, &vmm->free);
>>
>
> Alas, it did work for a few hours, then a total display freeze:
>
> [29982.011795] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 2 [Xorg[2667]] get 0000037d90 put 000003a2cc ib_get 000001dc ib_put 000001dd state 80004861 (err: INVALID_CMD) push 00704031
> [29982.027959] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 2 [Xorg[2667]] get 000003a2cc put 000003a2cc ib_get 000001dc ib_put 000001f9 state 80000000 (err: INVALID_CMD) push 00406040

These, as I call them, 406040 errors, have been around on Tesla for
ages. We have no idea what leads to them, but generally some kind of
fifo desync appears to follow.

  -ilia
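
For readers following the quoted patch, here is a minimal userspace sketch of the arithmetic difference between the two rounding styles it swaps. The helper names (align_down_mask, round_down_mod, is_pow2, round_down_checked) are hypothetical stand-ins modelled on the kernel's ALIGN_DOWN() and rounddown() macros; they are not the kernel code itself. Whether vmm->func->page_block is a power of two on any given chipset is not stated in this thread, so the sketch only shows when the two approaches can diverge.

```c
#include <assert.h>
#include <stdint.h>

/* Mask-based round-down, modelled on ALIGN_DOWN().
 * Only gives a true multiple of `a` when `a` is a nonzero power of two. */
static uint64_t align_down_mask(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

/* Modulo-based round-down, modelled on rounddown().
 * Gives a multiple of `y` for any nonzero `y`. */
static uint64_t round_down_mod(uint64_t x, uint64_t y)
{
	return x - (x % y);
}

/* True when `a` is a nonzero power of two. */
static int is_pow2(uint64_t a)
{
	return a != 0 && (a & (a - 1)) == 0;
}

/* Hypothetical wrapper: always uses the modulo form, and checks that the
 * mask form agrees whenever the alignment is a power of two. */
static uint64_t round_down_checked(uint64_t x, uint64_t a)
{
	uint64_t r = round_down_mod(x, a);

	if (is_pow2(a))
		assert(r == align_down_mask(x, a));
	return r;
}
```

With a power-of-two alignment such as 0x1000 both helpers return the same value. With an alignment of 12, round_down_mod(100, 12) yields 96 while align_down_mask(100, 12) leaves 100 unchanged, which is not a multiple of 12.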