[Nouveau] CCACHE and VFETCH FAULTs causing lockups

Maarten Maathuis madman2003 at gmail.com
Tue Mar 8 15:06:22 PST 2011


On Tue, Mar 8, 2011 at 10:44 PM, Maarten Maathuis <madman2003 at gmail.com> wrote:
> On Mon, Mar 7, 2011 at 10:22 PM, Ben Skeggs <skeggsb at gmail.com> wrote:
>> On Mon, 2011-03-07 at 21:51 +0000, Maarten Maathuis wrote:
>>> On Sun, Mar 6, 2011 at 2:24 PM, Ben Skeggs <skeggsb at gmail.com> wrote:
>>> >
>>> >
>>> > Sent from my iPhone
>>> >
>>> > On 07/03/2011, at 0:03, Maarten Maathuis <madman2003 at gmail.com> wrote:
>>> >
>>> >> On Sun, Mar 6, 2011 at 1:44 PM, Ben Skeggs <skeggsb at gmail.com> wrote:
>>> >>> Sorry for the top posting, it's late and typing from my phone in bed lol.
>>> >>>
>>> >>> Just wanted to see if you had an update? And, this is NV86 I guess?
>>> >>>
>>> >>> Ben.
>>> >>>
>>> >>> Sent from my iPhone
>>> >>>
>>> >>> On 02/03/2011, at 8:20, Maarten Maathuis <madman2003 at gmail.com> wrote:
>>> >>>
>>> >>>> On Tue, Mar 1, 2011 at 9:51 PM, Ben Skeggs <bskeggs at redhat.com> wrote:
>>> >>>>> On Tue, 2011-03-01 at 21:08 +0000, Maarten Maathuis wrote:
>>> >>>>>
>>> >>>>>> Those come after 15-30 minutes of running warzone2100. I haven't
>>> >>>>>> played any games for a while, so I have no idea how long this has
>>> >>>>>> been going on.
>>> >>>>>> I also got a TRAP_CCACHE on channel 2 a little while ago; it takes
>>> >>>>>> much longer to trigger (a few hours). I'm using today's "nouveau
>>> >>>>>> kernel" git.
>>> >>>>> You're not the first person to have reported this, FWIW; personally, I
>>> >>>>> haven't seen it yet.
>>> >>>>>
>>> >>>>>>
>>> >>>>>> I'm guessing something is being unmapped too early or without reason,
>>> >>>>>> or some cache is stale. But it isn't obvious what exactly it is.
>>> >>>>>>
>>> >>>>>> Because I don't remember having these lockups before, I'm inclined to
>>> >>>>>> guess that this commit is involved:
>>> >>>>>> http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=6330d8f5ecc4a19fd2ad3c7fa128b2f4c2ce3360
>>> >>>>>>
>>> >>>>>> Any ideas?
>>> >>>>> Not really.  If this commit *is* the cause, the problem is still
>>> >>>>> somewhere else.  That commit just makes sure PTEs are marked invalid, so
>>> >>>>> if it's causing your faults, then previously the GPU would still have
>>> >>>>> been reading/writing invalid data.
>>> >>>>>
>>> >>>>> Plus, I expect you should probably have seen a VM fault..
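
(Side note for anyone following along: the difference Ben describes can be
pictured with a tiny toy model like the one below. This is not the actual
nouveau VM code, just an illustration with made-up names: clearing the
valid bit on unmap turns a silent stale access into a visible fault.)

    #include <stdint.h>
    #include <stdio.h>

    #define PTE_VALID 0x1u

    struct toy_pte { uint64_t phys; uint32_t flags; };

    /* Old behaviour: entry left untouched, so the GPU could still
     * read/write through the stale mapping without complaint. */
    static void unmap_stale(struct toy_pte *pte) { (void)pte; }

    /* New behaviour (what the commit above ensures): mark the PTE invalid. */
    static void unmap_invalidate(struct toy_pte *pte)
    {
        pte->flags &= ~PTE_VALID;
    }

    static void access_page(const struct toy_pte *pte)
    {
        if (pte->flags & PTE_VALID)
            printf("access ok, phys=0x%llx (possibly stale!)\n",
                   (unsigned long long)pte->phys);
        else
            printf("VM fault: PTE invalid\n");
    }

    int main(void)
    {
        struct toy_pte a = { 0x1000, PTE_VALID };
        struct toy_pte b = { 0x2000, PTE_VALID };

        unmap_stale(&a);       /* old: access after unmap goes through */
        unmap_invalidate(&b);  /* new: access after unmap faults */
        access_page(&a);
        access_page(&b);
        return 0;
    }
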
>>> >>>>
>>> >>>> So these faults are just generic errors? Unrelated to page faults?
>>> >>>>
>>> >>>>>
>>> >>>>> Ben.
>>> >>>>>>
>>> >>>>>> Maarten.
>>> >>>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Far away from the primal instinct, the song seems to fade away, the
>>> >>>> river get wider between your thoughts and the things we do and say.
>>> >>>> _______________________________________________
>>> >>>> Nouveau mailing list
>>> >>>> Nouveau at lists.freedesktop.org
>>> >>>> http://lists.freedesktop.org/mailman/listinfo/nouveau
>>> >>>
>>> >>
>>> >> No, this is NV96. The revert definitely helps, but no luck so far in
>>> >> finding a plausible cause for the problem.
>>> > Hey,
>>> >
>>> > Ok. Hmm. I thought you had NV86 for some reason! It's a long shot and I'm not entirely convinced it'll help at all, but can you switch the graph.tlb_flush pointer to the nv86 version and see if anything changes?
>>>
>>> I used to have an NV86, but it died more than a year ago in the typical
>>> way for that generation of card, due to thermal issues I guess (it was
>>> a passively cooled card). I haven't tried using the nv86 tlb flush yet.
>>> Out of curiosity, is this something NVIDIA does (a lot) on nv86?
>> Yes, NVIDIA do it on pretty much every card I've looked at traces for;
>> we've never seen any need for it on other chipsets as of yet, however.
>> Originally it looked like NVIDIA did this on all pre-NVA3 cards, but a
>> trace of my T510 with recent drivers shows that they do it on NVA3+ now
>> too.
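
(To make sure I understand the suggestion: with a per-chipset hook, forcing
the nv86 variant is just a matter of assigning a different function pointer
at init time. A rough, self-contained sketch with made-up names, not the
actual driver code:)

    #include <stdio.h>

    struct toy_pgraph {
        void (*tlb_flush)(void);
    };

    static void default_tlb_flush(void)
    {
        printf("default tlb flush\n");
    }

    static void nv86_style_tlb_flush(void)
    {
        printf("nv86-style tlb flush\n");
    }

    static void toy_pgraph_init(struct toy_pgraph *pgraph, int chipset)
    {
        /* Normally only the affected chipset gets the special variant... */
        pgraph->tlb_flush = (chipset == 0x86) ? nv86_style_tlb_flush
                                              : default_tlb_flush;

        /* ...and the experiment is simply to force it unconditionally: */
        pgraph->tlb_flush = nv86_style_tlb_flush;
    }

    int main(void)
    {
        struct toy_pgraph pgraph;

        toy_pgraph_init(&pgraph, 0x96);  /* NV96 in this case */
        pgraph.tlb_flush();
        return 0;
    }
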
>>
>>>
>>> >
>>> > The *other* possible thing is that the ttm delayed delete queue is causing multiple tlb flushes to happen at the same time. I'll add locking for that in the morning; that was a complete oversight.
>>>
>>> I've had no lockups since you added the spinlocks, so maybe that was
>>> it. Time will tell.
>> *crosses fingers*
>>
>> Ben.
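
(For reference, the race Ben describes and the spinlock fix look roughly
like the toy sketch below. Illustrative only: pthreads stand in for the
kernel primitives, and all names are made up.)

    #include <pthread.h>
    #include <stdio.h>

    static pthread_spinlock_t flush_lock;

    static void hw_tlb_flush(const char *who)
    {
        /* The poke-the-hardware sequence would live here; it must not be
         * re-entered while a previous flush is still in flight. */
        printf("tlb flush requested by %s\n", who);
    }

    static void tlb_flush(const char *who)
    {
        pthread_spin_lock(&flush_lock);
        hw_tlb_flush(who);
        pthread_spin_unlock(&flush_lock);
    }

    static void *delayed_delete_worker(void *arg)
    {
        (void)arg;
        tlb_flush("delayed delete queue");
        return NULL;
    }

    int main(void)
    {
        pthread_t worker;

        pthread_spin_init(&flush_lock, PTHREAD_PROCESS_PRIVATE);
        pthread_create(&worker, NULL, delayed_delete_worker, NULL);
        tlb_flush("normal unmap path");  /* without the lock, this could
                                            overlap with the worker */
        pthread_join(worker, NULL);
        pthread_spin_destroy(&flush_lock);
        return 0;
    }
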
>>>
>>> >
>>> > Ben.
>>> >
>>> >>
>>> >> --
>>> >> Far away from the primal instinct, the song seems to fade away, the
>>> >> river get wider between your thoughts and the things we do and say.
>>> >
>>>
>>>
>>>
>>
>>
>>
>
> It went alright for quite some time (much longer than before), but I
> got another one. I should note this happened at the exact moment X
> rendered something over my fullscreen OpenGL app. So it does smell a
> bit fishy. I'll take another look at possible causes myself.
>
> Mar  8 23:30:58 madman kernel: [25325.644794] [drm] nouveau
> 0000:01:00.0: PGRAPH - TRAP_CCACHE FAULT
> Mar  8 23:30:58 madman kernel: [25325.644815] [drm] nouveau
> 0000:01:00.0: PGRAPH - TRAP_CCACHE 00000080 00000000 00000000 00000000
> 00000000 00000004 00000000
> Mar  8 23:30:58 madman kernel: [25325.644829] [drm] nouveau
> 0000:01:00.0: PGRAPH - TRAP_MP - TP1: Unhandled ustatus 0x00020000
> Mar  8 23:30:58 madman kernel: [25325.644836] [drm] nouveau
> 0000:01:00.0: PGRAPH - TRAP
> Mar  8 23:30:58 madman kernel: [25325.644848] [drm] nouveau
> 0000:01:00.0: PGRAPH - ch 2 (0x0000840000) subc 5 class 0x8297 mthd
> 0x0f04 data 0x00000000
> Mar  8 23:30:58 madman kernel: [25325.644865] [drm] nouveau
> 0000:01:00.0: VM: trapped read at 0x002000f000 on ch 2 [0x00000840]
> PFIFO/PFIFO_READ/SEMAPHORE reason: DMAOBJ_LIMIT
>

An offset just above the 512 MB mark shouldn't be out of bounds on a DMA
object covering the entire VM. I wonder what's going on here.
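
Just to double-check the arithmetic and what DMAOBJ_LIMIT implies, a small
standalone sketch (the limit value here is only a placeholder for "covers
the entire VM", not the real object's limit):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t fault_addr = 0x002000f000ULL;       /* from the kernel log */
        uint64_t half_gib   = 512ULL * 1024 * 1024;  /* 0x20000000 */

        /* How far past the 512 MiB boundary the trapped read landed. */
        printf("fault - 512MiB = 0x%llx bytes\n",
               (unsigned long long)(fault_addr - half_gib));  /* 0xf000 */

        /* Toy base/limit check of the kind DMAOBJ_LIMIT implies. */
        uint64_t dmaobj_base  = 0;
        uint64_t dmaobj_limit = (1ULL << 40) - 1;    /* placeholder: whole VM */

        if (fault_addr < dmaobj_base || fault_addr > dmaobj_limit)
            printf("out of bounds -> DMAOBJ_LIMIT\n");
        else
            printf("within bounds -> the fault looks bogus for a full-VM object\n");
        return 0;
    }
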

> --
> Far away from the primal instinct, the song seems to fade away, the
> river get wider between your thoughts and the things we do and say.
>



-- 
Far away from the primal instinct, the song seems to fade away, the
river get wider between your thoughts and the things we do and say.

