[Nouveau] CCACHE and VFETCH FAULTs causing lockups

Maarten Maathuis madman2003 at gmail.com
Tue Mar 8 15:06:22 PST 2011


On Tue, Mar 8, 2011 at 10:44 PM, Maarten Maathuis <madman2003 at gmail.com> wrote:
> On Mon, Mar 7, 2011 at 10:22 PM, Ben Skeggs <skeggsb at gmail.com> wrote:
>> On Mon, 2011-03-07 at 21:51 +0000, Maarten Maathuis wrote:
>>> On Sun, Mar 6, 2011 at 2:24 PM, Ben Skeggs <skeggsb at gmail.com> wrote:
>>> >
>>> >
>>> > Sent from my iPhone
>>> >
>>> > On 07/03/2011, at 0:03, Maarten Maathuis <madman2003 at gmail.com> wrote:
>>> >
>>> >> On Sun, Mar 6, 2011 at 1:44 PM, Ben Skeggs <skeggsb at gmail.com> wrote:
>>> >>> Sorry for the top posting, it's late and typing from my phone in bed lol.
>>> >>>
>>> >>> Just wanted to see if you had an update? And, this is NV86 I guess?
>>> >>>
>>> >>> Ben.
>>> >>>
>>> >>> Sent from my iPhone
>>> >>>
>>> >>> On 02/03/2011, at 8:20, Maarten Maathuis <madman2003 at gmail.com> wrote:
>>> >>>
>>> >>>> On Tue, Mar 1, 2011 at 9:51 PM, Ben Skeggs <bskeggs at redhat.com> wrote:
>>> >>>>> On Tue, 2011-03-01 at 21:08 +0000, Maarten Maathuis wrote:
>>> >>>>>
>>> >>>>>> Those come after 15-30 minutes of running warzone2100. I haven't
>>> >>>>>> played any games for a while, so I have no idea how long this has
>>> >>>>>> been going on.
>>> >>>>>> I also got a TRAP_CCACHE on channel 2 a little while ago; it takes
>>> >>>>>> much longer to trigger (a few hours). I'm using today's "nouveau
>>> >>>>>> kernel" git.
>>> >>>>> You're not the first person to have reported this, FWIW; personally, I
>>> >>>>> haven't seen it yet.
>>> >>>>>
>>> >>>>>>
>>> >>>>>> I'm guessing something is being unmapped too early or without reason,
>>> >>>>>> or some cache is stale. But it isn't obvious what exactly it is.
>>> >>>>>>
>>> >>>>>> Because I don't remember having these lockups before, I'm inclined to
>>> >>>>>> guess that this commit is involved:
>>> >>>>>> http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=6330d8f5ecc4a19fd2ad3c7fa128b2f4c2ce3360
>>> >>>>>>
>>> >>>>>> Any ideas?
>>> >>>>> Not really.  If this commit *is* the cause, the problem is still
>>> >>>>> somewhere else.  That commit just makes sure PTEs are marked invalid, so
>>> >>>>> if it's causing your faults, then previously the GPU would still have
>>> >>>>> been reading/writing invalid data.
>>> >>>>>
>>> >>>>> Plus, I expect you should probably have seen a VM fault..
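
(Side note for anyone following along: the difference Ben describes can be
pictured with a tiny toy model like the one below. This is not the actual
nouveau VM code, just an illustration with made-up names: clearing the
valid bit on unmap turns a silent stale access into a visible fault.)

    #include <stdint.h>
    #include <stdio.h>

    #define PTE_VALID 0x1u

    struct toy_pte { uint64_t phys; uint32_t flags; };

    /* Old behaviour: entry left untouched, so the GPU could still
     * read/write through the stale mapping without complaint. */
    static void unmap_stale(struct toy_pte *pte) { (void)pte; }

    /* New behaviour (what the commit above ensures): mark the PTE invalid. */
    static void unmap_invalidate(struct toy_pte *pte)
    {
        pte->flags &= ~PTE_VALID;
    }

    static void access_page(const struct toy_pte *pte)
    {
        if (pte->flags & PTE_VALID)
            printf("access ok, phys=0x%llx (possibly stale!)\n",
                   (unsigned long long)pte->phys);
        else
            printf("VM fault: PTE invalid\n");
    }

    int main(void)
    {
        struct toy_pte a = { 0x1000, PTE_VALID };
        struct toy_pte b = { 0x2000, PTE_VALID };

        unmap_stale(&a);       /* old: access after unmap goes through */
        unmap_invalidate(&b);  /* new: access after unmap faults */
        access_page(&a);
        access_page(&b);
        return 0;
    }
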
>>> >>>>
>>> >>>> So these faults are just generic errors? Unrelated to page faults?
>>> >>>>
>>> >>>>>
>>> >>>>> Ben.
>>> >>>>>>
>>> >>>>>> Maarten.
>>> >>>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Far away from the primal instinct, the song seems to fade away, the
>>> >>>> river get wider between your thoughts and the things we do and say.
>>> >>>> _______________________________________________
>>> >>>> Nouveau mailing list
>>> >>>> Nouveau at lists.freedesktop.org
>>> >>>> http://lists.freedesktop.org/mailman/listinfo/nouveau
>>> >>>
>>> >>
>>> >> No, this is NV96. The revert definitely helps, but no luck so far in
>>> >> finding a plausible cause for the problem.
>>> > Hey,
>>> >
>>> > Ok. Hmm. I thought you had NV86 for some reason! It's a long shot and I'm not entirely convinced it'll help at all, but can you switch the graph.tlb_flush pointer to the nv86 version and see if anything changes?
>>>
>>> I used to have an NV86, but it died more than a year ago in the typical
>>> way for that generation of card, due to thermal issues I guess (it was
>>> a passively cooled card). I haven't tried using the nv86 tlb flush yet.
>>> Out of curiosity, is this something NVIDIA does (a lot) on nv86?
>> Yes, NVIDIA do it on pretty much every card I've looked at traces for;
>> we've never seen any need for it on other chipsets as of yet, however.
>> Originally it looked like NVIDIA did this on all pre-NVA3 cards, but a
>> trace of my T510 with recent drivers shows that they do it on NVA3+ now
>> too.
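
(To make sure I understand the suggestion: with a per-chipset hook, forcing
the nv86 variant is just a matter of assigning a different function pointer
at init time. A rough, self-contained sketch with made-up names, not the
actual driver code:)

    #include <stdio.h>

    struct toy_pgraph {
        void (*tlb_flush)(void);
    };

    static void default_tlb_flush(void)
    {
        printf("default tlb flush\n");
    }

    static void nv86_style_tlb_flush(void)
    {
        printf("nv86-style tlb flush\n");
    }

    static void toy_pgraph_init(struct toy_pgraph *pgraph, int chipset)
    {
        /* Normally only the affected chipset gets the special variant... */
        pgraph->tlb_flush = (chipset == 0x86) ? nv86_style_tlb_flush
                                              : default_tlb_flush;

        /* ...and the experiment is simply to force it unconditionally: */
        pgraph->tlb_flush = nv86_style_tlb_flush;
    }

    int main(void)
    {
        struct toy_pgraph pgraph;

        toy_pgraph_init(&pgraph, 0x96);  /* NV96 in this case */
        pgraph.tlb_flush();
        return 0;
    }
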
>>
>>>
>>> >
>>> > The *other* possible thing is that the ttm delayed delete queue is causing multiple tlb flushes to happen at the same time. I'll add locking for that in the morning; that was a complete oversight.
>>>
>>> I've had no lockups since you added the spinlocks, so maybe that was
>>> it. Time will tell.
>> *crosses fingers*
>>
>> Ben.
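
(For reference, the race Ben describes and the spinlock fix look roughly
like the toy sketch below. Illustrative only: pthreads stand in for the
kernel primitives, and all names are made up.)

    #include <pthread.h>
    #include <stdio.h>

    static pthread_spinlock_t flush_lock;

    static void hw_tlb_flush(const char *who)
    {
        /* The poke-the-hardware sequence would live here; it must not be
         * re-entered while a previous flush is still in flight. */
        printf("tlb flush requested by %s\n", who);
    }

    static void tlb_flush(const char *who)
    {
        pthread_spin_lock(&flush_lock);
        hw_tlb_flush(who);
        pthread_spin_unlock(&flush_lock);
    }

    static void *delayed_delete_worker(void *arg)
    {
        (void)arg;
        tlb_flush("delayed delete queue");
        return NULL;
    }

    int main(void)
    {
        pthread_t worker;

        pthread_spin_init(&flush_lock, PTHREAD_PROCESS_PRIVATE);
        pthread_create(&worker, NULL, delayed_delete_worker, NULL);
        tlb_flush("normal unmap path");  /* without the lock, this could
                                            overlap with the worker */
        pthread_join(worker, NULL);
        pthread_spin_destroy(&flush_lock);
        return 0;
    }
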
>>>
>>> >
>>> > Ben.
>>> >
>>> >>
>>> >> --
>>> >> Far away from the primal instinct, the song seems to fade away, the
>>> >> river get wider between your thoughts and the things we do and say.
>>> >
>>>
>>>
>>>
>>
>>
>>
>
> It went alright for quite some time (much longer than before), but I
> got another one. I should note this happened at the exact moment X
> rendered something over my fullscreen OpenGL app. So it does smell a
> bit fishy. I'll take another look at possible causes myself.
>
> Mar  8 23:30:58 madman kernel: [25325.644794] [drm] nouveau
> 0000:01:00.0: PGRAPH - TRAP_CCACHE FAULT
> Mar  8 23:30:58 madman kernel: [25325.644815] [drm] nouveau
> 0000:01:00.0: PGRAPH - TRAP_CCACHE 00000080 00000000 00000000 00000000
> 00000000 00000004 00000000
> Mar  8 23:30:58 madman kernel: [25325.644829] [drm] nouveau
> 0000:01:00.0: PGRAPH - TRAP_MP - TP1: Unhandled ustatus 0x00020000
> Mar  8 23:30:58 madman kernel: [25325.644836] [drm] nouveau
> 0000:01:00.0: PGRAPH - TRAP
> Mar  8 23:30:58 madman kernel: [25325.644848] [drm] nouveau
> 0000:01:00.0: PGRAPH - ch 2 (0x0000840000) subc 5 class 0x8297 mthd
> 0x0f04 data 0x00000000
> Mar  8 23:30:58 madman kernel: [25325.644865] [drm] nouveau
> 0000:01:00.0: VM: trapped read at 0x002000f000 on ch 2 [0x00000840]
> PFIFO/PFIFO_READ/SEMAPHORE reason: DMAOBJ_LIMIT
>

An offset just above the 512 MB mark shouldn't be out of bounds on a DMA
object covering the entire VM. I wonder what's going on here.
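
Just to double-check the arithmetic and what DMAOBJ_LIMIT implies, a small
standalone sketch (the limit value here is only a placeholder for "covers
the entire VM", not the real object's limit):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t fault_addr = 0x002000f000ULL;       /* from the kernel log */
        uint64_t half_gib   = 512ULL * 1024 * 1024;  /* 0x20000000 */

        /* How far past the 512 MiB boundary the trapped read landed. */
        printf("fault - 512MiB = 0x%llx bytes\n",
               (unsigned long long)(fault_addr - half_gib));  /* 0xf000 */

        /* Toy base/limit check of the kind DMAOBJ_LIMIT implies. */
        uint64_t dmaobj_base  = 0;
        uint64_t dmaobj_limit = (1ULL << 40) - 1;    /* placeholder: whole VM */

        if (fault_addr < dmaobj_base || fault_addr > dmaobj_limit)
            printf("out of bounds -> DMAOBJ_LIMIT\n");
        else
            printf("within bounds -> the fault looks bogus for a full-VM object\n");
        return 0;
    }
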

> --
> Far away from the primal instinct, the song seems to fade away, the
> river get wider between your thoughts and the things we do and say.
>



-- 
Far away from the primal instinct, the song seems to fade away, the
river get wider between your thoughts and the things we do and say.

