[Nouveau] [PATCH/TESTING(all hw)/DISCUSSION] FIFO (minor) create and (major) destroy instabilities on nv50+

Tue Jan 5 00:41:21 PST 2010

On Tue, Jan 5, 2010 at 4:20 AM, Ben Skeggs <skeggsb at gmail.com> wrote:
> On Mon, 2010-01-04 at 23:54 +0100, Maarten Maathuis wrote:
>> I forgot to mention that you should run nop from fbcon without X
>> running for reliable lockups.
> Yup, that's what I've been doing.
>
>>
>> On Mon, Jan 4, 2010 at 11:39 PM, Ben Skeggs <skeggsb at gmail.com> wrote:
>> > On Mon, 2010-01-04 at 20:29 +0100, Maarten Maathuis wrote:
>> >> I've narrowed it down further: the "pgraph->fifo_access" bit is still
>> >> just cleanup (register 0x400500 represents pgraph fifo access); the
>> >> rest appears to be needed for the desired effect. The reordering of
>> >> pfifo and pgraph destroy is needed. As usual, feedback is appreciated.
>> > I played a bit yesterday and already have the gr/fifoctx unload
>> > ordering swap queued up, as well as unconditionally waiting on a fence
>> > at channel destroy (not really needed, but it served as a bit of a
>> > cleanup anyway).
>> >
>> > I'll try and look at the rest of the changes.
>> >
> Mmm OK.  The gr/fifoctx swap appears to just achieve a little extra
> delay before we hit the grctx unload; some of the other changes (the
> PGRAPH stuff in fifo channel disable specifically) work around the
> changed ordering.
>
> For an identical effect, add a nice mdelay(50) right before the
> pgraph->fifo_access(dev, false) in nouveau_channel_free().  We have a
> race.
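
[Illustration, not part of the original message] The workaround
described above can be sketched in userspace C. This is only a sketch
under stated assumptions: sketch_channel_free(), the mock_* helpers and
the step log are hypothetical stand-ins, not the real nouveau code, and
the context unload is collapsed into a single call because the thread
does not spell out the exact sequence. It only shows where the suggested
mdelay(50) would sit relative to pgraph->fifo_access(dev, false).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical step log; stands in for real hardware side effects. */
#define MAX_STEPS 8
const char *steps[MAX_STEPS];
int nsteps;

static void record(const char *name)
{
    if (nsteps < MAX_STEPS)
        steps[nsteps++] = name;
}

/* Mock of the kernel's mdelay(); only records that a delay happened. */
static void mock_mdelay(unsigned int ms)
{
    (void)ms;
    record("mdelay");
}

/* Mock of pgraph->fifo_access(dev, enable). */
static void mock_pgraph_fifo_access(int enable)
{
    record(enable ? "fifo_access(true)" : "fifo_access(false)");
}

/* Mock of the gr/fifo context unload, collapsed into one step (assumption). */
static void mock_ctx_unload(void)
{
    record("ctx_unload");
}

/*
 * Hypothetical teardown path. With add_delay set, the delay runs right
 * before PGRAPH fifo access is disabled, as suggested in the thread.
 * Re-enabling fifo access at the end is an assumption of this sketch.
 */
void sketch_channel_free(int add_delay)
{
    nsteps = 0;
    if (add_delay)
        mock_mdelay(50);
    mock_pgraph_fifo_access(0);
    mock_ctx_unload();
    mock_pgraph_fifo_access(1);
}

/* Print the recorded order, one step per line. */
void sketch_dump_steps(void)
{
    for (int i = 0; i < nsteps; i++)
        printf("%d: %s\n", i, steps[i]);
}
```

Calling sketch_channel_free(1) followed by sketch_dump_steps() lists the
delay ahead of the fifo access disable. A fixed delay like this only
makes a race less likely to trigger; it does not remove it.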

So what do you propose as the preferred solution?

>
> Ben.
>> > Ben.
>> >>
>> >> Maarten.
>> >>
>> >> On Sat, Jan 2, 2010 at 4:36 PM, Maarten Maathuis <madman2003 at gmail.com> wrote:
>> >> > Many people using nv50+ hardware are aware of GPU lockups when a fifo
>> >> > closes under certain conditions. Based on an mmio-trace and some trial
>> >> > and error testing I've come up with a patch that improves the
>> >> > situation on my NV96.
>> >> >
>> >> > This patch needs testing on NV50+ hardware and regression testing on
>> >> > older hardware, since I did change some of the common codepaths. This
>> >> > is very much a work in progress, and if you have anything to
>> >> > add/correct, please share it.
>> >> >
>> >> > I've also attached two test apps. One is bitscan-fail from mwk; use
>> >> > it like ./bitscan-fail 0x200 to trigger PGRAPH errors. The other is a
>> >> > modified version that only emits NOPs (method 0x100) and represents
>> >> > the no-error situation.
>> >> >
>> >> > I can run the NOP program in loops of 10000 iterations with no
>> >> > problems (I've done so several times). bitscan-fail sometimes survives
>> >> > 10000 iterations, but can also fail after a few thousand. In
>> >> > comparison, a single run of bitscan-fail could cause a GPU lockup for
>> >> > me in the past.
>> >> >
>> >> > Please try the gallium driver, the test apps, and suspend to RAM.
>> >> > Suspend to RAM isn't 100% reliable yet for me (this was always the
>> >> > case after strange experiments/hammering/etc), but it should not
>> >> > regress. This goes for older hw as well: whatever worked should still
>> >> > work, but I wouldn't expect serious improvements there.
>> >> >
>> >> > As always, feedback is appreciated, especially since this is a touchy subject.
>> >> >
>> >> > Maarten.
>> >> >