[Nouveau] GT710 and Nouveau on ARM/ARM64

Dave Stevenson dave.stevenson at raspberrypi.com
Tue Nov 3 18:34:25 UTC 2020


On Tue, 3 Nov 2020 at 18:25, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
>
> On Tue, Nov 3, 2020 at 1:08 PM Dave Stevenson
> <dave.stevenson at raspberrypi.com> wrote:
> >
> > Hi Ilia
> > Thanks again for the reply.
> >
> > On Wed, 28 Oct 2020 at 14:59, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
> > >
> > > On Wed, Oct 28, 2020 at 10:20 AM Dave Stevenson
> > > <dave.stevenson at raspberrypi.com> wrote:
> > > >
> > > > Hi Ilia
> > > >
> > > > Thanks for taking the time to reply.
> > > >
> > > > On Wed, 28 Oct 2020 at 14:10, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
> > > > >
> > > > > The most common issue on arm is that the pci memory window is too narrow to allocate all the BARs. Can you see if there are messages in the kernel to that effect?
> > > >
> > > > All the BAR allocations seem to succeed except for the IO one.
> > > > AIUI I/O is deprecated, but is it still used on these cards?
> > >
> > > I must admit I was ignorant of the fact that the IO ports were treated
> > > as a BAR, but it makes a lot of sense.
> > >
> > > One thing does stand out as odd:
> > >
> > > >
> > > > [    1.060851] brcm-pcie fd500000.pcie: host bridge /scb/pcie at 7d500000 ranges:
> > > > [    1.060892] brcm-pcie fd500000.pcie:   No bus range found for
> > > > /scb/pcie at 7d500000, using [bus 00-ff]
> > > > [    1.060975] brcm-pcie fd500000.pcie:      MEM
> > > > 0x0600000000..0x063fffffff -> 0x00c0000000
> > > > [    1.061061] brcm-pcie fd500000.pcie:   IB MEM
> > > > 0x0000000000..0x00ffffffff -> 0x0100000000
> > > > [    1.109943] brcm-pcie fd500000.pcie: link up, 5.0 GT/s PCIe x1 (SSC)
> > > > [    1.110129] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
> > > > [    1.110159] pci_bus 0000:00: root bus resource [bus 00-ff]
> > > > [    1.110187] pci_bus 0000:00: root bus resource [mem
> > > > 0x600000000-0x63fffffff] (bus address [0xc0000000-0xffffffff])
> > > > [    1.110286] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
> > > > [    1.110505] pci 0000:00:00.0: PME# supported from D0 D3hot
> > > > [    1.114095] pci 0000:00:00.0: bridge configuration invalid ([bus
> > > > 00-00]), reconfiguring
> > > > [    1.114343] pci 0000:01:00.0: [10de:128b] type 00 class 0x030000
> > > > [    1.114404] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
> > > > [    1.114456] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
> > > > 64bit pref]
> > > > [    1.114510] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
> > > > 64bit pref]
> > > > [    1.114551] pci 0000:01:00.0: reg 0x24: [io  0x0000-0x007f]
> > > > [    1.114590] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
> > > > [    1.114853] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth,
> > > > limited by 5.0 GT/s PCIe x1 link at 0000:00:00.0 (capable of 63.008
> > > > Gb/s with 8.0 GT/s PCIe x8 link)
> > > > [    1.115022] pci 0000:01:00.0: vgaarb: VGA device added:
> > > > decodes=io+mem,owns=none,locks=none
> > > > [    1.115125] pci 0000:01:00.1: [10de:0e0f] type 00 class 0x040300
> > > > [    1.115184] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
> > > > [    1.119065] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
> > > > [    1.119120] pci 0000:00:00.0: BAR 9: assigned [mem
> > > > 0x600000000-0x60bffffff 64bit pref]
> > > > [    1.119151] pci 0000:00:00.0: BAR 8: assigned [mem 0x60c000000-0x60d7fffff]
> > >
> > > This is your brcm-pcie device.
> > >
> > > > [    1.119183] pci 0000:01:00.0: BAR 1: assigned [mem
> > > > 0x600000000-0x607ffffff 64bit pref]
> > > > [    1.119235] pci 0000:01:00.0: BAR 3: assigned [mem
> > > > 0x608000000-0x609ffffff 64bit pref]
> > > > [    1.119285] pci 0000:01:00.0: BAR 0: assigned [mem 0x60c000000-0x60cffffff]
> > >
> > > And this is the NVIDIA device. Note that these memory windows are
> > > identical (or at least overlapping). I must admit almost complete
> > > ignorance of PCIe and whether this is OK, but it seems sketchy at
> > > first glance. A quick eyeballing of my x86 system suggests that all
> > > PCIe devices get non-overlapping windows. OTOH there are messages
> > > further up about some sort of remapping going on, so perhaps it's OK?
> > > But two things on the same bus still shouldn't have the same addresses
> > > allocated, based on my (limited) understanding.
> >
> > I've raised this with colleagues and it seems that this is normal.
> > The PCI bridge reports the window through which devices can be mapped,
> > and all devices have to fit within that. Pass as to whether that is a
> > quirk of ARM or this particular bridge.
> >
> > I do note that on my x86 systems device 0000:00:00.0 is reported by
> > lspci as a "Host bridge" instead of a "PCI bridge".
> > On an Ubuntu VM I've got running, I do get
> > [    0.487249] PCI host bridge to bus 0000:00
> > [    0.487252] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
> > [    0.487254] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
> > [    0.487256] pci_bus 0000:00: root bus resource [mem
> > 0x000a0000-0x000bffff window]
> > [    0.487258] pci_bus 0000:00: root bus resource [mem
> > 0xe0000000-0xfdffffff window]
> > [    0.487260] pci_bus 0000:00: root bus resource [bus 00-ff]
> > and all device allocations are from within those ranges, so I'm not
> > convinced it's that different.
> >
> > > In case it's an option, could you "unplug" the NIC (not just not load
> > > its driver, but make it not appear at all on the PCI bus)?
> >
> > NIC? The network interface is totally separate. Or is this another
> > reuse of a TLA?
> >
> > Unplugging the GPU means the PCI bus reports as being down and I get
> > no output at all from lspci.
>
> Oh duh. I thought brcm-pcie was a broadcom NIC. Apparently it's the
> whole bus - can't unplug that! Also explains the "conflict" which
> makes a lot more sense if you (correctly) understand that other
> "device" is the bus itself. Apologies for the misinterpretation :(
> [And in hindsight, RPi runs on a Broadcom SoC, so ... I should have
> remembered that. In my mind they just make network stuff, will try to
> get that updated.]

Phew, I thought I was going crazy :-)
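
For anyone following the thread later, the "overlap" in the log above can be
checked mechanically. The sketch below is only an illustration: the address
ranges are copied from the quoted dmesg output, while the dictionary and helper
names (BRIDGE_WINDOWS, GPU_BARS, contains, overlaps, window_for) are made up
for this example and do not come from the kernel or from nouveau. It assumes
the usual PCI arrangement in which a bridge's BAR 8/BAR 9 windows are meant to
enclose the BARs of the devices behind it, so nesting is expected and only
overlap between sibling BARs would be a real conflict.

```python
# Inclusive (start, end) ranges copied from the dmesg output quoted above.
# Windows assigned to the bridge at 0000:00:00.0.
BRIDGE_WINDOWS = {
    "BAR 9 (64bit pref)": (0x600000000, 0x60BFFFFFF),
    "BAR 8": (0x60C000000, 0x60D7FFFFF),
}

# BARs assigned to the GPU at 0000:01:00.0.
GPU_BARS = {
    "BAR 1": (0x600000000, 0x607FFFFFF),
    "BAR 3": (0x608000000, 0x609FFFFFF),
    "BAR 0": (0x60C000000, 0x60CFFFFFF),
}


def contains(outer, inner):
    """True if the inclusive range `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def overlaps(a, b):
    """True if the inclusive ranges `a` and `b` share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


def window_for(bar):
    """Return the name of the bridge window enclosing `bar`, or None."""
    for name, window in BRIDGE_WINDOWS.items():
        if contains(window, bar):
            return name
    return None


if __name__ == "__main__":
    # Each GPU BAR should sit inside exactly one bridge window.
    for name, (lo, hi) in GPU_BARS.items():
        print(f"{name}: {lo:#x}-{hi:#x} -> {window_for((lo, hi))}")

    # Sibling BARs on the same device must not overlap each other.
    names = list(GPU_BARS)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            print(f"{a} vs {b}: overlap={overlaps(GPU_BARS[a], GPU_BARS[b])}")
```

With the ranges from this log, every GPU BAR falls inside one of the two bridge
windows and no two GPU BARs share addresses, which matches the explanation that
the bridge entries describe the window and not a competing device.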

> > > > >> I've tried it so far with a GT710 board [1] and ARM64. It's blowing up
> > > > >> in the memset of nvkm_instobj_new whilst initialising the BAR
> > > > >> subdevice [2], having gone through the "No such luck" path in
> > > > >> nvkm_mmu_ptc_get [3].
> > > > >>
> > > > >> Taking the naive approach of simply removing the memset, I get through
> > > > >> initialising all the subdevices, but again die in a location I
> > > > >> currently haven't pinpointed. The last logging messages are:
> > >
> > > That's not a winning strategy, I'm afraid. You need to figure out why
> > > the memset is blowing up. The simplest explanation is "it's trying to
> > > write to an I/O resource but that resource wasn't allocated", hence my
> > > question about BARs. But something's not mapped, or mapped in the
> > > wrong way, or whatever. If you can't write to it at that point in
> > > time, you won't be able to write to it later either. I would focus on
> > > that.
> >
> > I did say it was the naive approach :-)
> > I was trying to gauge how much effort was going to be needed to get
> > this going. Was it going to blow up in 1, 10, or 100 places? It feels
> > like it is only a couple of things that are wrong, so there is hope.
> >
> > Slightly annoyingly something more urgent has come up and I need to
> > shelve my experimentation for now, but thanks for the pointers. At
> > least I have some idea of where to start looking when time allows.
>
> When/if you do get back to it, you might consider posting a more
> complete log without getting rid of the memset, perhaps the nature of
> the blow-up will make the underlying problem more apparent, or make
> further investigation paths apparent.

Will do, thanks.

  Dave

