[PATCH v4 0/2] Add p2p via dmabuf to habanalabs

Tue Jul 6 17:35:55 UTC 2021

On Tue, Jul 6, 2021 at 6:29 PM Jason Gunthorpe <jgg at ziepe.ca> wrote:
>
> On Tue, Jul 06, 2021 at 05:49:01PM +0200, Daniel Vetter wrote:
>
> > The other thing to keep in mind is that one of these drivers supports
> > 25 years of product generations, and the other one doesn't.
>
> Sure, but that is the point, isn't it? To have an actually useful
> thing you need all of this mess
>
> > > My argument is that an in-tree open kernel driver is a big help to
> > > reverse engineering an open userspace. Having the vendors
> > > collaboration to build that monstrous thing can only help the end goal
> > > of an end to end open stack.
> >
> > Not sure where this got lost, but we're totally fine with vendors
> > using the upstream driver together with their closed stack. And most
> > of the drivers we do have in upstream are actually, at least in parts,
> > supported by the vendor. E.g. if you'd have looked the drm/arm driver
> > you picked is actually 100% written by ARM engineers. So kinda
> > unfitting example.
>
> So the argument with Habana really boils down to how much do they need
> to show in the open source space to get a kernel driver? You want to
> see the ISA or compiler at least?

Yup. We dont care about any of the fancy pieces you build on top, nor
does the compiler need to be the optimizing one. Just something that's
good enough to drive the hw in some demons to see how it works and all
that. Generally that's also not that hard to reverse engineer, if
someone is bored enough, the real fancy stuff tends to be in how you
optimize the generated code. And make it fit into the higher levels
properly.

> That at least doesn't seem "extreme" to me.
>
> > > For instance a vendor with an in-tree driver has a strong incentive to
> > > sort out their FW licensing issues so it can be redistributed.
> >
> > Nvidia has been claiming to try and sort out the FW problem for years.
> > They even managed to release a few things, but I think the last one is
> > 2-3 years late now. Partially the reason is that there don't have a
> > stable api between the firmware and driver, it's all internal from the
> > same source tree, and they don't really want to change that.
>
> Right, companies have no incentive to work in a sane way if they have
> their own parallel world. I think drawing them part by part into the
> standard open workflows and expectations is actually helpful to
> everyone.

Well we do try to get them on board part-by-part generally starting
with the kernel and ending with a proper compiler instead of the usual
llvm hack job, but for whatever reasons they really like their
in-house stuff, see below for what I mean.

> > > > I don't think the facts on the ground support your claim here, aside
> > > > from the practical problem that nvidia is unwilling to even create an
> > > > open driver to begin with. So there isn't anything to merge.
> > >
> > > The internet tells me there is nvgpu, it doesn't seem to have helped.
> >
> > Not sure which one you mean, but every once in a while they open up a
> > few headers, or a few programming specs, or a small driver somewhere
> > for a very specific thing, and then it dies again or gets obfuscated
> > for the next platform, or just never updated. I've never seen anything
> > that comes remotely to something complete, aside from tegra socs,
> > which are fully supported in upstream afaik.
>
> I understand nvgpu is the tegra driver that people actualy
> use. nouveau may have good tegra support but is it used in any actual
> commercial product?

I think it was almost the case. Afaik they still have their internal
userspace stack working on top of nvidia, at least last year someone
fixed up a bunch of issues in the tegra+nouveau combo to enable format
modifiers properly across the board. But also nvidia is never going to
sell you that as the officially supported thing, unless your ask comes
back with enormous amounts of sold hardware.

And it's not just nvidia, it's pretty much everyone. Like a soc
company I don't want to know started collaborating with upstream and
the reverse-engineered mesa team on a kernel driver, seems to work
pretty well for current hardware. But for the next generation they
decided it's going to be again only their in-house tree that
completele ignores drivers/gpu/drm, and also tosses all the
foundational work they helped build on the userspace side. And this is
consistent across all companies, over the last 20 years I know of
(often non-public) stories across every single company where they
decided that all the time invested into community/upstream
collaboration isn't useful anymore, we go all vendor solo for the next
one.

Most of those you luckily don't hear about anymore, all it results in
the upstream driver being 1-2 years late or so. But even the good ones
where we collaborate well can't seem to help themselves and want to
throw it all away every few years.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch