[PATCH v4 0/2] Add p2p via dmabuf to habanalabs

Tue Jul 6 14:39:19 UTC 2021

On Tue, Jul 6, 2021 at 4:23 PM Jason Gunthorpe <jgg at ziepe.ca> wrote:
>
> On Tue, Jul 06, 2021 at 12:36:51PM +0200, Daniel Vetter wrote:
>
> > If that means AI companies don't want to open our their hw specs
> > enough to allow that, so be it - all you get in that case is
> > offloading the kernel side  of the stack for convenience, with zero
> > long term prospects to ever make this into a cross vendor subsystem
> > stack that does something useful.
>
> I don't think this is true at all - nouveau is probably the best
> example.
>
> nouveau reverse engineered a userspace stack for one of these devices.
>
> How much further ahead would they have been by now if they had a
> vendor supported, fully featured, open kernel driver to build the
> userspace upon?

There is actually tons of example here, most of the arm socs have
fully open kernel drivers, supported by the vendor (out of tree).

The hard part is the userspace driver and all the things you're
submitting to it. We've had open kernel drivers for mail/qualcomm/...
years before any believable open implementation started existing.
Typing up the memory manager and hw submission queue handling is
comparitively trivial. Generally the kernel driver is also done last,
you bring up the userspace first, often by just directly programming
the hw from userspace. Kernel driver only gets in the way with this
stuff (nouveau is entirely developed as a userspace driver, as the
most extreme example).

This is a bit different for the display side, but nowadays those
drivers are fully in-kernel so they're all open. Well except the
nvidia one, and I've not heard of nvidia working on even an
out-of-tree open display driver, so that won't help the in-tree effort
at all.

Where it would have helped is if this open driver would come with
redistributable firmware, because that is right now the thing making
nouveau reverse-engineering painful enough to be non-feasible. Well
not the reverse-engineering, but the "shipping the result as a working
driver stack".

I don't think the facts on the ground support your claim here, aside
from the practical problem that nvidia is unwilling to even create an
open driver to begin with. So there isn't anything to merge.

> > open up your hw enough for that, I really don't see the point in
> > merging such a driver, it'll be an unmaintainable stack by anyone else
> > who's not having access to those NDA covered specs and patents and
> > everything.
>
> My perspective from RDMA is that the drivers are black boxes. I can
> hack around the interface layers but there is a lot of wild stuff in
> there that can't be understood without access to the HW documentation.

There's shipping gpu drivers with entirely reverse-engineered stacks.
And I don't mean "shipping in fedora" but "shipping in Chrome tablets
sold by OEM partners of Google". So it's very much possible, even if
the vendor is maximally stubborn about things.

> I think only HW that has open specs, like say NVMe, can really be
> properly community oriented. Otherwise we have to work in a community
> partnership with the vendor.

Well sure that's the ideal case, but most vendors in the accel space
arent interested actual partnership with the wider community. It's
"merge this kernel driver and have no further demands about anything
else". Well there are some who are on board, but it does take pretty
enormous amounts of coercion.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch