[PATCH v4 0/2] Add p2p via dmabuf to habanalabs

Tue Jul 6 15:52:49 UTC 2021

On Tue, Jul 6, 2021 at 4:56 PM Jason Gunthorpe <jgg at ziepe.ca> wrote:
> On Tue, Jul 06, 2021 at 04:09:25PM +0200, Daniel Vetter wrote:
> > Anyway, for anything that works like a gpu accelerator, like 3d accel,
> > or parallel compute accel (aka gpgpu) or spatial compute accel (aka
> > NN/AI) or maybe even fpga accel most of the magic to use the hardware
> > is in this backend compiler, which translates from an IR into whatever
> > your accelerator consumes. That's the part we really care about for
> > modern accelerators because without that defacto the hardware is
> > useless. Generally these chips have full-blown, if special purpose
> > ISA, with register files, spilling, branches, loops and other control
> > flow (sometimes only execution masks on simpler hw).
>
> I don't know if I see it so clearly as you do - at the end of the day
> the user keys in the program in some proprietary (or open!) language
> and and wack of propritary magic transforms it to "make it work".
>
> There are many barriers that prevent someone without the secret
> knowledge from duplicating the end result of a working program. An
> accelerator ISA is certainly one example, but I wouldn't overly focus
> on it as the only blocker.

Well we don't, we do just ask for the full driver stack to make the hw
work. It's just that in the past most vendors choose to leave out the
compiler/ISA from their open stack/specs. Well except nvidia, which
still chooses to leave out everything aside from some very, very
minimal thing around documenting display functionality.

> Like you said below the NVIDIA GPU ISA seems known but the HW is still
> not really useful for other reasons.
>
> Habana seems to have gone the other way, the HW is fully useful but we
> don't have the ISA transformation and other details.

You can actually use nvidia gpus, they're fully functional.

If you install the blobby stack. Which is exactly the same thing as
with habanalabs, plus/minus a few things at the fringes.

In the end it's about drawing the line somewhere, so maybe we should
merge the nvidia glue code that makes their blobby stack work better
with upstream? There's quite a few pieces there, e.g. their display
driver is by design a userspace driver, whereas with kernel
modesetting it needs to be in the kernel to expose the common kms
ioctl interfaces, so they've built up a glue layer to forward
everything to userspace and back. On windows it works because there
kernel code can have growing stacks and fun stuff like that, at least
that's my understanding. Not really an option to just run the code in
linux.

I'm pretty sure nvidia would appreciate that, and maybe every once in
a while they open up a header for a generation or two of products like
they've done in the past.

> Both cases seem to have ended up with something useless, and I have a
> hard time saying nouveau has more right to be in the kernel tree than
> Habana does.
>
> > > Honestly, I think GPU is approaching this backwards. Wayland should
> > > have been designed to prevent proprietary userspace stacks.
> >
> > That's not possible without some serious cans of worms though. Wayland
> > is a protocol, and you can't forbid people from implementing it.
> > Otherwise all the compatible open implementations of closed protocols
> > wouldn't be possible either.
>
> Well, in many ways so is Linux, but nobody would seriously
> re-implement Linux just to produce a driver.

Well in the gpu space for 2+ decades nvidia has been setting the
standard, and the open stack has been trying to catch up by
reimplementing the entire thing. It took a fair while.

> > So I'm not clear what you're suggesting here we should do different.
>
> Not enabling proprietary stacks as above would be a good start.

I'm still not sure what exactly you mean here. Like on the 3d side
there's opengl and vulkan, and nvidia just has an entirely different
implementation of that compared to any of the open drivers. That is a
bit less code than linux, but it's not small, and reimplementing over
decades is pretty much what happened. And if it's not allowed we'd
actually not have an open 3d gpu stack at all, because only very
recently did we get an agreement around the tracemark/licensing issues
of that stuff with Khronos. Recently compared to the history of opengl
at least.

So I'm still not clear what exactly it is you're suggesting we should
do? Not implement the industry standards for 3d (and accept we stay
irrelevant forever)? Reject nvidia blobs harder than we do already?
Distros will continue to ship an auto-installer for that stack, at
least some, so we're pretty much maxed out already. Like in what way
do you think the upstream stack does enable the proprietary nvidia
stack? Should we permanently ban any contributions from anyone with an
@nvidia.com address, even if it helps the open stack improve?

Like I'm not seeing something concrete that could be done, which would
actually prevent nvidia from having their completely independent
stack, with exact same functionality and not a line of code shared.
Which is were we are right now. The only thing where we could be more
strict is to reject any contributions from them at all, just because
we don't like them. That seems a bit too extreme
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch