[PATCH 11/16] gpu: nova-core: add falcon register definitions and base code

Thu May 1 00:09:28 UTC 2025

On Thu May 1, 2025 at 3:16 AM JST, Danilo Krummrich wrote:
> On Wed, Apr 30, 2025 at 10:38:11AM -0400, Joel Fernandes wrote:
>> On 4/30/2025 9:25 AM, Alexandre Courbot wrote:
>> > On Tue Apr 22, 2025 at 11:44 PM JST, Danilo Krummrich wrote:
>> 
>> >>> +/// Returns a boxed falcon HAL adequate for the passed `chipset`.
>> >>> +///
>> >>> +/// We use this function and a heap-allocated trait object instead of statically defined trait
>> >>> +/// objects because of the two-dimensional (Chipset, Engine) lookup required to return the
>> >>> +/// requested HAL.
>> >>
>> >> Do we really need the dynamic dispatch? AFAICS, there's only E::BASE that is
>> >> relevant to FalconHal impls?
>> >>
>> >> Can't we do something like I do in the following example [1]?
>> >>
>> >> [1] https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=bf7035a07e79a4047fb6834eac03a9f2
>> > 
>> > So are you have noticed there are two dimensions from which the falcons
>> > can be instantiated:
>> > 
>> > - The engine, which determines its register BASE,
>> > - The HAL, which is determined by the chipset.
>> > 
>> > For the engine, I want to keep things static for the main reason that if
>> > BASE was dynamic, we would have to do all our IO using
>> > try_read()/try_write() and check for an out-of-bounds error at each
>> > register access. The cost of monomorphization is limited as there are
>> > only a handful of engines.
>> > 
>> > But the HAL introduces a second dimension to this, and if we support N
>> > engines then the amount of monomorphized code would then increase by N
>> > for each new HAL we add. Chipsets are released at a good cadence, so
>> > this is the dimension that risks growing the most.
>
> I agree, avoiding the dynamic dispatch is probably not worth in this case
> considering the long term. However, I wanted to point out an alternative with
> [2].
>
>> > It is also the one that makes use of methods to abstract things (vs.
>> > fixed parameters), so it is a natural candidate for using virtual
>> > methods. I am not a fan of having ever-growing boilerplate match
>> > statements for each method that needs to be abstracted, especially since
>> > this is that virtual methods do without requiring extra code, and for a
>> > runtime penalty that is completely negligible in our context and IMHO
>> > completely balanced by the smaller binary size that results from their
>> > use.
>>
>> Adding to what Alex said, note that the runtime cost is still there even without
>> using dyn. Because at runtime, the match conditionals need to route function
>> calls to the right place.
>
> Honestly, I don't know how dynamic dispatch scales compared to static dispatch
> with conditionals.
>
> OOC, I briefly looked for a benchmark and found [3], which doesn't look
> unreasonable at a first glance.
>
> I modified it real quick to have more than 2 actions. [4]
>
> 2 Actions
> ---------
> Dynamic Dispatch: time:   [2.0679 ns 2.0825 ns 2.0945 ns]
>  Static Dispatch: time:   [850.29 ps 851.05 ps 852.36 ps]
>
> 20 Actions
> ----------
> Dynamic Dispatch: time:   [21.368 ns 21.827 ns 22.284 ns]
>  Static Dispatch: time:   [1.3623 ns 1.3703 ns 1.3793 ns]
>
> 100 Actions
> -----------
> Dynamic Dispatch: time:   [103.72 ns 104.33 ns 105.13 ns]
>  Static Dispatch: time:   [4.5905 ns 4.6311 ns 4.6775 ns]
>
> Absolutely take it with a grain of salt, I neither spend a lot of brain power
> nor time on this, which usually is not a great combination with benchmarking
> things. :)
>
> However, I think it's probably not too important here. Hence, feel free to go
> with dynamic dispatch for this.

Indeed, it looks like the cost of dispatch will be completely shadowed
by the IO behind it anyway. And these HAL calls are like a few here and
there anyway, it's not like they are on a critical path.

>
>> I am just not seeing the benefits of not using dyn for
>> this use case and only drawbacks. IMHO, we should try to not be doing the
>> compiler's job.
>> 
>> Maybe the only benefit is you don't need an Arc or Kbox wrapper?
>
> That's not a huge concern for me, it's only one single allocation per Engine,
> correct?

Correct. Note that for other engines we will be able to store the HALs as
static singletons instead of building them on the heap like I am
currently doing. The reason for doing this on falcon is that the
dual-dimension of the instances makes it more complex to build and look
them up.

... or maybe I could just use a macro? Let me try that and see whether
it works.