[RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem

Aravind Iddamsetty aravind.iddamsetty at linux.intel.com
Mon Aug 25 09:38:30 UTC 2025


On 14-08-2025 01:51, Rodrigo Vivi wrote:
> On Wed, Jul 30, 2025 at 12:19:51PM +0530, Aravind Iddamsetty wrote:
>> Revisiting this patch series to address pending feedback and help move
>> the discussion towards a conclusion. This revision includes updates
>> based on previous comments[1] and aims to clarify outstanding concerns.
>> Specifically, a command has been added to facilitate reporting errors
>> from IP blocks, to support the AMDGPU driver model of RAS.
>> [1]: https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux.intel.com/
>>
>> I sincerely appreciate everyone's patience and thoughtful reviews so
>> far, and I hope this refreshed series facilitates the final evaluation
>> and acceptance.
>>
>> Please feel free to share any further suggestions or questions.
>>
>> Thank you for your continued consideration.
>> ----------------------------------------------------------------------
>>
>> Our hardware supports RAS (Reliability, Availability, Serviceability)
>> by reporting errors to the host, which the KMD processes and exposes
>> as a set of error counters that observability tools can use to take
>> corrective actions or perform repairs. Traditionally these were
>> exposed via PMU (for relative counters) and a sysfs interface (for
>> absolute values) in our internal branch. But due to the limitations of
>> this approach, namely needing two interfaces and lacking event-based
>> reporting and configurability, the community suggested trying netlink
>> as a drm subsystem wide UAPI for RAS and telemetry, as discussed
>> in [2].
>>
>> [2] is the inspiration for this series. It uses the generic
>> netlink (genl) family subsystem and exposes a set of commands that
>> can be used by every drm driver; the framework provides a means to
>> add custom commands too. Each drm driver instance (in this example,
>> an xe driver instance) registers a family and operations with the
>> genl subsystem, through which it enumerates and reports the error
>> counters. Event-based notification is also supported: userspace can
>> subscribe and be notified when any error occurs, then read the error
>> counter, which avoids continuous polling on the error counters. This
>> can also be extended to threshold-based notification.
>>
>> [2]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
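
For readers unfamiliar with genl, the registration described above
boils down to roughly the following shape. This is only a sketch for
discussion, not the series code; the family, op and attribute names
(xe_ras, xe_ras_read_block, DRM_RAS_ATTR_MAX, error_events) are
placeholders:

#include <net/genetlink.h>

/* userspace subscribes to this group for event-based notification */
static const struct genl_multicast_group xe_ras_mcgrps[] = {
	{ .name = "error_events" },
};

static const struct genl_small_ops xe_ras_ops[] = {
	{
		.cmd  = DRM_RAS_CMD_READ_BLOCK,
		.doit = xe_ras_read_block,	/* decodes the block selector */
	},
};

/* one family registered per drm driver instance */
static struct genl_family xe_ras_family = {
	.name        = "xe_ras",
	.version     = 1,
	.maxattr     = DRM_RAS_ATTR_MAX,
	.small_ops   = xe_ras_ops,
	.n_small_ops = ARRAY_SIZE(xe_ras_ops),
	.mcgrps      = xe_ras_mcgrps,
	.n_mcgrps    = ARRAY_SIZE(xe_ras_mcgrps),
	.module      = THIS_MODULE,
};

/* at driver probe: genl_register_family(&xe_ras_family);
 * on an error interrupt: genlmsg_multicast() to "error_events" */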
> I'm bringing some thoughts below and I'd like to get inputs from folks involved
> in the original discussions here please.
> Any thought is welcome so we can move faster towards a real GPU standard RAS
> solution.
>
>> This series is on top of series [3], which introduces error counting
>> infra in the Xe driver.
>> [3]: https://lore.kernel.org/all/20250730054814.1376770-1-aravind.iddamsetty@linux.intel.com/
>>
>> V5:
>> Add support to read error corresponding to an IP BLOCK
> I honestly don't believe that this version solves all the concerns raised by
> AMD folks in the previous reviews. It is true that this is bringing ways of
> reading errors per IP block, but if I understood them correctly, they would
> like better (and separate) ways to declare and handle the errors coming from
> different IP blocks, rather than simply reading/querying for them filtered out.
>
> So, I have some grouping ideas below.

As per the comment from Lijo,
https://lore.kernel.org/all/aa23f0ef-a4ab-ca73-5ab3-ef23d6e36e89@amd.com/

the errors are grouped via a bitmask; they are not expecting a separation
at the netlink level:

<31:24> - block id
<23:16> - subblock id
<15:8>  - instance of interest
<7:0>   - error_type

The interface should support errors per IP block and instance, which
the current series supports via DRM_RAS_CMD_READ_BLOCK. When the driver
receives the DRM_RAS_CMD_READ_BLOCK command, it is supposed to decipher
the bits based on the above bitmask. The query command is expected to
list the available blocks and instances, whose counters will then be
read via DRM_RAS_CMD_READ_BLOCK.
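
For illustration, the decode on the driver side could look roughly like
the below (the attribute name DRM_RAS_ATTR_BLOCK and the handler name
are placeholders):

#include <linux/bitfield.h>

#define RAS_BLOCK_ID	GENMASK(31, 24)
#define RAS_SUBBLOCK	GENMASK(23, 16)
#define RAS_INSTANCE	GENMASK(15, 8)
#define RAS_ERR_TYPE	GENMASK(7, 0)

static int xe_ras_read_block(struct sk_buff *skb, struct genl_info *info)
{
	/* selector encoded by userspace per the layout above */
	u32 sel = nla_get_u32(info->attrs[DRM_RAS_ATTR_BLOCK]);
	u8 block    = FIELD_GET(RAS_BLOCK_ID, sel);
	u8 subblock = FIELD_GET(RAS_SUBBLOCK, sel);
	u8 instance = FIELD_GET(RAS_INSTANCE, sel);
	u8 type     = FIELD_GET(RAS_ERR_TYPE, sel);

	/* look up the counter for (block, subblock, instance, type)
	 * and return it to userspace with genlmsg_reply(); elided */
	return 0;
}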
>
>> v4:
>> 1. Rebase
>> 2. rename drm_genl_send to drm_genl_reply
> But before going to the ideas below I'd like to also raise the naming issue
> that I see with this proposal.
>
> I was recently running some experiments to devlink with this and similar
> cases. I don't believe that devlink is a good fit for our drm-ras. It is
> way too much centric on network devices and any addition there to our
> GPU RAS would be a heavy lift. But, there are some good things from there
> that we could perhaps get inspiration from.
>
> Starting from the name. devlink is the name of the tool and the name
> of the framework. It uses netlink underneath, but abstracts it entirely.
> Here in this version we can see:
> drm_ras: the tool
> drm_netlink: the abstraction
> drm_genl_*: the wrapper?
>
> So, I believe that, as with devlink, we should have a single name for
> everything and avoid wrappers, instead providing the real module
> registration, with groups and functions, entirely abstracting the
> netlink and focusing on the RAS functionality.
Sounds interesting, and I feel it looks clean too. But that does mean we
handle the netlink framework completely inside the drm layer rather than
in the driver, and expose callback ops to the drm drivers.
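
Something like the below is what I would picture in that case; all names
here are hypothetical, just to make the discussion concrete:

/* drm core owns the single genl family and all netlink plumbing;
 * drivers only register a set of RAS callbacks. */
struct drm_ras_ops {
	/* enumerate available blocks/instances for the query command */
	int (*list_blocks)(struct drm_device *dev, struct sk_buff *msg);
	/* return the counter for one decoded selector
	 * (block, subblock, instance, error_type) */
	int (*read_block)(struct drm_device *dev, u32 selector, u64 *counter);
};

/* at probe:
 *	drm_ras_register(drm, &xe_ras_ops);
 * on an error interrupt, the driver only signals the core, which
 * multicasts the event to subscribed userspace:
 *	drm_ras_notify(drm, selector);
 */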

Thanks,
Aravind.