[PATCH] accel/qaic: Add Reliability, Accessibility, Serviceability (RAS)

Jacek Lawrynowicz jacek.lawrynowicz at linux.intel.com
Tue May 13 16:13:21 UTC 2025


Hi,

On 5/13/2025 5:05 PM, Jeff Hugo wrote:
> On 5/13/2025 3:53 AM, Jacek Lawrynowicz wrote:
>> Hi,
>>
>> On 5/12/2025 9:49 PM, Jeff Hugo wrote:
>>> diff --git a/drivers/accel/qaic/qaic_ras.c b/drivers/accel/qaic/qaic_ras.c
>>> new file mode 100644
>>> index 000000000000..2f8c1f08dbc0
>>> --- /dev/null
>>> +++ b/drivers/accel/qaic/qaic_ras.c
>>> @@ -0,0 +1,629 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +
>>> +/* Copyright (c) 2020-2021, The Linux Foundation. All rights reserved. */
>>> +/* Copyright (c) 2022-2024 Qualcomm Innovation Center, Inc. All rights reserved. */
>>
>> 2025?
> 
> No, we haven't made any changes to this file this year, so our policy would be to omit 2025.  Historically we've experienced that the community wants the file markings like this to track any copyright prior to the inclusion of the code into upstream/mainline, and then to rely on git metadata to track copyright after inclusion. Therefore that is the policy we follow.  However, these markings might be changing based the rest of your feedback.
> 
>>> +struct ras_data {
>>> +    /* header start */
>>> +    /* Magic number to validate the message */
>>> +    u16 magic;
>>> +    /* RAS version number */
>>> +    u16 ver;
>>> +    u32 seq_num;
>>> +    /* RAS message type */
>>> +    u8  type;
>>> +    u8  id;
>>> +    /* Size of RAS message without the header in byte */
>>> +    u16 len;
>>> +    /* header end */
>>> +    s32 result;
>>> +    /*
>>> +     * Error source
>>> +     * 0 : SoC Memory
>>> +     * 1 : PCIE
>>> +     * 2 : DDR
>>> +     * 3 : System Bus source 1
>>> +     * 4 : System Bus source 2
>>> +     * 5 : NSP Memory
>>> +     * 6 : Temperature Sensors
>>> +     */
>>> +    u32 source;
>>> +    /*
>>> +     * Stores the error type, there are three types of error in RAS
>>> +     * 0 : correctable error (CE)
>>> +     * 1 : uncorrectable error (UE)
>>> +     * 2 : uncorrectable error that is non-fatal (UE_NF)
>>> +     */
>>> +    u32 err_type;
>>> +    u32 err_threshold;
>>
>> This is unused. Maybe it could be useful?
> 
> The device can be configured to only make a RAS report to the host after a threshold of events has occured - say every 10 DDR ECC events, report to the host (qaic driver). This field basically restates what that configured limit is. I suppose we can include it in the logged reports to signify that this report really represents N incidents on the device.
> 
>>> +    case PCIE:
>>> +        pci_printk(level, qdev->pdev, "RAS event.\nClass:%s\nDescription:%s %s %s\n",
>>> +               err_class_str[msg->err_type],
>>> +               err_type_str[msg->err_type],
>>> +               "error from",
>>> +               err_src_str[msg->source]);
>>> +
>>> +        switch (msg->err_type) {
>>> +        case CE:
>>> +            printk(KERN_WARNING pr_fmt("Syndrome:\n    Bad TLP count %d\n    Bad DLLP count %d\n    Replay Rollover count %d\n    Replay Timeout count %d\n    Recv Error count %d\n    Internal CE count %d\n"),
>>> +                   pcie_syndrome->bad_tlp,
>>> +                   pcie_syndrome->bad_dllp,
>>> +                   pcie_syndrome->replay_rollover,
>>> +                   pcie_syndrome->replay_timeout,
>>> +                   pcie_syndrome->rx_err,
>>> +                   pcie_syndrome->internal_ce_count);
>>
>> Why not pci_printk() that would be conistent with the rest of logging?
>> It there is a reson I would prefer pr_warn/pr_err style logs.
> 
> This is a special case. This is a continuation of the pci_printk() a few lines up. If we do pci_printk() here, then the entire message gets broken up is a weird way. In the middle of the report, you'll have the "header" that pci_printk() adds (PCI device, driver, etc) repeted.
> 
> The way to avoid that would be to restructure this bit of the code to have all the switches/ifs resolved, and have a single pci_printk() for the entire decoded report.  That means we'll have a lot of duplicated code since the common "report header" for the different PCIe reports would need to be duplicated for each report variant.
> 
> This felt like the cleaner solution, although it does have its quirks.
> 
> Would a comment help?

Sure, comment would make this more readable. I would still consider using a single pci_printk() per event anyway because you could get these messages broken up if there is a lot of dmesg output from other places.
printk() takes a global lock in case there is VGA or serial console connected, so the messages wouldn't be split but it is up to you if you prefer cleaner code or cleaner logs.

>>> +static void qaic_ras_mhi_dl_xfer_cb(struct mhi_device *mhi_dev, struct mhi_result *mhi_result)
>>> +{
>>> +    struct qaic_device *qdev = dev_get_drvdata(&mhi_dev->dev);
>>> +    struct ras_data *msg = mhi_result->buf_addr;
>>> +    int ret;
>>> +
>>> +    if (mhi_result->transaction_status) {
>>> +        kfree(msg);
>>> +        return;
>>> +    }
>>> +
>>> +    ras_msg_to_cpu(msg);
>>> +    decode_ras_msg(qdev, msg);
>>> +
>>> +    ret = mhi_queue_buf(qdev->ras_ch, DMA_FROM_DEVICE, msg, sizeof(*msg), MHI_EOT);
>>> +    if (ret) {
>>> +        dev_err(&mhi_dev->dev, "Cannot requeue RAS recv buf %d\n", ret);
>>> +        kfree(msg);
>>
>> Woudn't error here prevent any future messages from being received?
> 
> Sadly, yes. This should only happen if there is some issue with the underlying PCIe link.
> 
>>> diff --git a/drivers/accel/qaic/qaic_ras.h b/drivers/accel/qaic/qaic_ras.h
>>> new file mode 100644
>>> index 000000000000..5df6cb9dae80
>>> --- /dev/null
>>> +++ b/drivers/accel/qaic/qaic_ras.h
>>> @@ -0,0 +1,11 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only
>>
>> Should be:
>> /* SPDX-License-Identifier: GPL-2.0-only */
>> or
>> // SPDX-License-Identifier: GPL-2.0-only
> 
> The "//" syntax is for C source files (foo.c) and this is a header file, so I don't think that suggestion applies.
> 
> https://docs.kernel.org/process/license-rules.html
> 
> C style comment ( /* */ ) is the correct syntax for header files. It is unclear to me that the marking needs to be its own comment, instead of included in the body of another comment. I would say that it is typical to have license markings and copyright markings in the same comment block.
> 
> Do you have a reference you can point me to that would clarify this? Perhaps a different file in Documentation or another email thread?

This is a little bit pedantic but the docs you pointed to and every other .h file in the driver seem to close the comment on the same line.

>>> + *
>>> + * Copyright (c) 2020, The Linux Foundation. All rights reserved.
>>
>> 2025?
> 
> No, per the above reasons.
> 
>>> + */
>>> +
>>> +#ifndef __QAIC_RAS_H__
>>> +#define __QAIC_RAS_H__
>>> +
>>> +int qaic_ras_register(void);
>>> +void qaic_ras_unregister(void);
>>
>> new line?
> 
> Ok.
> 
>>> +#endif /* __QAIC_RAS_H__ */
>>
> 



More information about the dri-devel mailing list