[Intel-xe] [RFC] drm/xe: Introduce xe_ASSERT macros

Tue Aug 1 08:19:07 UTC 2023

On Fri, 28 Jul 2023, Michal Wajdeczko <michal.wajdeczko at intel.com> wrote:
> On 28.07.2023 21:47, Vivi, Rodrigo wrote:
>> On Fri, 2023-07-28 at 21:25 +0200, Michal Wajdeczko wrote:
>>> As we are moving away from the controversial XE_BUG_ON macro,
>>> relying just on WARN_ON or drm_err does not cover the cases
>>> where we want to annotate functions with additional detailed
>>> debug checks to assert that all prerequisites are satisfied,
>>> without paying footprint or performance penalty on non-debug
>>> builds, where all misuses introduced during code integration
>>> were already fixed.
>>>
>>> Introduce family of xe_ASSERT macros that try to follow classic
>>> assert() utility and can be compiled out on non-debug builds.
>> 
>> This is indeed much better and clear then the XE_BUG_ON.
>> 
>> Also, let me be clear that my intention with reply is not to
>> block the introduction of this assert macro.
>
> thanks
>
>> 
>> But I'd like to raise my concern(s?) with such assert macros:
>> 
>> In many cases that we currently use these macros are actually
>> very useful on production build. Whenever we receive bug reports
>> we should be able to ask the reporter to get a dmesg or enable
>> a drm.debug and get a message and we could see these, without
>> ever having to ask the report to rebuild the kernel with a debug
>> config.
>
> you should think of xe_assert() as extension to static_assert() but
> aimed to be executed during internal CI builds only, when we run basic
> integration tests, and were we are able to quickly catch all misuses,
> caused by rapid or multiple changes going in.
>
>> 
>> Then, we add this in the code and we start using this everywhere
>> without stopping to think if that would be good in a report and
>> we end up with a code full of asserts and big problem to get the
>> useful information from the bug reports.
>
> those asserts shouldn't fire beyond integration builds - if someone uses
> assert() to tag unfinished work, then it's simply wrong, in such cases
> true WARN and correct fallback shall be implemented - and that's the
> role of the reviewers to spot it ;)

Naming this ASSERT is the first step towards proper usage, because it
implies it might be different for regular and debug builds.

The second step would be adding a comment documenting the usage.

I think in general the idea behind the conditional GEM_BUG_ONs was to
avoid the performance impact of leaving them in place. But then we've
started sprinkling GEM_BUG_ON indiscriminately, because you didn't have
to think about performance. 1200+ instances of GEM_BUG_ON in i915. It's
just crazy.

>
>> 
>> Aside from the fact that we end up going far from other drm and
>> non-drm kernel drivers and starting from day 0 a driver full of
>> driver-isms.
>
> maybe because our driver(s) are bigger and more complex ? also much more
> developers are working on it, with multiple refactoring going constantly
> on, so we should have mechanism to allow us catching unexpected uses
> without polluting production driver too much.
>
> btw, DRM already has assert() but named as bug()

Do git blame on that and realize it originated from us following the
same pattern as GEM_BUG_ON, which got copy-pasted as XE_BUG_ON. It's
just problematic all around, and upstream consensus is that BUG_ON() is
not even for CI. Use warn and panic_on_warn instead.

BR,
Jani.

>
> [1] https://elixir.bootlin.com/linux/latest/C/ident/DRM_MM_BUG_ON
>
>> 
>> I would prefer we have a driver where we make usage of the all
>> the drm error, warning, debug, info variants on a case by case,
>> carefully analyzing the needs and thinking about the bug reports
>> that we might receive from end users.
>
> +1
>
> it would be also nice if driver is correctly handling all errors or
> invalid input paths, but at the same time, it shouldn't duplicate checks
> that shall be already done at higher layer/caller - with asserts() we
> just express our expectations without ignoring other bugs by trying
> handle that unexpected case down the stack
>
>> 
>>>
>>> Macros are based on drm_WARN, but unlikely to origin, disallow
>>> use in expressions since we will compile that code out.
>>>
>>> As we are operating on the xe pointers, we can print additional
>>> information about the device, like tile or GT identifier, that
>>> is not available from generic WARN report:
>>>
>>> [ ] xe 0000:00:02.0: [drm] Assertion `false` failed!
>>>     graphics: Xe_LP 12.0 media: Xe_M 12.0 GT0
>>> [ ] WARNING: CPU: 4 PID: 8442 at drivers/gpu/drm/xe/xe_device.c:280
>>> xe_device_probe+0x2f5/0x4a0 [xe]
>>> [ ] RIP: 0010:xe_device_probe+0x2f5/0x4a0 [xe]
>>>
>>> Signed-off-by: Michal Wajdeczko <michal.wajdeczko at intel.com>
>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>> Cc: Francois Dugast <francois.dugast at intel.com>
>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>> ---
>>>  drivers/gpu/drm/xe/xe_assert.h | 47
>>> ++++++++++++++++++++++++++++++++++
>>>  1 file changed, 47 insertions(+)
>>>  create mode 100644 drivers/gpu/drm/xe/xe_assert.h
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_assert.h
>>> b/drivers/gpu/drm/xe/xe_assert.h
>>> new file mode 100644
>>> index 000000000000..65306768d637
>>> --- /dev/null
>>> +++ b/drivers/gpu/drm/xe/xe_assert.h
>>> @@ -0,0 +1,47 @@
>>> +/* SPDX-License-Identifier: MIT */
>>> +/*
>>> + * Copyright © 2023 Intel Corporation
>>> + */
>>> +
>>> +#ifndef __XE_ASSERT_H__
>>> +#define __XE_ASSERT_H__
>>> +
>>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
>>> +
>>> +#define xe_ASSERT_MSG(xe, condition, fmt, arg...)
>>> ({                                           \
>>> +       struct xe_device *__xe =
>>> (xe);                                                          \
>>> +       int __passed =
>>> !!(condition);                                                       
>>>     \
>>> +       drm_WARN(&__xe->drm, !__passed, "[" DRM_NAME "] Assertion
>>> `%s` failed!\n"               \
>>> +                "graphics: %s %u.%u media: %s %u.%u "
>>> fmt,                                     \
>>> +               
>>> __stringify(condition),                                              
>>>           \
>>> +                __xe-
>>>> info.graphics_name,                                                 
>>>      \
>>> +                __xe->info.graphics_verx100 / 100, __xe-
>>>> info.graphics_verx100 % 100,          \
>>> +                __xe-
>>>> info.media_name,                                                    
>>>      \
>>> +                __xe->info.media_verx100 / 100, __xe-
>>>> info.media_verx100 % 100,                \
>>> +                ##
>>> arg);                                                                
>>>        \
>>> +       (void)(__passed);                                            
>>>                            \
>>> +})
>>> +
>>> +#else
>>> +
>>> +#define xe_ASSERT_MSG(xe, condition, fmt, arg...)
>>> ({                                           \
>>> +       typecheck(struct xe_device *,
>>> xe);                                                      \
>>> +       BUILD_BUG_ON_INVALID(condition);                             
>>>                            \
>>> +})
>>> +
>>> +#endif
>>> +
>>> +#define xe_ASSERT(xe, condition) \
>>> +       xe_ASSERT_MSG((xe), condition, "")
>>> +
>>> +#define xe_tile_ASSERT(tile, condition)
>>> ({                                                     \
>>> +       struct xe_tile *__tile =
>>> (tile);                                                        \
>>> +       xe_ASSERT_MSG(tile_to_xe(__tile), condition, "tile%u ",
>>> __tile->id);                    \
>>> +})
>>> +
>>> +#define xe_gt_ASSERT(gt, condition)
>>> ({                                                         \
>>> +       struct xe_gt *__gt =
>>> (gt);                                                              \
>>> +       xe_ASSERT_MSG(gt_to_xe(__gt), condition, "GT%u ", __gt-
>>>> info.id);                       \
>>> +})
>>> +
>>> +#endif /* __XE_ASSERT_H__ */
>> 

-- 
Jani Nikula, Intel Open Source Graphics Center