[Intel-xe] [RFC] drm/xe: Introduce xe_ASSERT macros

Fri Aug 4 20:47:48 UTC 2023

On Tue, Aug 01, 2023 at 11:19:07AM +0300, Jani Nikula wrote:
> On Fri, 28 Jul 2023, Michal Wajdeczko <michal.wajdeczko at intel.com> wrote:
> > On 28.07.2023 21:47, Vivi, Rodrigo wrote:
> >> On Fri, 2023-07-28 at 21:25 +0200, Michal Wajdeczko wrote:
> >>> As we are moving away from the controversial XE_BUG_ON macro,
> >>> relying just on WARN_ON or drm_err does not cover the cases
> >>> where we want to annotate functions with additional detailed
> >>> debug checks to assert that all prerequisites are satisfied,
> >>> without paying footprint or performance penalty on non-debug
> >>> builds, where all misuses introduced during code integration
> >>> were already fixed.
> >>>
> >>> Introduce family of xe_ASSERT macros that try to follow classic
> >>> assert() utility and can be compiled out on non-debug builds.
> >> 
> >> This is indeed much better and clear then the XE_BUG_ON.
> >> 
> >> Also, let me be clear that my intention with reply is not to
> >> block the introduction of this assert macro.
> >
> > thanks
> >
> >> 
> >> But I'd like to raise my concern(s?) with such assert macros:
> >> 
> >> In many cases that we currently use these macros are actually
> >> very useful on production build. Whenever we receive bug reports
> >> we should be able to ask the reporter to get a dmesg or enable
> >> a drm.debug and get a message and we could see these, without
> >> ever having to ask the report to rebuild the kernel with a debug
> >> config.
> >
> > you should think of xe_assert() as extension to static_assert() but
> > aimed to be executed during internal CI builds only, when we run basic
> > integration tests, and were we are able to quickly catch all misuses,
> > caused by rapid or multiple changes going in.
> >
> >> 
> >> Then, we add this in the code and we start using this everywhere
> >> without stopping to think if that would be good in a report and
> >> we end up with a code full of asserts and big problem to get the
> >> useful information from the bug reports.
> >
> > those asserts shouldn't fire beyond integration builds - if someone uses
> > assert() to tag unfinished work, then it's simply wrong, in such cases
> > true WARN and correct fallback shall be implemented - and that's the
> > role of the reviewers to spot it ;)
> 
> Naming this ASSERT is the first step towards proper usage, because it
> implies it might be different for regular and debug builds.
> 
> The second step would be adding a comment documenting the usage.

Yes, please. Let's add the documentation so it gets easier to everyone
to understand the rules and help during reviews to avoid the same
mistake written below.

With the doc in place:

Acked-by: Rodrigo Vivi <rodrigo.vivi at intel.com>

> 
> I think in general the idea behind the conditional GEM_BUG_ONs was to
> avoid the performance impact of leaving them in place. But then we've
> started sprinkling GEM_BUG_ON indiscriminately, because you didn't have
> to think about performance. 1200+ instances of GEM_BUG_ON in i915. It's
> just crazy.
> 
> >
> >> 
> >> Aside from the fact that we end up going far from other drm and
> >> non-drm kernel drivers and starting from day 0 a driver full of
> >> driver-isms.
> >
> > maybe because our driver(s) are bigger and more complex ? also much more
> > developers are working on it, with multiple refactoring going constantly
> > on, so we should have mechanism to allow us catching unexpected uses
> > without polluting production driver too much.
> >
> > btw, DRM already has assert() but named as bug()
> 
> Do git blame on that and realize it originated from us following the
> same pattern as GEM_BUG_ON, which got copy-pasted as XE_BUG_ON. It's
> just problematic all around, and upstream consensus is that BUG_ON() is
> not even for CI. Use warn and panic_on_warn instead.
> 
> BR,
> Jani.
> 
> >
> > [1] https://elixir.bootlin.com/linux/latest/C/ident/DRM_MM_BUG_ON
> >
> >> 
> >> I would prefer we have a driver where we make usage of the all
> >> the drm error, warning, debug, info variants on a case by case,
> >> carefully analyzing the needs and thinking about the bug reports
> >> that we might receive from end users.
> >
> > +1
> >
> > it would be also nice if driver is correctly handling all errors or
> > invalid input paths, but at the same time, it shouldn't duplicate checks
> > that shall be already done at higher layer/caller - with asserts() we
> > just express our expectations without ignoring other bugs by trying
> > handle that unexpected case down the stack
> >
> >> 
> >>>
> >>> Macros are based on drm_WARN, but unlikely to origin, disallow
> >>> use in expressions since we will compile that code out.
> >>>
> >>> As we are operating on the xe pointers, we can print additional
> >>> information about the device, like tile or GT identifier, that
> >>> is not available from generic WARN report:
> >>>
> >>> [ ] xe 0000:00:02.0: [drm] Assertion `false` failed!
> >>>     graphics: Xe_LP 12.0 media: Xe_M 12.0 GT0
> >>> [ ] WARNING: CPU: 4 PID: 8442 at drivers/gpu/drm/xe/xe_device.c:280
> >>> xe_device_probe+0x2f5/0x4a0 [xe]
> >>> [ ] RIP: 0010:xe_device_probe+0x2f5/0x4a0 [xe]
> >>>
> >>> Signed-off-by: Michal Wajdeczko <michal.wajdeczko at intel.com>
> >>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> >>> Cc: Francois Dugast <francois.dugast at intel.com>
> >>> Cc: Matthew Brost <matthew.brost at intel.com>
> >>> ---
> >>>  drivers/gpu/drm/xe/xe_assert.h | 47
> >>> ++++++++++++++++++++++++++++++++++
> >>>  1 file changed, 47 insertions(+)
> >>>  create mode 100644 drivers/gpu/drm/xe/xe_assert.h
> >>>
> >>> diff --git a/drivers/gpu/drm/xe/xe_assert.h
> >>> b/drivers/gpu/drm/xe/xe_assert.h
> >>> new file mode 100644
> >>> index 000000000000..65306768d637
> >>> --- /dev/null
> >>> +++ b/drivers/gpu/drm/xe/xe_assert.h
> >>> @@ -0,0 +1,47 @@
> >>> +/* SPDX-License-Identifier: MIT */
> >>> +/*
> >>> + * Copyright © 2023 Intel Corporation
> >>> + */
> >>> +
> >>> +#ifndef __XE_ASSERT_H__
> >>> +#define __XE_ASSERT_H__
> >>> +
> >>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> >>> +
> >>> +#define xe_ASSERT_MSG(xe, condition, fmt, arg...)
> >>> ({                                           \
> >>> +       struct xe_device *__xe =
> >>> (xe);                                                          \
> >>> +       int __passed =
> >>> !!(condition);                                                       
> >>>     \
> >>> +       drm_WARN(&__xe->drm, !__passed, "[" DRM_NAME "] Assertion
> >>> `%s` failed!\n"               \
> >>> +                "graphics: %s %u.%u media: %s %u.%u "
> >>> fmt,                                     \
> >>> +               
> >>> __stringify(condition),                                              
> >>>           \
> >>> +                __xe-
> >>>> info.graphics_name,                                                 
> >>>      \
> >>> +                __xe->info.graphics_verx100 / 100, __xe-
> >>>> info.graphics_verx100 % 100,          \
> >>> +                __xe-
> >>>> info.media_name,                                                    
> >>>      \
> >>> +                __xe->info.media_verx100 / 100, __xe-
> >>>> info.media_verx100 % 100,                \
> >>> +                ##
> >>> arg);                                                                
> >>>        \
> >>> +       (void)(__passed);                                            
> >>>                            \
> >>> +})
> >>> +
> >>> +#else
> >>> +
> >>> +#define xe_ASSERT_MSG(xe, condition, fmt, arg...)
> >>> ({                                           \
> >>> +       typecheck(struct xe_device *,
> >>> xe);                                                      \
> >>> +       BUILD_BUG_ON_INVALID(condition);                             
> >>>                            \
> >>> +})
> >>> +
> >>> +#endif
> >>> +
> >>> +#define xe_ASSERT(xe, condition) \
> >>> +       xe_ASSERT_MSG((xe), condition, "")
> >>> +
> >>> +#define xe_tile_ASSERT(tile, condition)
> >>> ({                                                     \
> >>> +       struct xe_tile *__tile =
> >>> (tile);                                                        \
> >>> +       xe_ASSERT_MSG(tile_to_xe(__tile), condition, "tile%u ",
> >>> __tile->id);                    \
> >>> +})
> >>> +
> >>> +#define xe_gt_ASSERT(gt, condition)
> >>> ({                                                         \
> >>> +       struct xe_gt *__gt =
> >>> (gt);                                                              \
> >>> +       xe_ASSERT_MSG(gt_to_xe(__gt), condition, "GT%u ", __gt-
> >>>> info.id);                       \
> >>> +})
> >>> +
> >>> +#endif /* __XE_ASSERT_H__ */
> >> 
> 
> -- 
> Jani Nikula, Intel Open Source Graphics Center