[PATCH i-g-t] lib/intel_device_info: Use per-thread device info cache
Zbigniew Kempczyński
zbigniew.kempczynski at intel.com
Thu Nov 21 17:13:52 UTC 2024
On Wed, Nov 20, 2024 at 02:35:46PM -0800, Matt Roper wrote:
> Relying on static variables to act as a cache inside the device info
> lookup function is racy for multi-threaded tests since there's only one
> copy of the static variables shared by all threads. E.g.,
>
> Thread 1 Thread 2
> ======== ========
> cached_devid = devid;
> if (cached_devid == devid)
> return cache;
> cache = ...
>
> Indeed, running xe_exec_threads repeatedly turns up cases where the test
> fails with "Platform is missing PAT settings for uc/wt/wb" because a
> racing thread wound up getting 'intel_generic_info' rather than a proper
> device info structure (although it can take a couple thousand tries to
> hit this specific race condition). CI also sees this from time to time,
> but the bugs often get closed as "cannot reproduce" because the failure
> rate is so low.
>
> We're probably going to be reworking this code completely in the
> not-too-distant future since we really need to be looking things up from
> the GMD_ID query rather than using hardcoded device ID table lookups,
> but for now we can apply a quick fix of just making 'cached_devid' and
> 'cache' thread-local so that we're not racing on a shared copy. Simply
> swapping the order of the assignment of 'cache' and 'cached_devid'
> probably would have also worked around the specific race condition noted
> above, but it would have left tests susceptible to other race conditions
> in cases where a multi-threaded test is actually looking up different
> device IDs / GPUs in its multiple threads.
>
> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1600
> References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1535
> References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3236
> Signed-off-by: Matt Roper <matthew.d.roper at intel.com>
> ---
> lib/intel_device_info.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/lib/intel_device_info.c b/lib/intel_device_info.c
> index 546b9c65a..d38dc415d 100644
> --- a/lib/intel_device_info.c
> +++ b/lib/intel_device_info.c
> @@ -658,8 +658,8 @@ static const struct pci_id_match intel_device_match[] = {
> */
> const struct intel_device_info *intel_get_device_info(uint16_t devid)
> {
> - static const struct intel_device_info *cache = &intel_generic_info;
> - static uint16_t cached_devid;
> + static __thread const struct intel_device_info *cache = &intel_generic_info;
> + static __thread uint16_t cached_devid;
Makes sense, parallel access to same area when more threads may
overwrite this while others are reading is racy:
Reviewed-by: Zbigniew Kempczyński <zbigniew.kempczynski at intel.com>
--
Zbigniew
> int i;
>
> if (cached_devid == devid)
> --
> 2.47.0
>
More information about the igt-dev
mailing list