[PATCH i-g-t] lib/intel_device_info: Use per-thread device info cache

Zbigniew Kempczyński zbigniew.kempczynski at intel.com
Thu Nov 21 17:13:52 UTC 2024


On Wed, Nov 20, 2024 at 02:35:46PM -0800, Matt Roper wrote:
> Relying on static variables to act as a cache inside the device info
> lookup function is racy for multi-threaded tests since there's only one
> copy of the static variables shared by all threads.  E.g.,
> 
>     Thread 1                Thread 2
>     ========                ========
>     cached_devid = devid;
>                             if (cached_devid == devid)
>                             return cache;
>     cache = ...
> 
> Indeed, running xe_exec_threads repeatedly turns up cases where the test
> fails with "Platform is missing PAT settings for uc/wt/wb" because a
> racing thread wound up getting 'intel_generic_info' rather than a proper
> device info structure (although it can take a couple thousand tries to
> hit this specific race condition).  CI also sees this from time to time,
> but the bugs often get closed as "cannot reproduce" because the failure
> rate is so low.
> 
> We're probably going to be reworking this code completely in the
> not-too-distant future since we really need to be looking things up from
> the GMD_ID query rather than using hardcoded device ID table lookups,
> but for now we can apply a quick fix of just making 'cached_devid' and
> 'cache' thread-local so that we're not racing on a  shared copy.  Simply
> swapping the order of the assignment of 'cache' and 'cached_devid'
> probably would have also worked around the specific race condition noted
> above, but it would have left tests susceptible to other race conditions
> in cases where a multi-threaded test is actually looking up different
> device IDs / GPUs in its multiple threads.
> 
> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1600
> References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1535
> References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3236
> Signed-off-by: Matt Roper <matthew.d.roper at intel.com>
> ---
>  lib/intel_device_info.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/intel_device_info.c b/lib/intel_device_info.c
> index 546b9c65a..d38dc415d 100644
> --- a/lib/intel_device_info.c
> +++ b/lib/intel_device_info.c
> @@ -658,8 +658,8 @@ static const struct pci_id_match intel_device_match[] = {
>   */
>  const struct intel_device_info *intel_get_device_info(uint16_t devid)
>  {
> -	static const struct intel_device_info *cache = &intel_generic_info;
> -	static uint16_t cached_devid;
> +	static __thread const struct intel_device_info *cache = &intel_generic_info;
> +	static __thread uint16_t cached_devid;

Makes sense, parallel access to same area when more threads may
overwrite this while others are reading is racy:

Reviewed-by: Zbigniew Kempczyński <zbigniew.kempczynski at intel.com>

--
Zbigniew

>  	int i;
>  
>  	if (cached_devid == devid)
> -- 
> 2.47.0
> 


More information about the igt-dev mailing list