[Intel-xe] [PATCH v2] drm/xe/mmio: update gt_count when probing multi-tile

Ofir Bitton obitton at habana.ai
Thu Jul 13 13:55:36 UTC 2023


On 11/07/2023 12:15, Matthew Auld wrote:
> On 11/07/2023 07:54, Ofir Bitton wrote:
>> On 05/07/2023 13:19, Matthew Auld wrote:
>>> Hi,
>>>
>>> On 05/07/2023 10:50, Ofir Bitton wrote:
>>>> On 26/06/2023 20:20, Matthew Auld wrote:
>>>>> It looks like the single-tile PVC in CI dies during module load when
>>>>> doing
>>>>> the pcode init. From the logs we try to access the address
>>>>> 0000000000138124 which doesn't map to anything, however 0x138124 also
>>>>> looks to be the PCODE_MAILBOX register. So looks like the per-tile
>>>>> mmio register mapping is NULL.
>>>>>
>>>>> During probe the tile count is potentially trimmed, since we don't
>>>>> know
>>>>> the real count until we actually probe the device. This seems to be
>>>>> the case for single-tile PVC or similar devices.  However it looks
>>>>> like
>>>>> the gt_count is never adjusted to respect this updated tile count.
>>>>> As a
>>>>> result when later doing some for_each_gt() loop, like we do for the
>>>>> pcode, we can get back some GT that maps to some non-existent tile
>>>>> which hasn't been properly set up, leading to crashes.
>>>>>
>>>>> Try to fix this by adjusting the gt_count after probing the tiles for
>>>>> real.
>>>>>
>>>>> v2: Fix typo so it actually builds
>>>>>
>>>>> Closes:
>>>>> https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/383
>>>>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
>>>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>>>> Cc: Matt Roper <matthew.d.roper at intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/xe/xe_mmio.c | 6 ++++++
>>>>>     1 file changed, 6 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_mmio.c
>>>>> b/drivers/gpu/drm/xe/xe_mmio.c
>>>>> index f1336803b915..8150bf6f3578 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_mmio.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_mmio.c
>>>>> @@ -334,6 +334,12 @@ static void xe_mmio_probe_tiles(struct xe_device
>>>>> *xe)
>>>>>         adj_tile_count = xe->info.tile_count =
>>>>>                 REG_FIELD_GET(TILE_COUNT, mtcfg) + 1;
>>>>>
>>>>> +     /*
>>>>> +      * FIXME: Needs some work for standalone media, but should be
>>>>> impossible
>>>>> +      * with multi-tile for now.
>>>>> +      */
>>>>> +     xe->info.gt_count = xe->info.tile_count;
>>>>> +
>>>>>         drm_info(&xe->drm, "tile_count: %d, adj_tile_count %d\n",
>>>>>                  xe->info.tile_count, adj_tile_count);
>>>>>
>>>>
>>>> 'xe->info.gt_count' is getting incremented in 'xe_info_init' during
>>>> probe, seems you must remove it from there, or else 'gt_count' will
>>>> eventually be equal to x2 of the desired value.
>>>
>>> xe_info_init() looks to be called first. But this is very early during
>>> probe and might not accurately represent the actual HW, so some of it
>>> appears to be "placeholder" only, like with tile_count. Once we
>>> eventually initialise enough of the MMIO stuff, we can actually probe
>>> the HW to figure out the real number of tiles, as per
>>> xe_mmio_probe_tiles() here. At this stage we might need to update
>>> tile_count, but we missed also updating the gt_count, leading to
>>> explosions if the real vs placeholder tile_count is ever different.
>>>
>>
>> I see, so do you see any reason to keep 'xe->info.gt_count'
>> initialization in 'xe_info_init'? anyway you set it here and it is not
>> used in between.
>
> AFAICT xe_info_init() turns the static description (graphics_desc +
> media_desc) of the platform into an xe_info struct, which gets handed to
> the next phase of driver load, and in the case of tile_count it is just
> a placeholder since that information is not static. There looks to be
> various things like media_desc which can also influence the number of
> GTs. I'm not sure where that should instead be handled?
>
> One idea is to very carefully probe the tile_count during
> xe_info_init(), sort of how handle_gmdid() does things. But I'm not
> certain if we are even allowed to do that this early.
>
> Adjusting gt_count when we later adjust tile_count looked to be the
> minimal fix here, but happy to try to some other ideas.
>

ok, I see many occurrences where xe->info is overwritten here, this will
requires some more thorough work, meanwhile I think we can go ahead with
this patch.

Reviewed-by: Ofir Bitton <obitton at habana.ai>
>>
>>>>
>>>> --
>>>> Ofir
>>



More information about the Intel-xe mailing list