[Intel-xe] [PATCH v2] drm/xe/mmio: update gt_count when probing multi-tile
Matthew Auld
matthew.auld at intel.com
Tue Jul 11 09:15:54 UTC 2023
On 11/07/2023 07:54, Ofir Bitton wrote:
> On 05/07/2023 13:19, Matthew Auld wrote:
>> Hi,
>>
>> On 05/07/2023 10:50, Ofir Bitton wrote:
>>> On 26/06/2023 20:20, Matthew Auld wrote:
>>>> It looks like the single-tile PVC in CI dies during module load when
>>>> doing
>>>> the pcode init. From the logs we try to access the address
>>>> 0000000000138124 which doesn't map to anything, however 0x138124 also
>>>> looks to be the PCODE_MAILBOX register. So looks like the per-tile
>>>> mmio register mapping is NULL.
>>>>
>>>> During probe the tile count is potentially trimmed, since we don't know
>>>> the real count until we actually probe the device. This seems to be
>>>> the case for single-tile PVC or similar devices. However it looks like
>>>> the gt_count is never adjusted to respect this updated tile count. As a
>>>> result when later doing some for_each_gt() loop, like we do for the
>>>> pcode, we can get back some GT that maps to some non-existent tile
>>>> which hasn't been properly set up, leading to crashes.
>>>>
>>>> Try to fix this by adjusting the gt_count after probing the tiles for
>>>> real.
>>>>
>>>> v2: Fix typo so it actually builds
>>>>
>>>> Closes:
>>>> https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/383
>>>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
>>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>>> Cc: Matt Roper <matthew.d.roper at intel.com>
>>>> ---
>>>> drivers/gpu/drm/xe/xe_mmio.c | 6 ++++++
>>>> 1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
>>>> index f1336803b915..8150bf6f3578 100644
>>>> --- a/drivers/gpu/drm/xe/xe_mmio.c
>>>> +++ b/drivers/gpu/drm/xe/xe_mmio.c
>>>> @@ -334,6 +334,12 @@ static void xe_mmio_probe_tiles(struct xe_device
>>>> *xe)
>>>> adj_tile_count = xe->info.tile_count =
>>>> REG_FIELD_GET(TILE_COUNT, mtcfg) + 1;
>>>>
>>>> + /*
>>>> + * FIXME: Needs some work for standalone media, but should be
>>>> impossible
>>>> + * with multi-tile for now.
>>>> + */
>>>> + xe->info.gt_count = xe->info.tile_count;
>>>> +
>>>> drm_info(&xe->drm, "tile_count: %d, adj_tile_count %d\n",
>>>> xe->info.tile_count, adj_tile_count);
>>>>
>>>
>>> 'xe->info.gt_count' is getting incremented in 'xe_info_init' during
>>> probe, seems you must remove it from there, or else 'gt_count' will
>>> eventually be equal to x2 of the desired value.
>>
>> xe_info_init() looks to be called first. But this is very early during
>> probe and might not accurately represent the actual HW, so some of it
>> appears to be "placeholder" only, like with tile_count. Once we
>> eventually initialise enough of the MMIO stuff, we can actually probe
>> the HW to figure out the real number of tiles, as per
>> xe_mmio_probe_tiles() here. At this stage we might need to update
>> tile_count, but we missed also updating the gt_count, leading to
>> explosions if the real vs placeholder tile_count is ever different.
>>
>
> I see, so do you see any reason to keep 'xe->info.gt_count'
> initialization in 'xe_info_init'? anyway you set it here and it is not
> used in between.
AFAICT xe_info_init() turns the static description (graphics_desc +
media_desc) of the platform into an xe_info struct, which gets handed to
the next phase of driver load, and in the case of tile_count it is just
a placeholder since that information is not static. There looks to be
various things like media_desc which can also influence the number of
GTs. I'm not sure where that should instead be handled?
One idea is to very carefully probe the tile_count during
xe_info_init(), sort of how handle_gmdid() does things. But I'm not
certain if we are even allowed to do that this early.
Adjusting gt_count when we later adjust tile_count looked to be the
minimal fix here, but happy to try to some other ideas.
>
>>>
>>> --
>>> Ofir
>
More information about the Intel-xe
mailing list