[PATCH] drm/xe/dg2: Fix number of mocs entries

Thu Feb 22 21:53:56 UTC 2024

On Thu, Feb 22, 2024 at 01:16:39PM -0800, Matt Roper wrote:
>On Thu, Feb 22, 2024 at 12:42:03PM -0800, Lucas De Marchi wrote:
>> While investigating why xe_mocs kunit was failing for DG2, Matt Roper
>> noticed that LNCFCMOCS[31] is actually RO for DG2. It would be a spec
>> bug to document a register to be programmed that is actually RO. However
>> upon further inspection, it seems like we should not be programming the
>> other values. Early DG2 SKUs documented that the entire set of
>> LNCFCMOCS registers needed to be programmed, but this is not true
>> anymore.
>
>Wait...it's not the bspec that required us to program every index, but
>rather the Linux-specific need to ensure we don't break ABI in the
>future.  Although rare, it's consisdered legal for the hardware team to
>add additional documented MOCS index values to the table as long as they
>aren't changing/removing any of the pre-existing values.  If we have
>(buggy) userspace today that's already using an undocumented MOCS index
>and operating properly (due to luck), adding the new "official" settings
>based on a future bspec update could regress userspace if those new
>settings do something like reduce the coherency.  So to guard against
>this, we always need to program every undocumented index to maximum
>caching, minimum coherency.  That way, if [buggy] userspace can operate
>properly with those settings, it should continue to operate in a
>functionally correct manner if the index suddenly becomes more coherent
>in a future spec update.

from checking past platforms and the updates I was actually under the
impression that the table  in e.g. bspec 45101 would always document the
possible indexes the HW can use, leaving holes to complete it. And those
*holes*  would be the ones to change for future use.

from what you said above, it seems it's not really the case?

>
>>
>> Do like is done in MTL: the additional registers are there, but we are
>> not supposed to program them, not even with the unused_index value.
>
>It looks like this was actually a mistake on MTL that was overlooked;
>the bspec documents only 16 entries, but the hardware supports 32, so we
>should be explicitly programming all the unused ones as mentioned above
>to protect against future table updates.
>
>
>At this point I think we're far enough along that it's extremely
>unlikely these MOCS tables for Xe1 platforms like DG2 and MTL are going
>to have any further spec additions, so this MTL mistake wound up not
>being a big problem for us.  But replacing XELP_NUM_MOCS_ENTRIES with
>XEHP_NUM_MOCS_ENTRIES on DG2 is going to extend that mistake farther and
>make it more likely for someone to copy/paste the same logic in future
>platforms.
>
>The fact that DG2's entry #63 is RO and can't be reprogrammed by the KMD
>also means that it's 100% impossible for a future bspec update to
>request different values for it.  So whether or not we try to write the
>"unused entry" value into it doesn't really matter (it's easiest to just
>keep doing so, especially due to the way two MOCS indices are packed
>into the same register), but the test should be updated to not try to
>verify that last entry.

what I don't like is special casing the test, because the driver tried
to write a WO value.

Would reducing the number of entries for DG2 like

	/*
	 * last entry is RO on hardware, don't bother with what was
	 * written when checking later
	 */
	info->n_entries = XELP_NUM_MOCS_ENTRIES - 1;

(also increasing it for MTL, for consistency),  and then making the test
do the right thing for odd number of entries be ok in your opinion?
get_entry_l3cc() already does the right thing, but we are not ignoring
the value read back.

Lucas De Marchi