[PATCH 1/2] drm/amdgpu: make duplicated EOP packet for GFX7/8 have real content

Mon Jun 17 14:42:26 UTC 2024

On 17.06.24 at 16:30, Icenowy Zheng wrote:
> On Monday, 2024-06-17 at 15:59 +0200, Christian König wrote:
>> On 17.06.24 at 15:43, Icenowy Zheng wrote:
>>> On Monday, 2024-06-17 at 15:09 +0200, Christian König wrote:
>>>> On 17.06.24 at 15:03, Icenowy Zheng wrote:
>>>>> On Monday, 2024-06-17 at 14:35 +0200, Christian König wrote:
>>>>>> On 17.06.24 at 12:58, Icenowy Zheng wrote:
>>>>>>> The duplication of EOP packets for GFX7/8, with the former one
>>>>>>> writing seq-1 and the latter one writing seq, seems to confuse
>>>>>>> some hardware platforms (e.g. Loongson 7A series PCIe
>>>>>>> controllers).
>>>>>>>
>>>>>>> Make the content of the duplicated EOP packet the same as that
>>>>>>> of the real one, only masking any possible interrupts.
>>>>>> Well, a complete NAK to that; that is exactly what disables the
>>>>>> workaround.
>>>>>>
>>>>>> The CPU needs to see two different values written here.
>>>>> Why does the CPU need to see two different values here? Only the
>>>>> second packet will raise an interrupt, both before and after
>>>>> applying this patch, and the first packet's result should just be
>>>>> overwritten on ordinary platforms. The CPU won't see the first one
>>>>> unless it is polling the address at a very short interval, so
>>>>> short that the GPU CP couldn't execute 2 commands.
>>>> Yes, exactly that. We need to make two writes, one with the old
>>>> value (seq - 1) and a second with the real value (seq).
>>>>
>>>> Otherwise it is possible that a polling CPU would see the sequence
>>>> before the second EOP is issued, which results in an incoherent
>>>> view of memory.
>>> In this case shouldn't we write seq-1 before any work, and then
>>> write seq after the work, like what is done in Mesa?
>> No. This hw workaround requires that two consecutive write operations
>> happen directly behind each other on the PCIe bus with two different
>> values.
> Well to be honest the workaround code in Mesa seems to not be working
> in this way ...

Mesa doesn't have any workaround for that hw issue, the code there uses 
a quite different approach.

>> To make the software logic around that work without any changes we
>> use the values seq - 1 and seq because those are guaranteed to be
>> different and do not trigger any unwanted software behavior.
>>
>> Only then can we guarantee that we have a coherent view of system
>> memory.
> Any more details about it?

No, sorry. All I know is that it's a bug in the cache flush logic which
can be worked around by issuing two writes directly behind each other to
the same location.
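
To illustrate the idea only, here is a minimal sketch in plain C. It does
not use the real amdgpu ring functions or the real packet layout; the
names eop_write, fake_ring, emit_eop, emit_fence_with_workaround and
fence_signaled are all hypothetical. It just models the two writes
(seq - 1 without an interrupt, then seq with one) and shows why a
polling CPU is not confused by the first value:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one end-of-pipe fence write (not the real packet). */
struct eop_write {
    uint64_t addr;   /* fence address the GPU writes to */
    uint32_t value;  /* value written to that address */
    bool raise_irq;  /* whether this write also raises an interrupt */
};

#define MAX_WRITES 8

/* Hypothetical stand-in for a ring buffer: it only records the writes. */
struct fake_ring {
    struct eop_write writes[MAX_WRITES];
    unsigned int count;
};

static void emit_eop(struct fake_ring *ring, uint64_t addr,
                     uint32_t value, bool raise_irq)
{
    assert(ring->count < MAX_WRITES);
    ring->writes[ring->count++] =
        (struct eop_write){ .addr = addr, .value = value, .raise_irq = raise_irq };
}

/*
 * Two back-to-back writes to the same address with different values:
 * first the old value (seq - 1) with interrupts masked, then the real
 * value (seq) with the interrupt enabled.
 */
static void emit_fence_with_workaround(struct fake_ring *ring,
                                       uint64_t addr, uint32_t seq)
{
    emit_eop(ring, addr, seq - 1, false);
    emit_eop(ring, addr, seq, true);
}

/*
 * Wrap-safe check a polling CPU might use (an assumption of this sketch):
 * the fence counts as signaled only once the value seen has reached seq,
 * so observing seq - 1 changes nothing for the software.
 */
static bool fence_signaled(uint32_t seen, uint32_t seq)
{
    return (int32_t)(seen - seq) >= 0;
}
```

Calling emit_fence_with_workaround() once on an empty fake_ring records
exactly two writes to the same address; fence_signaled() stays false for
the first value and becomes true only for the second.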

> BTW in this case, could I try to write it 3 times instead of 2,
> with seq-1, seq and seq?

That could potentially work as well, but at some point we would need to 
increase the EOP ring buffer size or could run into performance issues.

>>> From what I see, Mesa uses another command buffer to emit an
>>> EVENT_WRITE_EOP writing 0, and submits this command buffer before
>>> the real command buffer.
>>>
>>>>> Or do you mean the GPU needs to see two different values being
>>>>> written, or they will be merged into only one write request?
>>>>>
>>>>> Please give more information about this workaround, otherwise the
>>>>> GPU hang problem on Loongson platforms will persist.
>>>> Well if Loongson can't handle two consecutive write operations to
>>>> the same address with different values then you have a massive
>>>> platform bug.
>>> I think the issue is triggered when two consecutive write
>>> operations and one IRQ are present, which is exactly the case in
>>> this function.
>> Well then you have a massive platform bug.
>>
>> Two consecutive writes to the same bus address are perfectly legal
>> according to the PCIe specification and can happen all the time,
>> even without this specific hw workaround.
> Yes, I know, and I am not from Loongson, just some user trying to
> mess around with it.

Well to be honest on a platform where even two consecutive writes to the
same location don't work I would have strong doubts that it is stable
in general.

Regards,
Christian.