[PATCH 1/2] drm/amdgpu: make duplicated EOP packet for GFX7/8 have real content

Mon Jun 17 15:07:14 UTC 2024

On 17.06.24 at 16:57, Icenowy Zheng wrote:
> On Mon, 2024-06-17 at 16:42 +0200, Christian König wrote:
>> On 17.06.24 at 16:30, Icenowy Zheng wrote:
>>> On Mon, 2024-06-17 at 15:59 +0200, Christian König wrote:
>>>> On 17.06.24 at 15:43, Icenowy Zheng wrote:
>>>>> On Mon, 2024-06-17 at 15:09 +0200, Christian König wrote:
>>>>>> ...
>>>>> In this case, shouldn't we write seq-1 before any work, and then
>>>>> write seq after the work, like what is done in Mesa?
>>>> No. This hw workaround requires that two consecutive write
>>>> operations happen directly behind each other on the PCIe bus with
>>>> two different values.
>>> Well, to be honest, the workaround code in Mesa does not seem to
>>> work this way ...
>> Mesa doesn't have any workaround for that hw issue; the code there
>> uses a quite different approach.
> Ah? Commit bf26da927a1c ("drm/amdgpu: add cache flush workaround to
> gfx8 emit_fence") says "Both PAL and Mesa use it for gfx8 too, so port
> this commit to gfx_v8_0_ring_emit_fence_gfx", so maybe the workaround
> is simply not necessary here?

What I meant was that Mesa doesn't have a hack like writing seq - 1 and
then seq.

I haven't checked the code, but as far as I know it uses a different
approach with 64-bit values.

>>>> To make the software logic around that work without any changes,
>>>> we use the values seq - 1 and seq, because those are guaranteed to
>>>> be different and do not trigger any unwanted software behavior.
>>>>
>>>> Only then can we guarantee that we have a coherent view of system
>>>> memory.
>>> Any more details about it?
>> No, sorry. All I know is that it's a bug in the cache flush logic
>> which can be worked around by issuing two writes behind each other
>> to the same location.
> So the issue is that the first EOP write does not properly flush the
> cache? Could EVENT_WRITE be used instead of EVENT_WRITE_EOP in this
> workaround to properly flush it without hurting the fence value?

No, EVENT_WRITE is executed at a different time in the pipeline.

>>> ...
>> Well, to be honest, on a platform where even two consecutive writes
>> to the same location don't work, I would have strong doubts that it
>> is stable in general.
> Well, I think the current situation is that the IRQ triggered by the
> second EOP packet arrives before the second write is finished, not
> that the second write is dropped entirely.
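For illustration, the seq - 1 / seq workaround and the race described above can be modeled in a few lines of C. This is a minimal sketch, not the amdgpu implementation: `struct eop_write`, `emit_fence_pair()` and `fence_signaled()` are hypothetical names, and the wrap-around-safe comparison is an assumption about how a fence consumer might interpret the value it reads back.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one EOP memory write: destination and value. */
struct eop_write {
    uint64_t addr;
    uint32_t value;
};

/*
 * Hypothetical: emit the fence as two back-to-back EOP writes to the
 * same address, first seq - 1 and then seq, so the two values always
 * differ (seq - 1 wraps to 0xFFFFFFFF when seq is 0).
 * Returns the number of writes stored in out[] (always 2).
 */
static size_t emit_fence_pair(struct eop_write out[2], uint64_t addr,
                              uint32_t seq)
{
    out[0] = (struct eop_write){ .addr = addr, .value = seq - 1 };
    out[1] = (struct eop_write){ .addr = addr, .value = seq };
    return 2;
}

/*
 * Hypothetical wrap-around-safe check: fence `seq` counts as signaled
 * once the value read back from memory is at or past it.
 */
static int fence_signaled(uint32_t read_back, uint32_t seq)
{
    return (int32_t)(read_back - seq) >= 0;
}
```

In this model, if the interrupt for the second write is handled while memory still holds `seq - 1` (the ordering reported above), `fence_signaled(seq - 1, seq)` returns 0 and fence `seq` is only noticed on a later interrupt. The first value never signals fence `seq` by itself, which is what makes it a harmless placeholder.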

Well, that sounds like the usual reordering problems we have seen
patches for on Loongson multiple times now.

And I can only repeat what I've written before: We don't accept
workarounds in drivers for problems caused by severe platform issues.

Especially when that is clearly against any PCIe specification.

Regards,
Christian.

>
>> Regards,
>> Christian.