[PATCH] drm/etnaviv: during dump, read registers twice
Derek Foreman
derek.foreman at collabora.com
Fri May 3 17:54:53 UTC 2024
Hi Lucas,
On 2024-05-03 12:12, Lucas Stach wrote:
> Hi Derek,
>
> Am Freitag, dem 03.05.2024 um 09:27 -0500 schrieb Derek Foreman:
>> I have a follow up question related to this:
>>
>> In etnaviv_sched_timedoutjob we only read this register once - we should
>> probably read twice there as well?
>>
> The register read being unreliable seems to be a specialty of the GPU
> you are dealing with. I'm not saying that making this more robust is
> the wrong thing to do, but I've seen quite a few GPU hang issues over
> time and the reported DMA address always made sense on the GPUs I've
> seen so far.
Yes, this is a GC7000 rev 6009, and it seems to occasionally return very
exciting values on the first read.
An example:
0000065c = 0000c1f8
00000660 = 00000001 Cmd: [dec DMA: idle Fetch: idle] Req idle Cal idle
00000664 = 00000905 Command DMA address
00000668 = 0000c1f8 FE fetched word 0
0000066c = 080105d0 FE fetched word 1
00000670 = feb71000
Some of the values are shifted by one register. 905 should have been in
660, c1f8 should have been in 664, and the state load should have been
in 668 instead of 66c. Very confusing.
I can only assume this weirdness is why the gpu-viv driver
unconditionally reads the DMA registers twice every time. I expect the
extra read is pointless for most of the registers in the dump, but I
was lazy - would you prefer I limit the extra read to the DMA registers
only? It also looks like we use these registers in other places
(verify_dma, etnaviv_gpu_debugfs). Should I add the double read there as
well? A rough sketch of what I had in mind follows below.
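For the DMA-only variant, I was picturing a small helper along these
lines. This is only a sketch; gpu_read_dma() is a name I'm making up
here, it doesn't exist in the driver today:

/*
 * Hypothetical helper: read a register twice and keep the second value,
 * since the first read of the FE DMA registers can be stale on this
 * core (GC7000 rev 6009). gpu_read() is the existing MMIO accessor.
 */
static inline u32 gpu_read_dma(struct etnaviv_gpu *gpu, u32 reg)
{
	gpu_read(gpu, reg);	/* discard the first, possibly bogus, value */
	return gpu_read(gpu, reg);
}

verify_dma() and the debugfs dump could then switch to
gpu_read_dma(gpu, VIVS_FE_DMA_ADDRESS) and friends, while the rest of
the register dump keeps using plain gpu_read().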
>> Further, that function appears to always treat the first timeout as
>> spurious because gpu->hangcheck_dma_addr starts as 0? The Vivante
>> driver's hang check spins briefly to see if forward progress is being
>> made, would that be a better thing to do there?
>>
> Briefly spinning will yield many false positives, as the DMA frontend
> address doesn't move during large draws. The timeout we are using is
> short enough that in most cases it's okay that we need to hit the
> timeout twice before we mark the GPU as hung, but also long enough that
> it doesn't get confused by large draws.
Fair enough - I won't bother sending a patch to change that behavior.
I do wonder, though, whether setting the timeout to twice the length
instead of requiring it to trigger twice would be more intuitive; my
rough paraphrase of the current check is below.
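For reference, this is how I read the progress check in
etnaviv_sched_timedout_job() - a paraphrase, not the exact upstream
code:

/*
 * Paraphrased sketch, not the real function body: because
 * gpu->hangcheck_dma_addr starts out as 0, the first timeout nearly
 * always looks like forward progress and is skipped, so a hang is only
 * declared when the FE DMA address is stuck at the same value across
 * two consecutive timeouts.
 */
u32 dma_addr = gpu_read(gpu, VIVS_FE_DMA_ADDRESS);

if (dma_addr != gpu->hangcheck_dma_addr) {
	/* FE moved (or this is the first timeout): assume a long draw. */
	gpu->hangcheck_dma_addr = dma_addr;
	return DRM_GPU_SCHED_STAT_NOMINAL;
}

/* Same address on two consecutive timeouts: treat the GPU as hung. */

If I'm reading it right, both schemes give a stuck FE roughly the same
wall-clock budget before recovery kicks in, so this is purely about
readability.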
Thanks,
Derek
> Regards,
> Lucas
>
>> Thanks,
>> Derek
>>
>> On 2024-05-03 09:22, Derek Foreman wrote:
>>> The Vivante driver always reads the DMA registers twice and discards the
>>> first value - we need to do this too; otherwise at least the DMA address
>>> and low/high fetch registers can return wrong values.
>>>
>>> Signed-off-by: Derek Foreman <derek.foreman at collabora.com>
>>> ---
>>> drivers/gpu/drm/etnaviv/etnaviv_dump.c | 2 ++
>>> 1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_dump.c b/drivers/gpu/drm/etnaviv/etnaviv_dump.c
>>> index 898f84a0fc30..8a8ca8dcc49a 100644
>>> --- a/drivers/gpu/drm/etnaviv/etnaviv_dump.c
>>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_dump.c
>>> @@ -91,6 +91,8 @@ static void etnaviv_core_dump_registers(struct core_dump_iterator *iter,
>>>  		    read_addr <= VIVS_PM_PULSE_EATER)
>>>  			read_addr = gpu_fix_power_address(gpu, read_addr);
>>>  		reg->reg = cpu_to_le32(etnaviv_dump_registers[i]);
>>> +		/* Discard first read, as it is frequently inaccurate */
>>> +		gpu_read(gpu, read_addr);
>>>  		reg->value = cpu_to_le32(gpu_read(gpu, read_addr));
>>>  	}
>>>