[PATCH] drm/etnaviv: during dump, read registers twice

Derek Foreman derek.foreman at collabora.com
Fri May 3 18:47:46 UTC 2024


On 2024-05-03 13:44, Lucas Stach wrote:
> Am Freitag, dem 03.05.2024 um 12:54 -0500 schrieb Derek Foreman:
>>   Hi Lucas,
>>    
>> On 2024-05-03 12:12, Lucas Stach wrote:
>>    
>>> Hi Derek,
>>>
>>> Am Freitag, dem 03.05.2024 um 09:27 -0500 schrieb Derek Foreman:
>>>
>>>> I have a follow up question related to this:
>>>>
>>>> In etnaviv_sched_timedoutjob we only read this register once - we should
>>>> probably read twice there as well?
>>> The register read being unreliable seems to be a specialty of the GPU
>>> you are dealing with. I'm not saying that making this more robust is
>>> the wrong thing to do, but I've seen quite a few GPU hang issues over
>>> time and the reported DMA address always made sense on the GPUs I've
>>> seen so far.
>>>    
>> Yes, this is a GC7000 rev 6009, and it seems to occasionally return very exciting values on the first read.
>>    
>> An example:
>>   0000065c = 0000c1f8
>>   00000660 = 00000001 Cmd: [dec DMA: idle Fetch: idle] Req idle Cal idle
>>   00000664 = 00000905 Command DMA address
>>   00000668 = 0000c1f8 FE fetched word 0
>>   0000066c = 080105d0 FE fetched word 1
>>   00000670 = feb71000
>>   
>> Some of the values are shifted by one register. 905 should have been in 660, c1f8 should have been in 664, and the state load should have been in 668 instead of 66c. Very confusing.
>>    
>>   I can only assume this weirdness is why the gpu-viv driver unconditionally reads the DMA registers twice every time. I expect that the extra read is pointless on most of the registers in the dump, but I was lazy - would you prefer I limit the extra read to only DMA registers?
>>   
> Seems like someone built a clock-domain crossing issue into that
> specific hardware variant. Probably result being returned to the bus
> before the actual value has been fetched from the FE internal state.
>
>> It looks like we use these registers in other places (verify_dma, etnaviv_gpu_debugfs). Should I add this double read there as well?
>>
>>
> If there is a CDC issue in the hardware we should probably repeat all
> reads from the FE registers. This could simply be fixed up by matching
> the FE address range in gpu_read and having the double read contained
> in this function.
Yes - this seems like the right thing to do. I'll replace this patch 
shortly. Please disregard this one. :)
>>>> Further, that function appears to always treat the first timeout as
>>>> spurious because gpu->hangcheck_dma_addr starts as 0? The Vivante
>>>> driver's hang check spins briefly to see if forward progress is being
>>>> made, would that be a better thing to do there?
>>> Briefly spinning will yield many false positives, as the DMA frontend
>>> address doesn't move during large draws. The timeout we are using is
>>> short enough that in most cases it's okay that we need to hit the
>>> timeout twice until we mark the GPU as hung, but also long enough so it
>>> doesn't get confused by large draws.
>> Fair enough - I won't bother sending a patch to change that behavior.
>>
>> I do wonder if setting the timeout to be twice as long instead of requiring it to trigger twice would be more intuitive, though.
>>   
> It's not totally obvious, but we need the first timeout trigger to
> record the current FE address to see if the FE still makes progress
> inside the current job.
>
> Regards,
> Lucas
>> Thanks,
>>   Derek
>>   
>>
>>
>>
>>
>>
>>
>>
>>   
>>
>>
>>
>>
>>
>>
>>
>>
>>   
>>
>>
>>
>>
>>
>>
>>
>>   
>>
>>
>>
>>
>>
>>
>>
>>>   
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Regards,
>>> Lucas
>>>
>>>   
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>>   
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Thanks,
>>>> Derek
>>>>
>>>> On 2024-05-03 09:22, Derek Foreman wrote:
>>>>   
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>   
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> The vivante driver always reads dma registers twice and discards the first
>>>>> value - we need to do this too or at least the DMA address and low/high
>>>>> fetches can return wrong results.
>>>>>
>>>>> Signed-off-by: Derek Foreman<derek.foreman at collabora.com>
>>>>> ---
>>>>>    drivers/gpu/drm/etnaviv/etnaviv_dump.c | 2 ++
>>>>>    1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_dump.c b/drivers/gpu/drm/etnaviv/etnaviv_dump.c
>>>>> index 898f84a0fc30..8a8ca8dcc49a 100644
>>>>> --- a/drivers/gpu/drm/etnaviv/etnaviv_dump.c
>>>>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_dump.c
>>>>> @@ -91,6 +91,8 @@ static void etnaviv_core_dump_registers(struct core_dump_iterator *iter,
>>>>>    		    read_addr <= VIVS_PM_PULSE_EATER)
>>>>>    			read_addr = gpu_fix_power_address(gpu, read_addr);
>>>>>    		reg->reg = cpu_to_le32(etnaviv_dump_registers[i]);
>>>>> +		/* Discard first read, as it is frequently inaccurate */
>>>>> +		gpu_read(gpu, read_addr);
>>>>>    		reg->value = cpu_to_le32(gpu_read(gpu, read_addr));
>>>>>    	}
>>>>>    
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>   
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>   
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>    
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>   
>>
>>
>>
>>
>>
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/etnaviv/attachments/20240503/4990dfa9/attachment.htm>


More information about the etnaviv mailing list