[RFC PATCH v2 10/17] WIP: gpu: host1x: Add no-recovery mode

Sat Sep 12 13:31:32 UTC 2020

On 9/12/20 3:53 PM, Dmitry Osipenko wrote:
> 12.09.2020 01:11, Mikko Perttunen пишет:
>> On 9/11/20 7:40 PM, Dmitry Osipenko wrote:
>>> 05.09.2020 13:34, Mikko Perttunen пишет:
>>>> +    } else {
>>>> +        struct host1x_job *failed_job = job;
>>>> +
>>>> +        host1x_job_dump(dev, job);
>>>> +
>>>> +        host1x_syncpt_set_locked(job->syncpt);
>>>> +        failed_job->cancelled = true;
>>>> +
>>>> +        list_for_each_entry_continue(job, &cdma->sync_queue, list) {
>>>> +            unsigned int i;
>>>> +
>>>> +            if (job->syncpt != failed_job->syncpt)
>>>> +                continue;
>>>> +
>>>> +            for (i = 0; i < job->num_slots; i++) {
>>>> +                unsigned int slot = (job->first_get/8 + i) %
>>>> +                            HOST1X_PUSHBUFFER_SLOTS;
>>>> +                u32 *mapped = cdma->push_buffer.mapped;
>>>> +
>>>> +                mapped[2*slot+0] = 0x1bad0000;
>>>> +                mapped[2*slot+1] = 0x1bad0000;
>>>
>>> The 0x1bad0000 is a valid memory address on Tegra20.
>>>
>>> The 0x60000000 is invalid phys address for all hardware generations.
>>> It's used by grate-kernel [1] and VDE driver [2]. Note that the 0x6 <<
>>> 28 is also invalid Host1x opcode, while 0x1 should break CDMA parser
>>> during of PB debug-dumping.
>>>
>>> [1]
>>> https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/gem.h#L16
>>>
>>>
>>> [2]
>>> https://elixir.bootlin.com/linux/v5.9-rc4/source/drivers/staging/media/tegra-vde/iommu.c#L99
>>>
>>>
>>> The VDE driver reserves the trapping IOVA addresses, I assume the Host1x
>>> driver should do the same.
>>>
>>
>> The 0x1bad0000's are not intended to be memory addresses, they are NOOP
>> opcodes (INCR of 0 words to offset 0xbad). I'll fix this to use proper
>> functions to construct the opcodes and add some comments. These need to
>> be NOOP opcodes so the command parser skips over these "cancelled" jobs
>> when the channel is resumed.
>>
>> BTW, 0x60000000 is valid on Tegra194 and later.
> 
> At a quick glance it looked like a memory address :)

It does look a bit like one :) I'll add a comment to make it clear.

> 
> I'm now taking a closer look at this patch and it raises some more
> questions, like for example by looking at the "On job timeout, we stop
> the channel, NOP all future jobs on the channel using the same syncpoint
> ..." through the prism of grate-kernel experience, I'm not sure how it
> could co-exist with the drm-scheduler and why it's needed at all. But I
> think we need a feature-complete version (at least a rough version), so
> that we could start the testing, and then it should be easier to review
> and discuss such things.
> 

The reason this is needed is that if a job times out and we don't do its 
syncpoint increments on the CPU, then a successive job incrementing that 
same syncpoint would cause fences to signal incorrectly. The job that 
was supposed to signal those fences didn't actually run; and any data 
those fences were protecting would still be garbage.