[PATCH] drm/amdgpu: Adding wait time before reading upll control register

Mon Jul 6 12:59:17 UTC 2020

Am 06.07.20 um 14:42 schrieb Luben Tuikov:
> On 2020-06-30 3:01 a.m., Christian König wrote:
>> Am 30.06.20 um 00:46 schrieb Luben Tuikov:
>>> On 2020-06-26 1:04 p.m., Christian König wrote:
>>>> Am 26.06.20 um 18:12 schrieb Alex Jivin:
>>>> [SNIP] 
>>>> Adding a delay here is probably just postponing that. Do we have some
>>>> note in the documentation that this is necessary?
>>> Flushing the write buffer (by issuing a read if necessary) is different
>>> than waiting for the hardware to process the request.
>> It's not flushing the write buffer. The hardware is often build in a way
>> that the next read is triggering the action instead of the write.
> I don't think that the "read is triggering the action", in this case.
> The write is triggering the action, when it makes it to the hardware.
>
> Also, often enough, the coherency unit is built with dependency detection.
> This means that a read following a write to the same "hardware" address,
> would cause the read op to trigger the flushing of the write buffer
> if the address is still in the write buffer. And then the read
> would be issued, following the flush, usually by the bus unit (BU).
> (So the bus would show a bunch of WRITES followed by our READ.)
> You seem to agree as you said that in (A) above.

Yeah, but this is not what I meant here.

See a bit below for more description on what our hardware often does.

>> That's done because writes should be accepted by the hardware block
>> immediately while reads have a timeout associated with it.
> That's is true but irrelevant here.

No it isn't. See the read of the register acts as a delay as well.

The read doesn't comes back to the CPU immediately when there is an 
ongoing action to wait for.

Some of the SDMA, TC and HDP registers even go as far as delaying the 
return of the read until some memory is flushed.

The maximum timeout here is rather large, something in the range of 65ms 
IIRC.

> As the code is present right now in the kernel driver,
> if the hardware changes state on the last delay of 10 ms,
> you're going to miss that as no checks are done after the delay.
> The code clearly shows this.
>
> All I'm saying is that to avoid this "race" condition,
> the code should be modified, as there is no point in delaying 10
> ms only to quit after that, when count gets to equal MAX_COUNT.

Well, that is indeed a good point!

>>> The current code does flush the write buffer by reading back,
>>> and then delaying (both in the loop), which does achieve this,
>>> but it leaves a corner case as I wrote in my review.
>>> The corner case is that if the status
>>> register changes to "done" while in the last "delay()"
>>> we then check the loop invariant and exit, as opposed to reading
>>> the register and determining that it is done successfully.
>>>
>>> Current, all in pseudo-code:
>>>
>>> write()
>>>
>>> for (count = 0; count < MAX_COUNT; count++) {
>>> 	res = read()
>>> 	if (res is success) break
>>> 	mdelay(10)              <-- if it changes here, we miss it on the last iteration
>>> }
>>>
>>> Optimal:
>>>
>>> write()   ; assume write buffer flush
>>> mdelay(9)
>>> do {
>>> 	mdelay(1)
>>> 	res = read()
>>> } while (res != success && ++count < MAX_COUNT)
>>>
>>> This solves the corner case, and ensures a time delay of 10 for
>>> the hardware to process its job, but a delay of 1 for polling
>>> the status, as it could be done "anytime now."
>> That seems to assume that the ACK bits are not immediately cleared on
>> the write. How do you got to that conclusion?
> No such assumption was taken.
> This is about the wasted 10 ms delay
> as pointed to by the current code quoted above; the comment
> on the line with an arrow in it.

Yeah, now I understand what you are trying to do here. Problem is that 
this messes up the read delay feature of the hardware with that suggestion.

In other words we should not have is a delay between the initial write 
and the first read since the hardware will take care of this for us.

Regards,
Christian.

>
>> Processing the request can take even longer than 10ms depending on what
>> is done, which reference clock is selected, the temperature of the
>> hardware, sleep mode etc etc...
> That's why the loop.
>
> [snip!]
>>> Assuming that WREG32_P() flushes the write buffer, as it seems
>>> that it does, the idea here is to give the hardware some time to process
>>> the request (from writing a value to it), but when polling to poll
>>> a shorter amount of time.
>> Not a bad idea, but this is based on the diagnostic code and already
>> used for nearly a decade.
> Yes, I understand that it does work right now. Surely,
> if you set the loop limit high enough, you can show
> that you'd iterate at least one over the number required
> to complete or fail, and thus "used for nearly a decade".
>
> But from a computational point of view it is not correct
> in that the very last delay is a wasted delay:
>
> 	write()
> 	for (count = 0; count < MAX_COUNT; count++) {
> 		res = read()
> 		if (res is success) break
> 		mdelay(10)              <-- if it changes here, we miss it on the last iteration
> 	}
>
> It's not necessary to fix this, but it is good to point out
> that this is currently the case.
>
>> In other words this code is what the hardware engineers recommended, is
>> used *much* more often outside the driver and known to be working well.
> Ah, yes, I didn't want to resort to the "hardware engineers recommended",
> but here is what I've been told by hardware engineers (as seen by their
> interface):
>
> 1. Write your request to the hardware.
> 2. Wait M ms for the hardware to do its thing.
> 3. Read back and check status.
> 4. If done, then you're good.
> 5. If not done, wait a tiny bit of time, N, and
>     then check again, until done or timeout.
>
> Usually N << M, since M is the actual time for
> hardware do do something, and N is a lot smaller
> than M, since it is just back-off time as
> the hardware can be done "anytime now".
>
> As a waveform it would look like this:
>
> Immediate success: WRITE, 10, READ.
> Delayed success:   WRITE, 10, READ, 1, READ 1, READ 1, ..., READ.
> Delayed failure:   WRITE, 10, READ, 1, READ 1, READ 1, ..., READ.
>
> The do-while loop above accomplishes this. The loop always ends on a read.
> The current code would end on a delay on a failure, like this:
> Delayed failure (count == MAX_COUNT): WRITE, READ, 10, READ, 10, READ, 10.
>
> But, as it is working right now "for a decade",
> it can be left as is, as long as we realize that
> the "last" delay is not necessary, when we're failing.
>
>>> I would've also liked to see the mutex business fixed
>>> as well from my original patch review, as it is easy to prove
>>> that it is always taken, so no need to embed it inside the if().
>> That is indeed a software problem which should be fixed.
> I've not seen a patch fixing this yet.
>
> Regards,
> Luben
>
>> Regards,
>> Christian.
>>
>>> Regards,
>>> Luben
>>>
>>>> Christian.
>>>>
>>>>>     	/* Wait for CTLACK and CTLACK2 to get asserted */
>>>>>     	for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) {
>>>>>     		uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK;