[Intel-gfx] [PATCH 1/2] Give up on edid retries when i2c tells us that bus is not there

Tue Oct 18 06:37:38 PDT 2011

On 10/18/2011 11:14, Jean Delvare wrote:
> Hi Dave, Eugeni, Alex,
>
> On Tue, 18 Oct 2011 11:02:00 +0100, Dave Airlie wrote:
>>> This allows to avoid talking to a non-existent bus repeatedly until we
>>> finally timeout. The non-existent bus is signalled by -ENXIO error,
>>> provided by i2c_algo_bit:bit_doAddress call.
>>>
>>> As the advantage of such change, all the other routines which use
>>> drm_get_edid would benefit for this timeout.
>>>
>>> This change should fix
>>> https://bugs.freedesktop.org/show_bug.cgi?id=41059 and improve overall
>>> edid detection timing by 10-30% in most cases, and by a much larger margin
>>> in case of phantom outputs.
>>
>> Jean, Alex,
>>
>> I'm thinking of thowing this into -next, any objections?
>
> This seems to be the wrong place for the test. The code below is really
> testing for non-responsive (possibly not present) EDID, NOT
> "non-existent adapter". Non-existent adapter should be checked in the
> firmware if possible, or failing that, by testing the bus lines at bus
> creation time, as i2c_algo_bit.bit_test=1 is doing. While this setting
> has been known to cause trouble recently (not per se but because it was
> triggering a bug somewhere else in the radeon driver), it might still
> have value, and could be changed to a per-adapter setting by exporting
> the test function (I have a patch ready doing exactly this) and letting
> video drivers test their I2C buses and discard the non-working ones if
> they want.
>
> I am worried that the patch below will cause regressions on other
> machines. There's a comment right before the loop which reads:
>
> 	/* The core i2c driver will automatically retry the transfer if the
> 	 * adapter reports EAGAIN. However, we find that bit-banging transfers
> 	 * are susceptible to errors under a heavily loaded machine and
> 	 * generate spurious NAKs and timeouts. Retrying the transfer
> 	 * of the individual block a few times seems to overcome this.
> 	 */
>
> So the retries are there for a reason, and -ENXIO is exactly what you
> get on spurious NAKs.
>

Hi Jean,

You are right about the bit_test=1 testing, my initial attempt at the
fix did exactly that
(https://bugs.freedesktop.org/show_bug.cgi?id=41059, comments 14 and 15).

The problem is that for some buses, namely HDMI ones in my testing,
bit_test-like tests always consider them as non-existent when the
connectivity is not setup; and as working when it is. So in any case, we
could not just blacklist them - when they do, they are gone for good,
and won't work anymore, and we need to keep re-trying them at each attempt.

And in case of continuous pre-testing for the stuck bits and like when
driver attempts to get the edid (for example, when xrandr is run), we
still hit the same issue - the drm_do_probe_ddc_edid will continue to
retry the attempts until it reaches the maximum number of retries. E.g., there
is no way to tell drm_do_probe_ddc_edid to treat any return code as a
permanent failure and make it give up faster.

It could be fixed either in per-driver fashion, like I did with the other patch
(which also tests for -ENXIO); or in a generic way, in DRM. I couldn't reproduce
any false positives coming from -ENXIO on i915 driver, but perhaps it is
easier with radeon? Do you have any specific workload which trick the
driver into generating spurious NAKs by a chance?

> One thing which may make sense would be to make the retry count a
> module parameter, instead of a hard-coded value, so that users can
> lower the value if they care more about boot time than reliability. But
> again, ideally problematic buses would not have been created in the
> first place.

Or perhaps it would be possible to have any error code coming from
i2c_transfer to signal a permanent error, which should not be retried..
what do you think?

--
Eugeni Dodonov