Looking for a start point for fixing a bug

Sun Aug 10 23:50:32 PDT 2014

On Monday 11 August 2014, 08:03:57, Адонай Элохим wrote:
> 2014-08-11 1:53 GMT+04:00 Niels Ole Salscheider
> 
> <niels_ole at salscheider-online.de>:
> > On Monday 11 August 2014, 01:19:32, Адонай Элохим wrote:
> >> Hello again, hope you are still reading my texts...
> >> 
> >> I digged through the code and narrowed down the issue I wanted to fix.
> >> It appears to be related to the `bool thermal_active` dpm struct
> >> member and this piece of code:
> >> 
> >> if (rdev->asic->dpm.force_performance_level) {
> >> 
> >>         if (rdev->pm.dpm.thermal_active) {
> >>         
> >>             enum radeon_dpm_forced_level level =
> >>             rdev->pm.dpm.forced_level;
> >>             /* force low perf level for thermal */
> >>             radeon_dpm_force_performance_level(rdev,
> >> 
> >> RADEON_DPM_FORCED_LEVEL_LOW);
> >> 
> >>             /* save the user's level */
> >>             rdev->pm.dpm.forced_level = level;
> >>         
> >>         } else {
> >>         
> >>             /* otherwise, user selected level */
> >>             radeon_dpm_force_performance_level(rdev,
> >> 
> >> rdev->pm.dpm.forced_level); }
> >> 
> >>     }
> >> 
> >> I did a double check here - at boot `thermal_active` is `false` and
> >> thus, performance level is properly initiated. But at resume from
> >> suspend `thermal_active` is true and performance level is strictly
> >> bound to low profile.
> >> Besides you cannot change it via echo 1 > /sys/.../force_dpm_level,
> >> again thanks to `thermal_active` checked there.
> >> 
> >> Could you explain meaning of this small boolean to me? I'd like to
> >> make a small neat patch fixing this, but I'm scared of doing it in
> >> wrong way.
> >> Sorry if I'm being too persistent.
> > 
> > I think thermal_active means that the temperature got too high so that low
> > clocks have to be used.
> > 
> > Just some idea, but thermal.work only gets scheduled when the high to low
> > temperature interrupt occurs. When the temperature is too high before
> > suspend (so that thermal_active is true) and it gets low during standby
> > this interrupt will not occur. thermal.work is therefore not scheduled...
> > 
> > Ole
> 
> You were right, Ole. The driver thinks the temperature is too high.
> Thanks a lot!
> It seems the function ci_set_thermal_temperature_range is missing some
> lines:
> 
> 
> diff --git a/drivers/gpu/drm/radeon/ci_dpm.c
> b/drivers/gpu/drm/radeon/ci_dpm.c index 584090a..102a4bc 100644
> --- a/radeon/ci_dpm.c
> +++ b/radeon/ci_dpm.c
> @@ -869,6 +869,9 @@ static int ci_set_thermal_temperature_range(struct
> radeon_device *rdev,
>         WREG32_SMC(CG_THERMAL_CTRL, tmp);
>  #endif
> 
> +    rdev->pm.dpm.thermal.min_temp = low_temp;
> +    rdev->pm.dpm.thermal.max_temp = high_temp;
> +
>         return 0;
>  }
> 
> 
> All other similar callbacks for different families of cards have these
> lines. I wonder if there is any specific case for not doing this...

This seems to be wrong indeed. I wonder why it happens after a suspend - 
resume cycle, though. Does the hardware generate a corresponding interrupt 
after resume?

There is however still an XXX comment in that function... Maybe Alex can 
comment on that.

> How do I propose it as a patch anyway?

Fix it in your local git checkout, commit it (don't forget to pass -s to get 
your signed-of-by line) and use git format-patch / git send-email to send it 
to this list...

> >> Thanks,
> >> Oleg
> >> 
> >> 2014-07-22 20:05 GMT+04:00 Alex Deucher <alexdeucher at gmail.com>:
> >> > On Tue, Jul 22, 2014 at 8:39 AM, Адонай Элохим <algonkvel at gmail.com>
> > 
> > wrote:
> >> >> Hello all!
> >> >> 
> >> >> I have some spare time and knowledge in C to try to fix some bugs I am
> >> >> seeing on my machine.
> >> >> So I've checked out and compiled all git trees that I may need and now
> >> >> I'm
> >> >> beginning to read articles.
> >> >> 
> >> >> And this is the point from where I don't know where to go. I want to
> >> >> fix
> >> >> particular bug #79806 [1].
> >> >> For me there are many places where this bug can hide - mesa? dri?
> >> >> radeon
> >> >> kernel module? and I just don't know whether should I start reading
> >> >> articles about mesa hacking or about dri architecture or about kernel
> >> >> module development.
> >> >> 
> >> >> Now I think the best thing for me is to start looking through radeon
> >> >> kernel
> >> >> module code (I've got ingenious idea that power management resides
> >> >> there)
> >> >> and read more about its architecture. Is this right? I mean, I just
> >> >> want
> >> >> to
> >> >> find out, is this a right place to start looking at for this bug?
> >> > 
> >> > The power management is handled in the kernel driver.  See radeon_pm.c
> >> > and the relevant *_dpm.c files depending on what asic you have.
> >> > 
> >> > Alex
> >> > 
> >> >> P.S. Sorry for my English in case it's not good, I'm learning it now
> >> >> 
> >> >> P.P.S. And thanks for your hard work!
> >> >> 
> >> >> -------------------------------------------
> >> >> [1] https://bugs.freedesktop.org/show_bug.cgi?id=79806
> >> >> 
> >> >> _______________________________________________
> >> >> dri-devel mailing list
> >> >> dri-devel at lists.freedesktop.org
> >> >> http://lists.freedesktop.org/mailman/listinfo/dri-devel
> >> 
> >> _______________________________________________
> >> dri-devel mailing list
> >> dri-devel at lists.freedesktop.org
> >> http://lists.freedesktop.org/mailman/listinfo/dri-devel