[Intel-gfx] [DMC_BUGFIX_SKL_V2 1/5] drm/i915/skl: Added a check for the hardware status of csr fw before loading.

Mon Sep 14 00:46:13 PDT 2015

On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:
> 
> 
> On 9/10/2015 8:15 PM, Daniel Vetter wrote:
> >On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
> >>
> >>On 9/2/2015 2:24 PM, Daniel Vetter wrote:
> >>>On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
> >>>>On 8/26/2015 6:40 PM, Daniel Vetter wrote:
> >>>>>On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
> >>>>>>Dmc will restore the csr program except DC9, cold boot,
> >>>>>>warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>
> >>>>>>intel_csr_load_program() function is used to load the firmware
> >>>>>>data from kernel memory to csr address space.
> >>>>>>
> >>>>>>All values of csr address space will be zero if it got reset and
> >>>>>>the first byte of csr program is always a non-zero if firmware
> >>>>>>is loaded successfuly. Based on hardware status will load the
> >>>>>>firmware.
> >>>>>>
> >>>>>>Without this condition check if we overwrite the firmware data the
> >>>>>>counters exposed for dc5/dc6 (help for debugging) will be nullified.
> >>>>Bacause of the above reason mentioned just above we need to block firmware loading again.
> >>>>So only WARN_ON will not help.
> >>>>
> >>>>
> >>>>>>v1: Initial version.
> >>>>>>
> >>>>>>v2: Based on review comments from Daniel,
> >>>>>>- Added a check to know hardware status and load the firmware if not loaded.
> >>>>>>
> >>>>>>Cc: Daniel Vetter <daniel.vetter at intel.com>
> >>>>>>Cc: Damien Lespiau <damien.lespiau at intel.com>
> >>>>>>Cc: Imre Deak <imre.deak at intel.com>
> >>>>>>Cc: Sunil Kamath <sunil.kamath at intel.com>
> >>>>>>Signed-off-by: Animesh Manna <animesh.manna at intel.com>
> >>>>>>Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju at intel.com>
> >>>>>>---
> >>>>>>  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
> >>>>>>  1 file changed, 9 insertions(+)
> >>>>>>
> >>>>>>diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>index ba1ae03..682cc26 100644
> >>>>>>--- a/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>+++ b/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>@@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
> >>>>>>  		return;
> >>>>>>  	}
> >>>>>>+	/*
> >>>>>>+	 * Dmc will restore the csr the program except DC9, cold boot,
> >>>>>>+	 * warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>+	 * This condition will help to check if csr address space is reset/
> >>>>>>+	 * not loaded.
> >>>>>>+	 */
> >>>>>Atm we call this from driver load and resume, which doesn seem to cover
> >>>>>all the cases you mention in the comment. Should this be a WARN_ON
> >>>>>instead? Or do we have troubles in our init sequence where we load too
> >>>>>many times?
> >>>>Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
> >>>>Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
> >>>>second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
> >>>>and cold boot(not loaded).
> >>>>
> >>>>Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
> >>>>If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
> >>>>
> >>>>Can the below comment more clear to you.
> >>>>
> >>>>	/*
> >>>>	 * Dmc will restore the csr the program except DC9, cold boot,
> >>>>	 * warm reset, PCI function level reset, and hibernate/suspend.
> >>>>	 * If firmware is restored by dmc then no need to load again which
> >>>>	 * will keep the dc5/dc6 counter exposed by firmware.
> >>>>	 */
> >>>>
> >>>>No issue in init sequence.
> >>>That seems to still cover all the callers of the function afaics - we do
> >>>pci resets over suspend resume unconditionally. So I still don't
> >>>understand where exactly we try to load the dmc firmware in i915.ko when
> >>>it's already loaded.
> >>During resume intel_csr_load_program() will be called from
> >>intel_runtime_resume().
> >>
> >>intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
> >>
> >>During Pc10 entry testing I can see dmc is restoring back the firmware always,
> >>but as you mentioned pci-reset can happen unconditionally, but still then
> >>also during resume intel_runtime_resume() will be called and based on
> >>register read of csr-base-address firmware loading will happen.
> >But in your comment you're saying it won't get restored in case of dc9 and
> >suspend. So that seems to mismatch what you're saying here (and what the
> >commit message says) and what the code does. And this function here is
> >called for resume after suspend/hibernate only.
> 
> pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
> I think you are confusing between dc6 and dc9. Pc10 can be achieved by
> entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
> for broxton which is not present for skylake.

I have no idea at all about different pc levels on skl. What I'm talking
about is system suspend/resume and driver load, which are the places this
function gets called. At least afaics.

> Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
> execution flow will be different in case of suspend/resume which I think is confusing
> you.

That seems like really important information. What's different on bxt?
These are the kind of details you should explain in the commit message ...

> I am ready explain you in detail. It will be good if we discuss specific use-case scenario
> and itz software design for specific platform. Another point - as dmc related code for
> broxton is not merged better first we close design for skylake. Now, I have added dc9
> description in comment thinking of future. If you want I can remove for now and later
> can add in bxt patch series for enabling dmc. Will wait for your reply.

This question here isn't about the overall design and how to handle power
wells in skl/bxt. That's a separate discussion and tracked somewhere else.
I'm really just confused about when exactly we need to reload to firmware,
and why we need a runtime check for that. Normally we should know when to
reload the firmware and just either reload or not, without checking hw
state. And I don't like checking for hw state since at least in the past
that kind of code ended up being fragile - it's an illusion that it does
the right thing no matter what, since often there's other tricky ordering
constraints. And if you have automatic duct-tape like then no one will
ever spot those other, harder to spot issues, until an expensive customer
escalation happens.

So what I want to know here is:
- When exactly do we need to reload dmc firmware.
- What exactly is the reason why we can't make that decision statically in
  the code (by calling csr_load at the right spots).

Thanks, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch