  1. Dec 01, 2020
    • cpuidle: Select polling interval based on a c-state with a longer target residency · 7a25759e
      Mel Gorman authored
      
      It was noted that a few workloads that idle rapidly regressed when commit
      36fcb429 ("cpuidle: use first valid target residency as poll time")
      was merged. The workloads in question were heavy communicators that idle
      rapidly and were impacted by the c-state exit latency as the active CPUs
      were not polling at the time of wakeup. As they were not particularly
      realistic workloads, it was not considered to be a major problem.
      
      Unfortunately, a bug was reported for a real workload in a production
      environment that relied on large numbers of threads operating in a worker
      pool pattern. These threads would idle for periods of time longer than the
      C1 target residency and so incurred the c-state exit latency penalty. The
      application is very sensitive to wakeup latency and was indirectly relying
      on the behaviour prior to commit a37b969a ("cpuidle: poll_state: Add
      time limit to poll_idle()") to poll for long enough to avoid the exit
      latency cost.
      
      The target residency of C1 is typically very short. On some x86 machines,
      it can be as low as 2 microseconds. In poll_idle(), the clock is checked
      every POLL_IDLE_RELAX_COUNT iterations of cpu_relax(), and even one
      iteration of that loop can take over 1 microsecond, so the polling interval
      is very close to the granularity of what poll_idle() can detect. Furthermore,
      a basic ping-pong workload like perf bench pipe has a longer round-trip
      time than 2 microseconds, meaning that the CPU will almost certainly not
      be polling when the ping-pong completes.
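
      For reference, the structure of the poll_idle() loop being described is
      roughly the following (a simplified sketch, not the exact upstream code):

      	static void poll_idle_sketch(u64 limit_ns)
      	{
      		u64 time_start = local_clock();
      		unsigned int loop_count = 0;

      		while (!need_resched()) {
      			cpu_relax();
      			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
      				continue;

      			/* The clock is read only once per POLL_IDLE_RELAX_COUNT iterations. */
      			loop_count = 0;
      			if (local_clock() - time_start > limit_ns)
      				break;
      		}
      	}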
      
      This patch selects a polling interval based on an enabled c-state that
      has a target residency longer than 10usec. If there is no such enabled
      c-state then polling will last up to TICK_NSEC/16, similar to what it was
      up until kernel 4.20. Polling for a full tick is unlikely anyway (a
      rescheduling event would fire first) and a tick is much longer than the
      existing target residencies for a deep c-state.
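
      The selection logic is, in essence, the following (an illustrative sketch;
      the constant and helper names here are not necessarily the upstream ones):

      	#define POLL_LIMIT_MIN_NS	(10 * NSEC_PER_USEC)	/* the 10usec cut-off above */
      	#define POLL_LIMIT_MAX_NS	(TICK_NSEC / 16)

      	static u64 select_polling_limit(struct cpuidle_driver *drv,
      					struct cpuidle_device *dev)
      	{
      		u64 limit_ns = POLL_LIMIT_MAX_NS;
      		int i;

      		/* State 0 is the polling state itself, so start from 1. */
      		for (i = 1; i < drv->state_count; i++) {
      			if (dev->states_usage[i].disable)
      				continue;

      			if (drv->states[i].target_residency_ns < POLL_LIMIT_MIN_NS)
      				continue;

      			/* First enabled state with a long enough target residency. */
      			limit_ns = min_t(u64, drv->states[i].target_residency_ns,
      					 POLL_LIMIT_MAX_NS);
      			break;
      		}

      		return limit_ns;
      	}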
      
      As an example, consider an Intel CPU with the following c-state
      information:
      
      	State	residency (usec)	exit_latency (usec)
      	C1	2			2
      	C1E	20			10
      	C3	100			33
      	C6	400			133
      
      The polling interval selected is 20usec. If booted with
      intel_idle.max_cstate=1 then the polling interval is 250usec as the deeper
      c-states were not available.
      
      On an AMD EPYC machine, the c-state information is more limited and
      looks like
      
      	State	residency (usec)	exit_latency (usec)
      	C1	2			1
      	C2	800			400
      
      The polling interval selected is 250usec. While C2 was considered, the
      polling interval was clamped by CPUIDLE_POLL_MAX.
      
      Note that polling is not expected to be a universal win. As well as
      potentially trading power for performance, the performance gain is not
      guaranteed if the extra polling prevents a turbo state from being reached.
      Making the interval a tunable was considered, but it is driver-specific,
      may be overridden by a governor and is not a guaranteed polling interval,
      making it difficult to describe without knowledge of the implementation.
      
      tbench4 (throughput by client count; higher is better)
      			     vanilla		    polling
      Hmean     1        497.89 (   0.00%)      543.15 *   9.09%*
      Hmean     2        975.88 (   0.00%)     1059.73 *   8.59%*
      Hmean     4       1953.97 (   0.00%)     2081.37 *   6.52%*
      Hmean     8       3645.76 (   0.00%)     4052.95 *  11.17%*
      Hmean     16      6882.21 (   0.00%)     6995.93 *   1.65%*
      Hmean     32     10752.20 (   0.00%)    10731.53 *  -0.19%*
      Hmean     64     12875.08 (   0.00%)    12478.13 *  -3.08%*
      Hmean     128    21500.54 (   0.00%)    21098.60 *  -1.87%*
      Hmean     256    21253.70 (   0.00%)    21027.18 *  -1.07%*
      Hmean     320    20813.50 (   0.00%)    20580.64 *  -1.12%*
      
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  7. Jun 23, 2020
    • PM: s2idle: Clear _TIF_POLLING_NRFLAG before suspend to idle · 81e67375
      Chen Yu authored
      
      Suspend-to-idle was recently found to not work on Goldmont CPUs.
      
      The issue happens due to:
      
       1. On Goldmont the CPU in idle can only be woken up via IPIs,
          not POLLING mode, due to commit 08e237fa ("x86/cpu: Add
          workaround for MONITOR instruction erratum on Goldmont based
          CPUs")
      
       2. When the CPU enters the suspend-to-idle process, the
          _TIF_POLLING_NRFLAG remains set, because cpuidle_enter_s2idle()
          doesn't match call_cpuidle() exactly.
      
       3. Commit b2a02fc4 ("smp: Optimize send_call_function_single_ipi()")
          makes use of _TIF_POLLING_NRFLAG to avoid sending IPIs to idle
          CPUs.
      
       4. As a result, some IPI-related functions might not work
          well during suspend to idle on Goldmont. For example, one
          suspected victim:

          tick_unfreeze() -> timekeeping_resume() -> hrtimers_resume()
          -> clock_was_set() -> on_each_cpu() might wait forever,
          because the IPIs will not be sent to the CPUs that are
          sleeping with _TIF_POLLING_NRFLAG set, and a Goldmont CPU
          cannot be woken up by only setting _TIF_NEED_RESCHED
          on the monitor address.
      
      To avoid that, clear the _TIF_POLLING_NRFLAG flag before invoking
      enter_s2idle_proper() in cpuidle_enter_s2idle() in analogy with the
      call_cpuidle() code flow.
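
      A sketch of that idea, modelled on call_cpuidle() (the exact placement in
      cpuidle_enter_s2idle() may differ from the actual patch):

      	if (current_clr_polling_and_test()) {
      		/* A reschedule is already pending, do not enter s2idle. */
      		local_irq_enable();
      		return -EBUSY;
      	}

      	/*
      	 * _TIF_POLLING_NRFLAG is now clear, so other CPUs will send a real
      	 * IPI instead of relying on this CPU noticing TIF_NEED_RESCHED on
      	 * the monitor address.
      	 */
      	enter_s2idle_proper(drv, dev, index);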
      
      Fixes: b2a02fc4 ("smp: Optimize send_call_function_single_ipi()")
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      [ rjw: Subject / changelog ]
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  8. Feb 13, 2020
    • PM: QoS: Drop PM_QOS_CPU_DMA_LATENCY notifier chain · 3a4a0042
      Rafael J. Wysocki authored
      
      Notice that pm_qos_remove_notifier() is not used at all and the only
      caller of pm_qos_add_notifier() is the cpuidle core, which only needs
      the PM_QOS_CPU_DMA_LATENCY notifier to invoke wake_up_all_idle_cpus()
      upon changes of the PM_QOS_CPU_DMA_LATENCY target value.
      
      First, to ensure that wake_up_all_idle_cpus() will be called
      whenever the PM_QOS_CPU_DMA_LATENCY target value changes, modify the
      pm_qos_add/update/remove_request() family of functions to check if
      the effective constraint for the PM_QOS_CPU_DMA_LATENCY has changed
      and call wake_up_all_idle_cpus() directly in that case.
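
      Roughly speaking (an illustrative sketch only; the helper name below is
      hypothetical and the real change is folded into the pm_qos_*_request()
      helpers rather than a separate function):

      	static void cpu_dma_latency_qos_apply(struct pm_qos_constraints *c,
      					      struct plist_node *node,
      					      enum pm_qos_req_action action,
      					      s32 value)
      	{
      		/* pm_qos_update_target() returns 1 if the effective value changed. */
      		if (pm_qos_update_target(c, node, action, value) > 0)
      			wake_up_all_idle_cpus();
      	}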
      
      Next, drop the PM_QOS_CPU_DMA_LATENCY notifier from cpuidle as it is
      not necessary any more.
      
      Finally, drop both pm_qos_add_notifier() and pm_qos_remove_notifier(),
      as they have no callers now, along with cpu_dma_lat_notifier which is
      only used by them.
      
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
      Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
      Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
  10. Dec 27, 2019
    • cpuidle: Allow idle states to be disabled by default · 75a80267
      Rafael J. Wysocki authored
      
      In certain situations it may be useful to prevent some idle states
      from being used by default while allowing user space to enable them
      later on.
      
      For this purpose, introduce a new state flag, CPUIDLE_FLAG_OFF, to
      mark idle states that should be disabled by default, make the core
      set CPUIDLE_STATE_DISABLED_BY_USER for those states at
      initialization time, and add a new state attribute in sysfs,
      "default_status", to inform user space of the initial status of
      the given idle state ("disabled" if CPUIDLE_FLAG_OFF is set for it,
      "enabled" otherwise).
      
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  13. Nov 29, 2019
    • cpuidle: Drop disabled field from struct cpuidle_state · ba1e78a1
      Rafael J. Wysocki authored
      
      After recent cpuidle updates the "disabled" field in struct
      cpuidle_state is only used by two drivers (intel_idle and shmobile
      cpuidle) for marking unusable idle states, but that may as well be
      achieved with the help of a state flag, so define an "unusable" idle
      state flag, CPUIDLE_FLAG_UNUSABLE, make the drivers in question use
      it instead of the "disabled" field and make the core set
      CPUIDLE_STATE_DISABLED_BY_DRIVER for the idle states with that flag
      set.
      
      After the above changes, the "disabled" field in struct cpuidle_state
      is not used any more, so drop it.
      
      No intentional functional impact.
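
      For illustration, the resulting pattern looks roughly like this (not the
      exact hunks):

      	/* Driver side (e.g. intel_idle), instead of setting ->disabled: */
      	drv->states[i].flags |= CPUIDLE_FLAG_UNUSABLE;

      	/* Core side, at device registration time: */
      	if (drv->states[i].flags & CPUIDLE_FLAG_UNUSABLE)
      		dev->states_usage[i].disable |= CPUIDLE_STATE_DISABLED_BY_DRIVER;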
      
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  15. Nov 11, 2019
    • cpuidle: Use nanoseconds as the unit of time · c1d51f68
      Rafael J. Wysocki authored
      
      Currently, the cpuidle subsystem uses microseconds as the unit of
      time which (among other things) causes the idle loop to incur some
      integer division overhead for no clear benefit.
      
      In order to allow cpuidle to measure time in nanoseconds, add two
      new fields, exit_latency_ns and target_residency_ns, to represent the
      exit latency and target residency of an idle state in nanoseconds,
      respectively, to struct cpuidle_state and initialize them with the
      help of the corresponding values in microseconds provided by drivers.
      Additionally, change cpuidle_governor_latency_req() to return the
      idle state exit latency constraint in nanoseconds.
      
      Also measure idle state residency (last_residency_ns in struct
      cpuidle_device and time_ns in struct cpuidle_driver) in nanoseconds
      and update the cpuidle core and governors accordingly.
      
      However, the menu governor still computes typical intervals in
      microseconds to avoid integer overflows.
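
      The conversion at driver init time is essentially (a simplified sketch,
      assuming drivers keep providing microsecond values):

      	s->target_residency_ns = (u64)s->target_residency * NSEC_PER_USEC;
      	s->exit_latency_ns = (u64)s->exit_latency * NSEC_PER_USEC;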
      
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Doug Smythies <dsmythies@telus.net>
      Tested-by: Doug Smythies <dsmythies@telus.net>
  16. Nov 06, 2019
    • cpuidle: Consolidate disabled state checks · 99e98d3f
      Rafael J. Wysocki authored
      
      There are two reasons why CPU idle states may be disabled: either
      because the driver has disabled them or because they have been
      disabled by user space via sysfs.
      
      In the former case, the state's "disabled" flag is set once during
      the initialization of the driver and it is never cleared later (it
      is effectively read-only).  In the latter case, the "disable" field
      of the given state's cpuidle_state_usage struct is set and it may be
      changed via sysfs.  Thus checking whether or not an idle state has
      been disabled involves reading these two flags every time.
      
      In order to avoid the additional check of the state's "disabled" flag
      (which is effectively read-only anyway), use the value of it at the
      init time to set a (new) flag in the "disable" field of that state's
      cpuidle_state_usage structure and use the sysfs interface to
      manipulate another (new) flag in it.  This way the state is disabled
      whenever the "disable" field of its cpuidle_state_usage structure is
      nonzero, whatever the reason, and it is the only place to look into
      to check whether or not the state has been disabled.
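
      In other words, the disabled check collapses to a single field test
      (a sketch; the bit assignments are illustrative):

      	#define CPUIDLE_STATE_DISABLED_BY_USER		BIT(0)
      	#define CPUIDLE_STATE_DISABLED_BY_DRIVER	BIT(1)

      	static bool cpuidle_state_disabled(struct cpuidle_device *dev, int i)
      	{
      		/* Nonzero for any reason means "do not use this state". */
      		return dev->states_usage[i].disable != 0;
      	}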
      
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  18. Apr 09, 2019
    • cpuidle: Export the next timer expiration for CPUs · 6f9b83ac
      Ulf Hansson authored
      
      To be able to predict the sleep duration for a CPU entering idle, it
      is essential to know the expiration time of the next timer.  Both the
      teo and the menu cpuidle governors already use this information for
      CPU idle state selection.
      
      Moving forward, a similar prediction needs to be made for a group of
      idle CPUs rather than for a single one and the following changes
      implement a new genpd governor for that purpose.
      
      In order to support that feature, add a new function called
      tick_nohz_get_next_hrtimer() that will return the next hrtimer
      expiration time of a given CPU to be invoked after deciding
      whether or not to stop the scheduler tick on that CPU.
      
      Make the cpuidle core call tick_nohz_get_next_hrtimer() right
      before invoking the ->enter() callback provided by the cpuidle
      driver for the given state and store its return value in the
      per-CPU struct cpuidle_device, so as to make it available to code
      outside of cpuidle.
      
      Note that at the point when cpuidle calls tick_nohz_get_next_hrtimer(),
      the governor's ->select() callback has already returned and indicated
      whether or not the tick should be stopped, so in fact the value
      returned by tick_nohz_get_next_hrtimer() always is the next hrtimer
      expiration time for the given CPU, possibly including the tick (if
      it hasn't been stopped).
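
      A sketch of what that looks like in cpuidle_enter() (simplified; field
      and helper placement may differ slightly from the actual patch):

      	WRITE_ONCE(dev->next_hrtimer, tick_nohz_get_next_hrtimer());

      	if (cpuidle_state_is_coupled(drv, index))
      		ret = cpuidle_enter_state_coupled(dev, drv, index);
      	else
      		ret = cpuidle_enter_state(dev, drv, index);

      	/* Invalidate the stale value once the CPU is back from idle. */
      	WRITE_ONCE(dev->next_hrtimer, 0);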
      
      Co-developed-by: Lina Iyer <lina.iyer@linaro.org>
      Co-developed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      [ rjw: Subject & changelog ]
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  22. Apr 06, 2018
    • cpuidle: Return nohz hint from cpuidle_select() · 45f1ff59
      Rafael J. Wysocki authored
      
      Add a new pointer argument to cpuidle_select() and to the ->select
      cpuidle governor callback to allow a boolean value indicating
      whether or not the tick should be stopped before entering the
      selected state to be returned from there.
      
      Make the ladder governor ignore that pointer (to preserve its
      current behavior) and make the menu governor return 'false' through
      it if:
       (1) the idle exit latency is constrained at 0, or
       (2) the selected state is a polling one, or
       (3) the expected idle period duration is within the tick period
           range.
      
      In addition to that, the correction factor computations in the menu
      governor need to take the possibility that the tick may not be
      stopped into account to avoid artificially small correction factor
      values.  To that end, add a mechanism to record tick wakeups, as
      suggested by Peter Zijlstra, and use it to modify the menu_update()
      behavior when tick wakeup occurs.  Namely, if the CPU is woken up by
      the tick and the return value of tick_nohz_get_sleep_length() is not
      within the tick boundary, the predicted idle duration is likely too
      short, so make menu_update() try to compensate for that by updating
      the governor statistics as though the CPU was idle for a long time.
      
      Since the value returned through the new argument pointer of
      cpuidle_select() is not used by its caller yet, this change by
      itself is not expected to alter the functionality of the code.
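
      The new ->select() contract can be pictured like this (an illustration of
      the three conditions above; variable names and units are simplified and
      are not the menu governor's actual internals):

      	static int select_sketch(struct cpuidle_driver *drv, int idx,
      				 s64 latency_req_ns, s64 predicted_ns,
      				 bool *stop_tick)
      	{
      		*stop_tick = true;

      		if (latency_req_ns == 0 ||				/* (1) */
      		    (drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||	/* (2) */
      		    predicted_ns < TICK_NSEC)				/* (3) */
      			*stop_tick = false;

      		return idx;
      	}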
      
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  23. Mar 29, 2018
    • PM: cpuidle/suspend: Add s2idle usage and time state attributes · 64bdff69
      Rafael J. Wysocki authored
      
      Add a new attribute group called "s2idle" under the sysfs directory
      of each cpuidle state that supports the ->enter_s2idle callback
      and put two new attributes, "usage" and "time", into that group to
      represent the number of times the given state was requested for
      suspend-to-idle and the total time spent in suspend-to-idle after
      requesting that state, respectively.
      
      That will allow diagnostic information related to suspend-to-idle
      to be collected without enabling advanced debug features and
      analyzing dmesg output.
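
      For example, the new attributes can be read from user space like this
      (a minimal sketch; cpu0/state1 is just an arbitrary example path):

      	#include <stdio.h>

      	int main(void)
      	{
      		const char *p = "/sys/devices/system/cpu/cpu0/cpuidle/state1/s2idle/usage";
      		unsigned long long usage;
      		FILE *f = fopen(p, "r");

      		if (!f) {
      			perror(p);
      			return 1;
      		}
      		if (fscanf(f, "%llu", &usage) == 1)
      			printf("state1 requested %llu times for suspend-to-idle\n", usage);
      		fclose(f);
      		return 0;
      	}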
      
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  32. Jul 04, 2016
    • cpuidle: Fix last_residency division · dbd1b8ea
      Shreyas B. Prabhu authored
      
      Snooze is a poll idle state on the powernv and pseries platforms. Snooze
      has a timeout so that if a CPU stays in snooze for more than the target
      residency of the next available idle state, it exits, giving the cpuidle
      governor a chance to re-evaluate and promote the CPU to a deeper idle
      state. Therefore, whenever snooze exits due to this timeout, its
      last_residency will be the target_residency of the next deeper state.
      
      Commit e93e59ce "cpuidle: Replace ktime_get() with local_clock()"
      changed the math around the last_residency calculation. Specifically,
      when converting the last_residency value from nano- to microseconds, it
      carries out a right shift by 10. Because of that, in snooze timeout
      exit scenarios the calculated last_residency is roughly 2.3% less than
      the target_residency of the next available state. This pattern is picked
      up by get_typical_interval() in the menu governor and therefore
      expected_interval in menu_select() is frequently less than the
      target_residency of any state other than snooze.
      
      Due to this, snooze is entered at a higher rate, thereby hurting
      single-thread performance.
      
      Fix this by using more precise division via ktime_us_delta().
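
      A sketch of the change in cpuidle_enter_state() (simplified):

      	/*
      	 * Before: the nanosecond delta was converted with ">> 10"
      	 * (divide by 1024) instead of a precise divide by 1000.
      	 */
      	time_start = ns_to_ktime(local_clock());
      	/* ... enter and exit the idle state ... */
      	time_end = ns_to_ktime(local_clock());

      	diff = ktime_us_delta(time_end, time_start);	/* precise us conversion */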
      
      Fixes: e93e59ce "cpuidle: Replace ktime_get() with local_clock()"
      Reported-by: Anton Blanchard <anton@samba.org>
      Bisected-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  33. May 18, 2016
    • cpuidle: Fix cpuidle_state_is_coupled() argument in cpuidle_enter() · e7387da5
      Daniel Lezcano authored
      
      Commit 0b89e9aa (cpuidle: delay enabling interrupts until all
      coupled CPUs leave idle) rightfully fixed a regression by letting
      the coupled idle state framework handle local interrupt enabling
      when the CPU is exiting an idle state.
      
      The current code checks if the idle state is coupled and, if so, it
      lets the coupled code enable interrupts. This way, the CPU can
      decrement the ready-count before handling the interrupt. This
      mechanism prevents the other CPUs from waiting for a CPU which is
      handling interrupts.
      
      But the check is done against the state index returned by the back
      end driver's ->enter functions which could be different from the
      initial index passed as parameter to the cpuidle_enter_state()
      function.
      
       entered_state = target_state->enter(dev, drv, index);
      
       [ ... ]
      
       if (!cpuidle_state_is_coupled(drv, entered_state))
      	local_irq_enable();
      
       [ ... ]
      
      If the 'index' refers to a coupled idle state but the
      'entered_state' is *not* coupled, then the interrupts are enabled
      again. All CPUs blocked on the sync barrier may busy loop longer
      if the CPU has interrupts to handle before decrementing the
      ready-count. That consumes more energy than it saves.
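
      The fix is to check coupling against the state that was asked for
      ('index') rather than the state reported back by the driver:

       entered_state = target_state->enter(dev, drv, index);

       [ ... ]

       if (!cpuidle_state_is_coupled(drv, index))
      	local_irq_enable();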
      
      Fixes: 0b89e9aa (cpuidle: delay enabling interrupts until all coupled CPUs leave idle)
      Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: 3.15+ <stable@vger.kernel.org> # 3.15+
      [ rjw: Subject & changelog ]
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  34. Apr 26, 2016
    • cpuidle: Replace ktime_get() with local_clock() · e93e59ce
      Daniel Lezcano authored
      
      ktime_get() can have non-negligible overhead, so use local_clock()
      instead.
      
      In order to test the difference between ktime_get() and local_clock(),
      a quick hack was added to trigger, via debugfs, 10000 calls each to
      ktime_get() and local_clock() and measure the elapsed time.

      Then the average, minimum and maximum values were computed for each call.

      From userspace, the test above was run 100 times, every 2 seconds.

      So, ktime_get() and local_clock() have been called 1000000 times in
      total.
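
      The measurement loop was along these lines (a rough reconstruction, not
      the hack that was actually used):

      	static void measure_overhead(u64 *avg, u64 *min, u64 *max)
      	{
      		u64 total = 0;
      		int i;

      		*min = U64_MAX;
      		*max = 0;

      		for (i = 0; i < 10000; i++) {
      			u64 t0, delta;

      			t0 = local_clock();
      			ktime_get();		/* or local_clock(), for comparison */
      			delta = local_clock() - t0;

      			total += delta;
      			if (delta < *min)
      				*min = delta;
      			if (delta > *max)
      				*max = delta;
      		}

      		*avg = total / 10000;
      	}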
      
      The results are:
      
      ktime_get():
      ============
       * average: 101 ns (stddev: 27.4)
       * maximum: 38313 ns
       * minimum: 65 ns
      
      local_clock():
      ==============
       * average: 60 ns (stddev: 9.8)
       * maximum: 13487 ns
       * minimum: 46 ns
      
      The local_clock() is faster and more stable.
      
      Even if it is a drop in the ocean, replacing ktime_get() with
      local_clock() saves 80ns at idle time (entry + exit). And in some
      circumstances, especially when there are several CPUs racing for the
      clock access, we save tens of microseconds.
      
      The idle duration resulting from a diff is converted from nanoseconds to
      microseconds. This could be done with an integer division (div 1000), which
      is an expensive operation, or with a 10-bit shift (div 1024), which is fast
      but imprecise.
      
      The following table gives some results at the limits.
      
       ------------------------------------------
      |   nsec   |   div(1000)   |   div(1024)   |
       ------------------------------------------
      |   1e3    |        1 usec |      976 nsec |
       ------------------------------------------
      |   1e6    |     1000 usec |      976 usec |
       ------------------------------------------
      |   1e9    |  1000000 usec |   976562 usec |
       ------------------------------------------
      
      There is a linear deviation of 2.34%. This loss of precision is acceptable
      in the context of the resulting diff, which is used for statistics. These
      statistics are processed to estimate the duration of the next idle period,
      which ends up in an idle state selection. The selection criteria take into
      account the next duration based on large intervals, represented by the idle
      states' target residencies.

      The 2^10 division is good enough because the error relative to the 1e3
      division is lost in all the approximations done for the next idle duration
      computation.
      
      Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      [ rjw: Subject ]
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>