  1. Dec 15, 2023
    • kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy() · ff6d413b
      Kees Cook authored
      
      One of the last remaining users of strlcpy() in the kernel is
      kernfs_path_from_node_locked(), which passes back the problematic "length
      we _would_ have copied" return value to indicate truncation.  Convert the
      chain of all callers to use the negative return value (some of which were
      already doing this explicitly). All callers were also already checking
      for negative return values, so the risk of missed checks looks very low.
      
      In this analysis, it was found that cgroup1_release_agent() actually
      didn't handle the "too large" condition, so this is technically also a
      bug fix. :)
      
      Here's the chain of callers, and resolution identifying each one as now
      handling the correct return value:
      
      kernfs_path_from_node_locked()
              kernfs_path_from_node()
                      pr_cont_kernfs_path()
                              returns void
                      kernfs_path()
                              sysfs_warn_dup()
                                      return value ignored
                              cgroup_path()
                                      blkg_path()
                                              bfq_bic_update_cgroup()
                                                      return value ignored
                                      TRACE_IOCG_PATH()
                                              return value ignored
                                      TRACE_CGROUP_PATH()
                                              return value ignored
                                      perf_event_cgroup()
                                              return value ignored
                                      task_group_path()
                                              return value ignored
                                      damon_sysfs_memcg_path_eq()
                                              return value ignored
                                      get_mm_memcg_path()
                                              return value ignored
                                      lru_gen_seq_show()
                                              return value ignored
                              cgroup_path_from_kernfs_id()
                                      return value ignored
                      cgroup_show_path()
                              already converted "too large" error to negative value
                      cgroup_path_ns_locked()
                              cgroup_path_ns()
                                      bpf_iter_cgroup_show_fdinfo()
                                              return value ignored
                                      cgroup1_release_agent()
                                              wasn't checking "too large" error
                              proc_cgroup_show()
                                      already converted "too large" to negative value
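
      As a minimal caller sketch (illustrative only, not taken from the patch;
      "kn" stands for whatever kernfs node the caller already holds), the
      convention the chain above now follows looks like this, with truncation
      surfacing as -E2BIG instead of an oversized length:

      	char buf[PATH_MAX];
      	int len;

      	len = kernfs_path(kn, buf, sizeof(buf));
      	if (len < 0)		/* -E2BIG on truncation, -errno on other errors */
      		return len;	/* old callers had to compare len >= sizeof(buf) */
      	pr_info("cgroup path: %s\n", buf);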
      
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc:  <cgroups@vger.kernel.org>
      Co-developed-by: Azeem Shaikh <azeemshaikh38@gmail.com>
      Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
      Link: https://lore.kernel.org/r/20231116192127.1558276-3-keescook@chromium.org
      
      
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20231212211741.164376-3-keescook@chromium.org
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernel/cgroup: use kernfs_create_dir_ns() · fe3de010
      Max Kellermann authored
      
      By passing the fsugid to kernfs_create_dir_ns(), we don't need
      cgroup_kn_set_ugid() any longer.  That function was added for exactly
      this purpose by commit 49957f8e ("cgroup: newly created dirs and
      files should be owned by the creator").
      
      Eliminating this piece of duplicate code means we benefit from future
      improvements to kernfs_create_dir_ns(); for example, both are lacking
      S_ISGID support currently, which my next patch will add to
      kernfs_create_dir_ns().  It cannot (easily) be added to
      cgroup_kn_set_ugid() because we can't dereference struct kernfs_iattrs
      from there.
      
      --
      v1 -> v2: 12-digit commit id
      
      Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20231208093310.297233-1-max.kellermann@ionos.com
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. Dec 06, 2023
    • cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check · 3232e7aa
      Waiman Long authored
      
      Currently, the cpu_is_isolated() function checks only the statically
      isolated CPUs specified via the "isolcpus" and "nohz_full" kernel
      command line options. This function is used by vmstat and memcg to
      reduce interference with isolated CPUs by not doing stat flushing
      or scheduling work items on those CPUs.
      
      Workloads running on isolated CPUs within isolated cpuset
      partitions should receive the same treatment to reduce unnecessary
      interference. This patch introduces a new cpuset_cpu_is_isolated()
      function to be called by cpu_is_isolated() so that the set of dynamically
      created cpuset isolated CPUs will be included in the check.
      
      Assuming that testing a bit in a cpumask is atomic, no synchronization
      primitive is currently used to synchronize access to the cpuset's
      isolated_cpus mask.
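
      The combined check ends up looking roughly like the sketch below
      (paraphrased, not a verbatim copy of the patch):

      	static inline bool cpu_is_isolated(int cpu)
      	{
      		return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
      		       !housekeeping_test_cpu(cpu, HK_TYPE_TICK) ||
      		       cpuset_cpu_is_isolated(cpu);
      	}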
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  3. Dec 01, 2023
    • cgroup/rstat: Optimize cgroup_rstat_updated_list() · d499fd41
      Waiman Long authored
      
      The current design of cgroup_rstat_cpu_pop_updated() is to traverse
      the updated tree in a way to pop out the leaf nodes first before
      their parents. This can cause traversal of multiple nodes before a
      leaf node can be found and popped out. IOW, a given node in the tree
      can be visited multiple times before the whole operation is done. So
      it is not very efficient and the code can be hard to read.
      
      With the introduction of cgroup_rstat_updated_list() to build a list
      of cgroups to be flushed before any flushing is done, we can optimize
      the way the updated tree nodes are popped by pushing the parents to
      the tail end of the list before their children. In this way, most
      updated tree nodes will be visited only once, with the exception of
      the subtree root, as we still need to go back to its parent and pop
      it out of its updated_children list. This also makes the code easier
      to read.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  4. Nov 28, 2023
    • cgroup_freezer: cgroup_freezing: Check if not frozen · cff5f49d
      Tim Van Patten authored
      
      __thaw_task() was recently updated to warn if the task being thawed was
      part of a freezer cgroup that is still currently freezing:
      
      	void __thaw_task(struct task_struct *p)
      	{
      	...
      		if (WARN_ON_ONCE(freezing(p)))
      			goto unlock;
      
      This has exposed a bug in cgroup1 freezing where when CGROUP_FROZEN is
      asserted, the CGROUP_FREEZING bits are not also cleared at the same
      time. Meaning, when a cgroup is marked FROZEN it continues to be marked
      FREEZING as well. This causes the WARNING to trigger, because
      cgroup_freezing() thinks the cgroup is still freezing.
      
      There are two ways to fix this:
      
      1. Whenever FROZEN is set, clear FREEZING for the cgroup and all
      children cgroups.
      2. Update cgroup_freezing() to also verify that FROZEN is not set.
      
      This patch implements option (2), since it's smaller and more
      straightforward.
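
      A paraphrased sketch of option (2); the real patch may differ in detail:

      	bool cgroup_freezing(struct task_struct *task)
      	{
      		unsigned int state;
      		bool ret;

      		rcu_read_lock();
      		/* a cgroup that has reached FROZEN is no longer "freezing" */
      		state = task_freezer(task)->state;
      		ret = (state & CGROUP_FREEZING) && !(state & CGROUP_FROZEN);
      		rcu_read_unlock();

      		return ret;
      	}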
      
      Signed-off-by: Tim Van Patten <timvp@google.com>
      Tested-by: Mark Hasemeyer <markhas@chromium.org>
      Fixes: f5d39b02 ("freezer,sched: Rewrite core freezer logic")
      Cc: stable@vger.kernel.org # v6.1+
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup/cpuset: Expose cpuset.cpus.isolated · 877c737d
      Waiman Long authored
      
      The root-only cpuset.cpus.isolated control file shows the current set
      of isolated CPUs in isolated partitions. This control file is currently
      exposed only with the cgroup_debug boot command line option which also
      adds the ".__DEBUG__." prefix. This is actually a useful control file if
      users want to find out which CPUs are currently in an isolated state by
      the cpuset controller. Remove CFTYPE_DEBUG flag for this control file and
      make it available by default without any prefix.
      
      The test_cpuset_prs.sh test script and the cgroup-v2.rst documentation
      file are also updated accordingly. A minor code change is also made in
      test_cpuset_prs.sh to avoid a false test failure when running on a
      debug kernel.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  5. Nov 14, 2023
    • sched: psi: fix unprivileged polling against cgroups · 8b39d20e
      Johannes Weiner authored
      
      519fabc7 ("psi: remove 500ms min window size limitation for
      triggers") breaks unprivileged psi polling on cgroups.
      
      Historically, we had a privilege check for polling in the open() of a
      pressure file in /proc, but were erroneously missing it for the open()
      of cgroup pressure files.
      
      When unprivileged polling was introduced in d82caa27 ("sched/psi:
      Allow unprivileged polling of N*2s period"), it needed to filter
      privileges depending on the exact polling parameters, and as such
      moved the CAP_SYS_RESOURCE check from the proc open() callback to
      psi_trigger_create(). Both the proc files as well as cgroup files go
      through this during write(). This implicitly added the missing check
      for privileges required for HT polling for cgroups.
      
      When 519fabc7 ("psi: remove 500ms min window size limitation for
      triggers") followed right after to remove further restrictions on the
      RT polling window, it incorrectly assumed the cgroup privilege check
      was still missing and added it to the cgroup open(), mirroring what we
      used to do for proc files in the past.
      
      As a result, unprivileged poll requests that would be supported now
      get rejected when opening the cgroup pressure file for writing.
      
      Remove the cgroup open() check. psi_trigger_create() handles it.
      
      Fixes: 519fabc7 ("psi: remove 500ms min window size limitation for triggers")
      Reported-by: Luca Boccassi <bluca@debian.org>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Luca Boccassi <bluca@debian.org>
      Acked-by: Suren Baghdasaryan <surenb@google.com>
      Cc: stable@vger.kernel.org # 6.5+
      Link: https://lore.kernel.org/r/20231026164114.2488682-1-hannes@cmpxchg.org
  6. Nov 12, 2023
    • cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked() · e76d28bd
      Waiman Long authored
      
      When cgroup_rstat_updated() isn't being called concurrently with
      cgroup_rstat_flush_locked(), its run time is pretty short. When
      both are called concurrently, the cgroup_rstat_updated() run time
      can spike to a pretty high value due to high cpu_lock hold time in
      cgroup_rstat_flush_locked(). This can be problematic if the task calling
      cgroup_rstat_updated() is a realtime task running on an isolated CPU
      with a strict latency requirement. The cgroup_rstat_updated() call can
      happen when there is a page fault even though the task is running in
      user space most of the time.
      
      The percpu cpu_lock is used to protect the update tree -
      updated_next and updated_children. This protection is only needed when
      cgroup_rstat_cpu_pop_updated() is being called. The subsequent flushing
      operation which can take a much longer time does not need that protection
      as it is already protected by cgroup_rstat_lock.
      
      To reduce the cpu_lock hold time, we need to perform all the
      cgroup_rstat_cpu_pop_updated() calls up front with the lock
      released afterward before doing any flushing. This patch adds a new
      cgroup_rstat_updated_list() function to return a singly linked list of
      cgroups to be flushed.
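
      A hedged sketch of the resulting flush-side pattern (the list-link field
      name is assumed from this series; not the verbatim patch):

      	for_each_possible_cpu(cpu) {
      		/* cpu_lock is held only while the per-cpu list is built */
      		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);

      		/* flushing runs without cpu_lock, under cgroup_rstat_lock */
      		for (; pos; pos = pos->rstat_flush_next)
      			cgroup_base_stat_flush(pos, cpu);
      	}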
      
      Some instrumentation code is added to measure the cpu_lock hold time
      from right after lock acquisition to right after the lock is released.
      A parallel kernel build on a 2-socket x86-64 server is used as the
      benchmark for measuring the lock hold time.
      
      The maximum cpu_lock hold time before and after the patch are 100us and
      29us respectively. So the worst case time is reduced to about 30% of
      the original. However, there may be some OS or hardware noises like NMI
      or SMI in the test system that can worsen the worst case value. Those
      noises are usually tuned out in a real production environment to get
      a better result.
      
      OTOH, the lock hold time frequency distribution should give a better
      idea of the performance benefit of the patch.  Below were the frequency
      distribution before and after the patch:
      
           Hold time        Before patch       After patch
           ---------        ------------       -----------
             0-01 us           804,139         13,738,708
            01-05 us         9,772,767          1,177,194
            05-10 us         4,595,028              4,984
            10-15 us           303,481              3,562
            15-20 us            78,971              1,314
            20-25 us            24,583                 18
            25-30 us             6,908                 12
            30-40 us             8,015
            40-50 us             2,192
            50-60 us               316
            60-70 us                43
            70-80 us                 7
            80-90 us                 2
              >90 us                 3
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask · 72c6303a
      Waiman Long authored
      
      To make CPUs in isolated cpuset partition closer in isolation to
      the boot time isolated CPUs specified in the "isolcpus" boot command
      line option, we need to take those CPUs out of the workqueue unbound
      cpumask so that work functions from the unbound workqueues won't run
      on those CPUs.  Otherwise, they will interfere with the user tasks
      running on those isolated CPUs.
      
      With the introduction of the workqueue_unbound_exclude_cpumask() helper
      function in an earlier commit, those isolated CPUs can now be taken
      out from the workqueue unbound cpumask.
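
      A hedged usage sketch (not the actual cpuset diff); isolated_cpus is the
      cpumask introduced earlier in this series:

      	/* tell the workqueue code to keep unbound workers off these CPUs */
      	ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
      	if (ret)
      		pr_warn("cpuset: excluding isolated CPUs from unbound workqueues failed\n");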
      
      This patch also updates cgroup-v2.rst to mention that isolated
      CPUs will be excluded from unbound workqueue cpumask as well as
      updating test_cpuset_prs.sh to verify the correctness of the new
      *cpuset.cpus.isolated file, if available via cgroup_debug option.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup/cpuset: Keep track of CPUs in isolated partitions · 11e5f407
      Waiman Long authored
      
      Add a new internal isolated_cpus mask to keep track of the CPUs that are in
      isolated partitions. Expose that new cpumask as a new root-only control file
      ".cpuset.cpus.isolated".
      
      tj: Updated patch description to reflect dropping __DEBUG__ prefix.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  7. Nov 09, 2023
    • cgroup: Add a new helper for cgroup1 hierarchy · aecd408b
      Yafang Shao authored
      
      A new helper is added for cgroup1 hierarchy:
      
      - task_get_cgroup1
        Acquires the associated cgroup of a task within a specific cgroup1
        hierarchy. The cgroup1 hierarchy is identified by its hierarchy ID.
      
      This helper function is added to facilitate the tracing of tasks within
      a particular container or cgroup dir in BPF programs. It's important to
      note that this helper is designed specifically for cgroup1 only.
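
      A hedged usage sketch (error handling and variable names are
      illustrative, not taken from the patch):

      	struct cgroup *cgrp;

      	cgrp = task_get_cgroup1(task, hierarchy_id);
      	if (IS_ERR_OR_NULL(cgrp))
      		return 0;	/* hierarchy not mounted or task gone */
      	/* e.g. compare cgroup_id(cgrp) against the container of interest */
      	cgroup_put(cgrp);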
      
      tj: Use irqsave/restore as suggested by Hou Tao <houtao@huaweicloud.com>.
      
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Cc: Hou Tao <houtao@huaweicloud.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: Add annotation for holding namespace_sem in current_cgns_cgroup_from_root() · 0008454e
      Yafang Shao authored
      When I initially examined the function current_cgns_cgroup_from_root(), I
      was perplexed by its lack of holding cgroup_mutex. However, after Michal
      explained the reason[0] to me, I realized that it already holds the
      namespace_sem. I believe this intricacy could also confuse others, so it
      would be advisable to include an annotation for clarification.
      
      After we replace the cgroup_mutex with RCU read lock, if current doesn't
      hold the namespace_sem, the root cgroup will be NULL. So let's add a
      WARN_ON_ONCE() for it.
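
      A paraphrased sketch of the added warning (not the verbatim patch):

      	/* res is the cgroup found for current's cgroup namespace in @root;
      	 * with namespace_sem held by the caller, the lookup cannot fail */
      	WARN_ON_ONCE(!res);
      	return res;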
      
      [0]. https://lore.kernel.org/bpf/afdnpo3jz2ic2ampud7swd6so5carkilts2mkygcaw67vbw6yh@5b5mncf7qyet
      
      
      
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Cc: Michal Koutny <mkoutny@suse.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: Eliminate the need for cgroup_mutex in proc_cgroup_show() · 9067d900
      Yafang Shao authored
      
      The cgroup root_list is already RCU-safe. Therefore, we can replace the
      cgroup_mutex with the RCU read lock in some particular paths. This change
      will be particularly beneficial for frequent operations, such as
      `cat /proc/self/cgroup`, in a cgroup1-based container environment.
      
      I did stress tests with this change, as outlined below
      (with CONFIG_PROVE_RCU_LIST enabled):
      
      - Continuously mounting and unmounting named cgroups in some tasks,
        for example:
      
        cgrp_name=$1
        while true
        do
            mount -t cgroup -o none,name=$cgrp_name none /$cgrp_name
            umount /$cgrp_name
        done
      
      - Continuously triggering proc_cgroup_show() in some tasks concurrently,
        for example:
        while true; do cat /proc/self/cgroup > /dev/null; done
      
      They ran successfully after implementing this change, with no RCU
      warnings in dmesg.
      
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: Make operations on the cgroup root_list RCU safe · d23b5c57
      Yafang Shao authored
      
      At present, when we perform operations on the cgroup root_list, we must
      hold the cgroup_mutex, which is a relatively heavyweight lock. In reality,
      we can make operations on this list RCU-safe, eliminating the need to hold
      the cgroup_mutex during traversal. Modifications to the list only occur in
      the cgroup root setup and destroy paths, which should be infrequent in a
      production environment. In contrast, traversal may occur frequently.
      Therefore, making it RCU-safe would be beneficial.
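
      A hedged sketch of the intended pattern (not the exact diff); writers
      still serialize on cgroup_mutex:

      	/* on root setup (rare): */
      	list_add_tail_rcu(&root->root_list, &cgroup_roots);

      	/* on root destruction (rare): */
      	list_del_rcu(&root->root_list);

      	/* frequent read-side traversal now needs only RCU: */
      	rcu_read_lock();
      	list_for_each_entry_rcu(root, &cgroup_roots, root_list) {
      		/* read-only inspection of root */
      	}
      	rcu_read_unlock();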
      
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: Remove unnecessary list_empty() · 96a2b48e
      Yafang Shao authored
      
      The root hasn't been removed from the root_list yet, so the list can't
      be empty here. However, if it had already been removed, attempting to
      destroy it once more would not be valid. Let's replace the list_empty()
      check with WARN_ON_ONCE() for clarity.
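
      A paraphrased sketch of the change (not the verbatim diff):

      	/* the root must still be on root_list here; warn instead of
      	 * silently skipping the removal */
      	WARN_ON_ONCE(list_empty(&root->root_list));
      	list_del_rcu(&root->root_list);
      	cgroup_root_count--;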
      
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  8. Oct 18, 2023
    • hugetlb: memcg: account hugetlb-backed memory in memory controller · 8cba9576
      Nhat Pham authored
      Currently, hugetlb memory usage is not accounted for in the memory
      controller, which could lead to memory overprotection for cgroups with
      hugetlb-backed memory.  This has been observed in our production system.
      
      For instance, here is one of our usecases: suppose there are two 32G
      containers.  The machine is booted with hugetlb_cma=6G, and each container
      may or may not use up to 3 gigantic pages, depending on the workload within
      it.  The rest is anon, cache, slab, etc.  We can set the hugetlb cgroup
      limit of each cgroup to 3G to enforce hugetlb fairness.  But it is very
      difficult to configure memory.max to keep overall consumption, including
      anon, cache, slab etc.  fair.
      
      What we have had to resort to is to constantly poll hugetlb usage and
      readjust memory.max.  Similar procedure is done to other memory limits
      (memory.low for e.g).  However, this is rather cumbersome and buggy. 
      Furthermore, when there is a delay in memory limits correction, (for e.g
      when hugetlb usage changes within consecutive runs of the userspace
      agent), the system could be in an over/underprotected state.
      
      This patch rectifies this issue by charging the memcg when the hugetlb
      folio is utilized, and uncharging when the folio is freed (analogous to
      the hugetlb controller).  Note that we do not charge when the folio is
      allocated to the hugetlb pool, because at this point it is not owned by
      any memcg.
      
      Some caveats to consider:
        * This feature is only available on cgroup v2.
        * There is no hugetlb pool management involved in the memory
          controller. As stated above, hugetlb folios are only charged towards
          the memory controller when it is used. Host overcommit management
          has to consider it when configuring hard limits.
        * Failure to charge towards the memcg results in SIGBUS. This could
          happen even if the hugetlb pool still has pages (but the cgroup
          limit is hit and reclaim attempt fails).
        * When this feature is enabled, hugetlb pages contribute to memory
          reclaim protection. low, min limits tuning must take into account
          hugetlb memory.
        * Hugetlb pages utilized while this option is not selected will not
          be tracked by the memory controller (even if cgroup v2 is remounted
          later on).
      
      Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
      
      
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  9. Oct 09, 2023
    • cgroup: use legacy_name for cgroup v1 disable info · 27a6c5c5
      Kamalesh Babulal authored
      
      cgroup v1 or v2 or both controller names can be passed as arguments to
      the 'cgroup_no_v1' kernel parameter, though most of the controller's
      names are the same for both cgroup versions. This can be confusing when
      both versions are used interchangeably, i.e., passing cgroup_no_v1=io
      
      $ sudo dmesg |grep cgroup
      ...
      cgroup: Disabling io control group subsystem in v1 mounts
      cgroup: Disabled controller 'blkio'
      
      Make it consistent across the pr_info()'s, by using ss->legacy_name, as
      the subsystem name, while printing the cgroup v1 controller disabling
      information in cgroup_init().
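
      A paraphrased sketch of the resulting message (not the verbatim diff):

      	if (cgroup1_ssid_disabled(ssid))
      		pr_info("Disabling %s control group subsystem in v1 mounts\n",
      			ss->legacy_name);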
      
      Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: Remove duplicates in cgroup v1 tasks file · 1ca0b605
      Michal Koutný authored
      One PID may appear multiple times in a preloaded pidlist.
      (Possibly due to PID recycling but we have reports of the same
      task_struct appearing with different PIDs, thus possibly involving
      transfer of PID via de_thread().)
      
      Because v1 seq_file iterator uses PIDs as position, it leads to
      a message:
      > seq_file: buggy .next function kernfs_seq_next did not update position index
      
      A conservative and quick fix consists of removing duplicates from the
      `tasks` file (as opposed to removing pidlists altogether). It doesn't affect
      correctness (it's sufficient to show a PID once), performance impact
      would be hidden by unconditional sorting of the pidlist already in place
      (asymptotically).
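
      A hedged sketch of the dedup step (the real patch reuses existing
      cgroup1 pidlist helpers; the names shown come from that existing code):

      	/* sort, then strip duplicates for the tasks file as well,
      	 * not just for cgroup.procs */
      	sort(array, length, sizeof(pid_t), cmppid, NULL);
      	length = pidlist_uniq(array, length);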
      
      Link: https://lore.kernel.org/r/20230823174804.23632-1-mkoutny@suse.com/
      
      
      Suggested-by: Firo Yang <firo.yang@suse.com>
      Signed-off-by: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
  10. Sep 18, 2023
    • cgroup/cpuset: Check partition conflict with housekeeping setup · 4a74e418
      Waiman Long authored
      
      A user can pre-configure certain CPUs in an isolated state at boot time
      with the "isolcpus" kernel boot command line option. Those CPUs will
      not be in the housekeeping_cpumask(HK_TYPE_DOMAIN) and so will not
      be in any sched domains. This may conflict with the partition setup
      at runtime. Those boot time isolated CPUs should only be used in an
      isolated partition.
      
      This patch adds the necessary check and disallows partition setup if the
      check fails.
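
      A hedged sketch of the check (helper structure and error name are
      approximations, not the verbatim patch):

      	/* CPUs pulled out of HK_TYPE_DOMAIN by "isolcpus" may only be
      	 * claimed by an isolated partition */
      	if (new_prs != PRS_ISOLATED &&
      	    !cpumask_subset(new_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN)))
      		return PERR_HKEEPING;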
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup/cpuset: Introduce remote partition · 181c8e09
      Waiman Long authored
      
      One can use "cpuset.cpus.partition" to create multiple scheduling domains
      or to produce a set of isolated CPUs where load balancing is disabled.
      The former use case is less common but the latter one can be frequently
      used especially for the Telco use cases like DPDK.
      
      The existing "isolated" partition can be used to produce isolated
      CPUs if the applications have full control of a system. However, in a
      containerized environment where all the apps are run in a container,
      it is hard to distribute out isolated CPUs from the root down given
      the unified hierarchy nature of cgroup v2.
      
      The container running on isolated CPUs can be several layers down from
      the root. The current partition feature requires that all the ancestors
      of a leaf partition root must be partition roots themselves. This can
      be hard to configure.
      
      This patch introduces a new type of partition called remote partition.
      A remote partition is a partition whose parent is not a partition root
      itself and its CPUs are acquired directly from available CPUs in the
      top cpuset through a hierarchical distribution of exclusive CPUs down
      from it.
      
      By contrast, the existing type of partitions where their parents have
      to be valid partition roots are referred to as local partitions as they
      have to be clustered around a parent partition root.
      
      Child local partitions can be created under a remote partition, but
      a remote partition cannot be created under a local partition. We may
      relax this limitation in the future if there are use cases for such
      configuration.
      
      Manually writing to the "cpuset.cpus.exclusive" file is not necessary
      when creating local partitions.  However, writing proper values to
      "cpuset.cpus.exclusive" down the cgroup hierarchy before the target
      remote partition root is mandatory for the creation of a remote
      partition.
      
      The value in "cpuset.cpus.exclusive.effective" may change if its
      "cpuset.cpus" or its parent's "cpuset.cpus.exclusive.effective" changes.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup/cpuset: Add cpuset.cpus.exclusive for v2 · e2ffe502
      Waiman Long authored
      
      This patch introduces a new writable "cpuset.cpus.exclusive" control
      file for v2 which will be added to non-root cpuset enabled cgroups. This new
      file enables users to set a smaller list of exclusive CPUs to be used in
      the creation of a cpuset partition.
      
      The value written to "cpuset.cpus.exclusive" may not be the effective
      value being used for the creation of cpuset partition, the effective
      value will show up in "cpuset.cpus.exclusive.effective" and it is
      subject to the constraint that it must also be a subset of cpus_allowed
      and parent's "cpuset.cpus.exclusive.effective".
      
      By writing to "cpuset.cpus.exclusive", "cpuset.cpus.exclusive.effective"
      may be set to a non-empty value even for cgroups that are not valid
      partition roots yet.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2 · 0c7f293e
      Waiman Long authored
      
      The creation of a cpuset partition means dedicating a set of exclusive
      CPUs to be used by a particular partition only. These exclusive CPUs
      will not be used by any cpusets outside of that partition.
      
      To enable more flexibility in creating partitions, we need a way to
      distribute exclusive CPUs that can be used in new partitions. Currently,
      we have a subparts_cpus cpumask in struct cpuset that tracks only
      the exclusive CPUs used by all the sub-partitions underneath a given
      cpuset.
      
      This patch reworks the way we do exclusive CPUs tracking. The
      subparts_cpus is now renamed to effective_xcpus which tracks the
      exclusive CPUs allocated to a partition root including those that are
      further distributed down to sub-partitions underneath it. IOW, it also
      includes the exclusive CPUs used by the current partition root. Note
      that effective_xcpus can contain offline CPUs and it will always be a
      subset of cpus_allowed.
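
      As a sketch of the renamed field (excerpt only; the real struct cpuset
      has many more members):

      	struct cpuset {
      		/* ... other members elided ... */

      		/* formerly subparts_cpus: exclusive CPUs of this partition
      		 * root, including those further distributed to
      		 * sub-partitions below it; may contain offline CPUs and is
      		 * always a subset of cpus_allowed */
      		cpumask_var_t effective_xcpus;
      	};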
      
      The renamed effective_xcpus is now exposed via a new read-only
      "cpuset.cpus.exclusive.effective" control file. The new effective_xcpus
      cpumask should be set to cpus_allowed when a cpuset becomes a partition
      root and be cleared if it is not a valid partition root.
      
      In the next patch, we will enable write to another new control file to
      enable further control of what can get into effective_xcpus.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup/cpuset: Fix load balance state in update_partition_sd_lb() · 6fcdb018
      Waiman Long authored
      
      Commit a86ce680 ("cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE
      & CS_SCHED_LOAD_BALANCE handling") adds a new helper function
      update_partition_sd_lb() to update the load balance state of the
      cpuset. However the new load balance is determined by just looking at
      whether the cpuset is a valid isolated partition root or not.  That is
      not enough if the cpuset is not a valid partition root but its parent
      is in the isolated state (load balance off). Update the function to
      set the new state to be the same as its parent in this case like what
      has been done in commit c8c92620 ("cgroup/cpuset: Inherit parent's
      load balance state in v2").
      
      Fixes: a86ce680 ("cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE handling")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: Avoid extra dereference in css_populate_dir() · d24f0598
      Kamalesh Babulal authored
      
      Use css directly instead of dereferencing it from &cgroup->self while
      adding the cgroup v2 cft base and psi files in css_populate_dir(). Both
      point to the same css when css->ss is NULL; this avoids extra dereferences
      and makes the usage consistent across the function.
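
      A paraphrased sketch of the resulting calls for the v2 base and psi
      files, where css is &cgrp->self (not the verbatim diff):

      	ret = cgroup_addrm_files(css, cgrp, cgroup_base_files, true);
      	if (ret < 0)
      		return ret;

      	if (cgroup_psi_enabled()) {
      		ret = cgroup_addrm_files(css, cgrp, cgroup_psi_files, true);
      		if (ret < 0)
      			return ret;
      	}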
      
      Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: Check for ret during cgroup1_base_files cft addition · fd55c0ad
      Kamalesh Babulal authored
      
      There is no check for possible failure while populating the
      cgroup1_base_files cft in css_populate_dir(), unlike its cgroup v2
      counterparts cgroup_{base,psi}_files.  In case of failure, the cgroup
      might not be set up right.  Add a ret value check to return on failure.
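
      A paraphrased sketch of the added check (not the verbatim diff):

      	ret = cgroup_addrm_files(css, cgrp, cgroup1_base_files, true);
      	if (ret < 0)
      		return ret;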
      
      Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  11. Sep 02, 2023
    • cgroup: fix build when CGROUP_SCHED is not enabled · 76be05d4
      Linus Torvalds authored
      
      Sudip Mukherjee reports that the mips sb1250_swarm_defconfig build fails
      with the current kernel.  It isn't actually MIPS-specific, it's just
      that that defconfig does not have CGROUP_SCHED enabled like most configs
      do, and as such shows this error:
      
        kernel/cgroup/cgroup.c: In function 'cgroup_local_stat_show':
        kernel/cgroup/cgroup.c:3699:15: error: implicit declaration of function 'cgroup_tryget_css'; did you mean 'cgroup_tryget'? [-Werror=implicit-function-declaration]
         3699 |         css = cgroup_tryget_css(cgrp, ss);
              |               ^~~~~~~~~~~~~~~~~
              |               cgroup_tryget
        kernel/cgroup/cgroup.c:3699:13: warning: assignment to 'struct cgroup_subsys_state *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
         3699 |         css = cgroup_tryget_css(cgrp, ss);
              |             ^
      
      because cgroup_tryget_css() only exists when CGROUP_SCHED is enabled,
      and the cgroup_local_stat_show() function should similarly be guarded by
      that config option.
      
      Move things around a bit to fix this all.
      
      Fixes: d1d4ff5d ("cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED")
      Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. Aug 17, 2023
    • cgroup: Avoid -Wstringop-overflow warnings · 78d44b82
      Gustavo A. R. Silva authored
      Change the notation from pointer-to-array to pointer-to-pointer.
      With this, we avoid the compiler complaining about trying
      to access a region of size zero as an argument during function
      calls.
      
      This is a workaround to prevent the compiler complaining about
      accessing an array of size zero when evaluating the arguments
      of a couple of function calls. See below:
      
      kernel/cgroup/cgroup.c: In function 'find_css_set':
      kernel/cgroup/cgroup.c:1206:16: warning: 'find_existing_css_set' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
       1206 |         cset = find_existing_css_set(old_cset, cgrp, template);
            |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      kernel/cgroup/cgroup.c:1206:16: note: referencing argument 3 of type 'struct cgroup_subsys_state *[0]'
      kernel/cgroup/cgroup.c:1071:24: note: in a call to function 'find_existing_css_set'
       1071 | static struct css_set *find_existing_css_set(struct css_set *old_cset,
            |                        ^~~~~~~~~~~~~~~~~~~~~
      
      With the change to pointer-to-pointer, the functions are not prevented
      from being executed, and they will do what they have to do when
      CGROUP_SUBSYS_COUNT == 0.
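
      As a sketch, the notation change on one of the affected helpers looks
      like this (prototype only, reconstructed from the warning text; behavior
      is unchanged):

      	/* before: */
      	static struct css_set *find_existing_css_set(struct css_set *old_cset,
      					struct cgroup *cgrp,
      					struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT]);

      	/* after: */
      	static struct css_set *find_existing_css_set(struct css_set *old_cset,
      					struct cgroup *cgrp,
      					struct cgroup_subsys_state **template);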
      
      Address the following -Wstringop-overflow warnings seen when
      built with ARM architecture and aspeed_g4_defconfig configuration
      (notice that under this configuration CGROUP_SUBSYS_COUNT == 0):
      
      kernel/cgroup/cgroup.c:1208:16: warning: 'find_existing_css_set' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
      kernel/cgroup/cgroup.c:1258:15: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
      kernel/cgroup/cgroup.c:6089:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
      kernel/cgroup/cgroup.c:6153:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=]
      
      This results in no differences in binary output.
      
      Link: https://github.com/KSPP/linux/issues/316
      
      
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  13. Aug 07, 2023
    • cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants · 0437719c
      Hao Jia authored
      
      The member variable bstat of the structure cgroup_rstat_cpu
      records the per-cpu time of the cgroup itself, but does not
      include the per-cpu time of its descendants. The per-cpu time
      including descendants is very useful for calculating the
      per-cpu usage of cgroups.
      
      We could indirectly obtain the total per-cpu time of the cgroup and
      its descendants by accumulating the per-cpu bstat of each descendant
      of the cgroup, but after a child cgroup is removed we lose its bstat
      information. This would cause the cumulative value to be non-monotonic,
      affecting the accuracy of cgroup per-cpu usage.
      
      So we add the subtree_bstat variable to record the total
      per-cpu time of this cgroup and its descendants, which is
      similar to "cpuacct.usage*" in cgroup v1. And this is
      also helpful for the migration from cgroup v1 to cgroup v2.
      After adding this variable, we can obtain the per-cpu time of
      cgroup and its descendants in user mode through eBPF/drgn, etc.
      And we are still trying to determine how to expose it in the
      cgroupfs interface.
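
      A sketch of the per-cpu bookkeeping this adds (excerpt; the real struct
      has more members):

      	struct cgroup_rstat_cpu {
      		/* ... */
      		struct cgroup_base_stat bstat;		/* this cgroup only */
      		struct cgroup_base_stat subtree_bstat;	/* cgroup + descendants */
      		/* ... */
      	};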
      
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: clean up if condition in cgroup_pidlist_start() · e7e64a1b
      Miaohe Lin authored
      
      There's no need to use '<=' when knowing 'l->list[mid] != pid' already.
      No functional change intended.
      
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>