Skip to content
Snippets Groups Projects
  1. Jul 24, 2024
    • Joel Granados's avatar
      sysctl: treewide: constify the ctl_table argument of proc_handlers · 78eb4ea2
      Joel Granados authored
      
      const qualify the struct ctl_table argument in the proc_handler function
      signatures. This is a prerequisite to moving the static ctl_table
      structs into .rodata data which will ensure that proc_handler function
      pointers cannot be modified.
      
      This patch has been generated by the following coccinelle script:
      
      ```
        virtual patch
      
        @r1@
        identifier ctl, write, buffer, lenp, ppos;
        identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
        @@
      
        int func(
        - struct ctl_table *ctl
        + const struct ctl_table *ctl
          ,int write, void *buffer, size_t *lenp, loff_t *ppos);
      
        @r2@
        identifier func, ctl, write, buffer, lenp, ppos;
        @@
      
        int func(
        - struct ctl_table *ctl
        + const struct ctl_table *ctl
          ,int write, void *buffer, size_t *lenp, loff_t *ppos)
        { ... }
      
        @r3@
        identifier func;
        @@
      
        int func(
        - struct ctl_table *
        + const struct ctl_table *
          ,int , void *, size_t *, loff_t *);
      
        @r4@
        identifier func, ctl;
        @@
      
        int func(
        - struct ctl_table *ctl
        + const struct ctl_table *ctl
          ,int , void *, size_t *, loff_t *);
      
        @r5@
        identifier func, write, buffer, lenp, ppos;
        @@
      
        int func(
        - struct ctl_table *
        + const struct ctl_table *
          ,int write, void *buffer, size_t *lenp, loff_t *ppos);
      
      ```
      
      * Code formatting was adjusted in xfs_sysctl.c to comply with code
        conventions. The xfs_stats_clear_proc_handler,
        xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
        adjusted.
      
      * The ctl_table argument in proc_watchdog_common was const qualified.
        This is called from a proc_handler itself and is calling back into
        another proc_handler, making it necessary to change it as part of the
        proc_handler migration.
      
      Co-developed-by: default avatarThomas Weißschuh <linux@weissschuh.net>
      Signed-off-by: default avatarThomas Weißschuh <linux@weissschuh.net>
      Co-developed-by: default avatarJoel Granados <j.granados@samsung.com>
      Signed-off-by: default avatarJoel Granados <j.granados@samsung.com>
      78eb4ea2
  2. Apr 26, 2024
  3. Apr 24, 2024
  4. Apr 22, 2024
  5. Feb 23, 2024
  6. Feb 22, 2024
    • Alexey Gladkov's avatar
      sysctl: allow change system v ipc sysctls inside ipc namespace · 50ec499b
      Alexey Gladkov authored
      Patch series "Allow to change ipc/mq sysctls inside ipc namespace", v3.
      
      Right now ipc and mq limits count as per ipc namespace, but only real root
      can change them.  By default, the current values of these limits are such
      that it can only be reduced.  Since only root can change the values, it is
      impossible to reduce these limits in the rootless container.
      
      We can allow limit changes within ipc namespace because mq parameters are
      limited by RLIMIT_MSGQUEUE and ipc parameters are not limited to anything
      other than cgroups.
      
      
      This patch (of 3):
      
      Rootless containers are not allowed to modify kernel IPC parameters.
      
      All default limits are set to such high values that in fact there are no
      limits at all.  All limits are not inherited and are initialized to
      default values when a new ipc_namespace is created.
      
      For new ipc_namespace:
      
      size_t       ipc_ns.shm_ctlmax = SHMMAX; // (ULONG_MAX - (1UL << 24))
      size_t       ipc_ns.shm_ctlall = SHMALL; // (ULONG_MAX - (1UL << 24))
      int          ipc_ns.shm_ctlmni = IPCMNI; // (1 << 15)
      int          ipc_ns.shm_rmid_forced = 0;
      unsigned int ipc_ns.msg_ctlmax = MSGMAX; // 8192
      unsigned int ipc_ns.msg_ctlmni = MSGMNI; // 32000
      unsigned int ipc_ns.msg_ctlmnb = MSGMNB; // 16384
      
      The shm_tot (total amount of shared pages) has also ceased to be global,
      it is located in ipc_namespace and is not inherited from anywhere.
      
      In such conditions, it cannot be said that these limits limit anything. 
      The real limiter for them is cgroups.
      
      If we allow rootless containers to change these parameters, then it can
      only be reduced.
      
      Link: https://lkml.kernel.org/r/cover.1705333426.git.legion@kernel.org
      Link: https://lkml.kernel.org/r/d2f4603305cbfed58a24755aa61d027314b73a45.1705333426.git.legion@kernel.org
      
      
      Signed-off-by: default avatarAlexey Gladkov <legion@kernel.org>
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Link: https://lkml.kernel.org/r/e2d84d3ec0172cfff759e6065da84ce0cc2736f8.1663756794.git.legion@kernel.org
      
      
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Joel Granados <joel.granados@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      50ec499b
  7. Aug 15, 2023
    • Joel Granados's avatar
      sysctl: Add a size arg to __register_sysctl_table · bff97cf1
      Joel Granados authored
      
      We make these changes in order to prepare __register_sysctl_table and
      its callers for when we remove the sentinel element (empty element at
      the end of ctl_table arrays). We don't actually remove any sentinels in
      this commit, but we *do* make sure to use ARRAY_SIZE so the table_size
      is available when the removal occurs.
      
      We add a table_size argument to __register_sysctl_table and adjust
      callers, all of which pass ctl_table pointers and need an explicit call
      to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in
      order to forward the size of the array pointer received from the network
      register calls.
      
      The new table_size argument does not yet have any effect in the
      init_header call which is still dependent on the sentinel's presence.
      table_size *does* however drive the `kzalloc` allocation in
      __register_sysctl_table with no adverse effects as the allocated memory
      is either one element greater than the calculated ctl_table array (for
      the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size
      of the calculated ctl_table array (for the call from sysctl_net.c and
      register_sysctl). This approach will allows us to "just" remove the
      sentinel without further changes to __register_sysctl_table as
      table_size will represent the exact size for all the callers at that
      point.
      
      Signed-off-by: default avatarJoel Granados <j.granados@samsung.com>
      Signed-off-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
      bff97cf1
  8. May 03, 2022
  9. Mar 08, 2022
  10. Nov 09, 2021
  11. Sep 05, 2020
  12. Apr 27, 2020
  13. Jul 19, 2019
  14. Jun 05, 2019
  15. May 15, 2019
    • Manfred Spraul's avatar
      ipc: do cyclic id allocation for the ipc object. · 99db46ea
      Manfred Spraul authored
      For ipcmni_extend mode, the sequence number space is only 7 bits.  So
      the chance of id reuse is relatively high compared with the non-extended
      mode.
      
      To alleviate this id reuse problem, this patch enables cyclic allocation
      for the index to the radix tree (idx).  The disadvantage is that this
      can cause a slight slow-down of the fast path, as the radix tree could
      be higher than necessary.
      
      To limit the radix tree height, I have chosen the following limits:
       1) The cycling is done over in_use*1.5.
       2) At least, the cycling is done over
         "normal" ipcnmi mode: RADIX_TREE_MAP_SIZE elements
         "ipcmni_extended": 4096 elements
      
      Result:
      - for normal mode:
      	No change for <= 42 active ipc elements. With more than 42
      	active ipc elements, a 2nd level would be added to the radix
      	tree.
      	Without cyclic allocation, a 2nd level would be added only with
      	more than 63 active elements.
      
      - for extended mode:
      	Cycling creates always at least a 2-level radix tree.
      	With more than 2730 active objects, a 3rd level would be
      	added, instead of > 4095 active objects until the 3rd level
      	is added without cyclic allocation.
      
      For a 2-level radix tree compared to a 1-level radix tree, I have
      observed < 1% performance impact.
      
      Notes:
      1) Normal "x=semget();y=semget();" is unaffected: Then the idx
        is e.g. a and a+1, regardless if idr_alloc() or idr_alloc_cyclic()
        is used.
      
      2) The -1% happens in a microbenchmark after this situation:
      	x=semget();
      	for(i=0;i<4000;i++) {t=semget();semctl(t,0,IPC_RMID);}
      	y=semget();
      	Now perform semget calls on x and y that do not sleep.
      
      3) The worst-case reuse cycle time is unfortunately unaffected:
         If you have 2^24-1 ipc objects allocated, and get/remove the last
         possible element in a loop, then the id is reused after 128
         get/remove pairs.
      
      Performance check:
      A microbenchmark that performes no-op semop() randomly on two IDs,
      with only these two IDs allocated.
      The IDs were set using /proc/sys/kernel/sem_next_id.
      The test was run 5 times, averages are shown.
      
      1 & 2: Base (6.22 seconds for 10.000.000 semops)
      1 & 40: -0.2%
      1 & 3348: - 0.8%
      1 & 27348: - 1.6%
      1 & 15777204: - 3.2%
      
      Or: ~12.6 cpu cycles per additional radix tree level.
      The cpu is an Intel I3-5010U. ~1300 cpu cycles/syscall is slower
      than what I remember (spectre impact?).
      
      V2 of the patch:
      - use "min" and "max"
      - use RADIX_TREE_MAP_SIZE * RADIX_TREE_MAP_SIZE instead of
      	(2<<12).
      
      [akpm@linux-foundation.org: fix max() warning]
      Link: http://lkml.kernel.org/r/20190329204930.21620-3-longman@redhat.com
      
      
      Signed-off-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Acked-by: default avatarWaiman Long <longman@redhat.com>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      99db46ea
    • Waiman Long's avatar
      ipc: allow boot time extension of IPCMNI from 32k to 16M · 5ac893b8
      Waiman Long authored
      The maximum number of unique System V IPC identifiers was limited to
      32k.  That limit should be big enough for most use cases.
      
      However, there are some users out there requesting for more, especially
      those that are migrating from Solaris which uses 24 bits for unique
      identifiers.  To satisfy the need of those users, a new boot time kernel
      option "ipcmni_extend" is added to extend the IPCMNI value to 16M.  This
      is a 512X increase which should be big enough for users out there that
      need a large number of unique IPC identifier.
      
      The use of this new option will change the pattern of the IPC
      identifiers returned by functions like shmget(2).  An application that
      depends on such pattern may not work properly.  So it should only be
      used if the users really need more than 32k of unique IPC numbers.
      
      This new option does have the side effect of reducing the maximum number
      of unique sequence numbers from 64k down to 128.  So it is a trade-off.
      
      The computation of a new IPC id is not done in the performance critical
      path.  So a little bit of additional overhead shouldn't have any real
      performance impact.
      
      Link: http://lkml.kernel.org/r/20190329204930.21620-1-longman@redhat.com
      
      
      Signed-off-by: default avatarWaiman Long <longman@redhat.com>
      Acked-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5ac893b8
  16. Oct 31, 2018
  17. Dec 13, 2014
    • Manfred Spraul's avatar
      ipc/msg: increase MSGMNI, remove scaling · 0050ee05
      Manfred Spraul authored
      
      SysV can be abused to allocate locked kernel memory.  For most systems, a
      small limit doesn't make sense, see the discussion with regards to SHMMAX.
      
      Therefore: increase MSGMNI to the maximum supported.
      
      And: If we ignore the risk of locking too much memory, then an automatic
      scaling of MSGMNI doesn't make sense.  Therefore the logic can be removed.
      
      The code preserves auto_msgmni to avoid breaking any user space applications
      that expect that the value exists.
      
      Notes:
      1) If an administrator must limit the memory allocations, then he can set
      MSGMNI as necessary.
      
      Or he can disable sysv entirely (as e.g. done by Android).
      
      2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
      to control latency vs. throughput:
      If MSGMNB is large, then msgsnd() just returns and more messages can be queued
      before a task switch to a task that calls msgrcv() is forced.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0050ee05
  18. Oct 14, 2014
    • Andrey Vagin's avatar
      ipc: always handle a new value of auto_msgmni · 1195d94e
      Andrey Vagin authored
      
      proc_dointvec_minmax() returns zero if a new value has been set.  So we
      don't need to check all charecters have been handled.
      
      Below you can find two examples.  In the new value has not been handled
      properly.
      
      $ strace ./a.out
      open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
      write(3, "0\n\0", 3)                    = 2
      close(3)                                = 0
      exit_group(0)
      $ cat /sys/kernel/debug/tracing/trace
      
      $strace ./a.out
      open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
      write(3, "0\n", 2)                      = 2
      close(3)                                = 0
      
      $ cat /sys/kernel/debug/tracing/trace
      a.out-697   [000] ....  3280.998235: unregister_ipcns_notifier <-proc_ipcauto_dointvec_minmax
      
      Fixes: 9eefe520 ("ipc: do not use a negative value to re-enable msgmni automatic recomputin")
      Signed-off-by: default avatarAndrey Vagin <avagin@openvz.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1195d94e
  19. Jun 06, 2014
  20. Apr 07, 2014
  21. Jan 28, 2014
  22. Nov 03, 2013
  23. Jan 05, 2013
    • Stanislav Kinsbursky's avatar
      ipc: add sysctl to specify desired next object id · 03f59566
      Stanislav Kinsbursky authored
      
      Add 3 new variables and sysctls to tune them (by one "next_id" variable
      for messages, semaphores and shared memory respectively).  This variable
      can be used to set desired id for next allocated IPC object.  By default
      it's equal to -1 and old behaviour is preserved.  If this variable is
      non-negative, then desired idr will be extracted from it and used as a
      start value to search for free IDR slot.
      
      Notes:
      
      1) this patch doesn't guarantee that the new object will have desired
         id.  So it's up to user space how to handle new object with wrong id.
      
      2) After a sucessful id allocation attempt, "next_id" will be set back
         to -1 (if it was non-negative).
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: default avatarStanislav Kinsbursky <skinsbursky@parallels.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      03f59566
  24. Jul 26, 2011
    • Vasiliy Kulikov's avatar
      ipc: introduce shm_rmid_forced sysctl · b34a6b1d
      Vasiliy Kulikov authored
      
      Add support for the shm_rmid_forced sysctl.  If set to 1, all shared
      memory objects in current ipc namespace will be automatically forced to
      use IPC_RMID.
      
      The POSIX way of handling shmem allows one to create shm objects and
      call shmdt(), leaving shm object associated with no process, thus
      consuming memory not counted via rlimits.
      
      With shm_rmid_forced=1 the shared memory object is counted at least for
      one process, so OOM killer may effectively kill the fat process holding
      the shared memory.
      
      It obviously breaks POSIX - some programs relying on the feature would
      stop working.  So set shm_rmid_forced=1 only if you're sure nobody uses
      "orphaned" memory.  Use shm_rmid_forced=0 by default for compatability
      reasons.
      
      The feature was previously impemented in -ow as a configure option.
      
      [akpm@linux-foundation.org: fix documentation, per Randy]
      [akpm@linux-foundation.org: fix warning]
      [akpm@linux-foundation.org: readability/conventionality tweaks]
      [akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout]
      Signed-off-by: default avatarVasiliy Kulikov <segoon@openwall.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Solar Designer <solar@openwall.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b34a6b1d
  25. Nov 12, 2009
  26. Sep 24, 2009
  27. Apr 03, 2009
  28. Jan 06, 2009
  29. Oct 16, 2008
  30. Jul 25, 2008
    • Nadia Derbey's avatar
      ipc: do not use a negative value to re-enable msgmni automatic recomputing · 9eefe520
      Nadia Derbey authored
      This patch proposes an alternative to the "magical
      positive-versus-negative number trick" Andrew complained about last week
      in http://lkml.org/lkml/2008/6/24/418
      
      .
      
      This had been introduced with the patches that scale msgmni to the amount
      of lowmem.  With these patches, msgmni has a registered notification
      routine that recomputes msgmni value upon memory add/remove or ipc
      namespace creation/ removal.
      
      When msgmni is changed from user space (i.e.  value written to the proc
      file), that notification routine is unregistered, and the way to make it
      registered back is to write a negative value into the proc file.  This is
      the "magical positive-versus-negative number trick".
      
      To fix this, a new proc file is introduced: /proc/sys/kernel/auto_msgmni.
      This file acts as ON/OFF for msgmni automatic recomputing.
      
      With this patch, the process is the following:
      1) kernel boots in "automatic recomputing mode"
         /proc/sys/kernel/msgmni contains the value that has been computed (depends
                                 on lowmem)
         /proc/sys/kernel/automatic_msgmni contains "1"
      
      2) echo <val> > /proc/sys/kernel/msgmni
         . sets msg_ctlmni to <val>
         . de-activates automatic recomputing (i.e. if, say, some memory is added
           msgmni won't be recomputed anymore)
         . /proc/sys/kernel/automatic_msgmni now contains "0"
      
      3) echo "0" > /proc/sys/kernel/automatic_msgmni
         . de-activates msgmni automatic recomputing
           this has the same effect as 2) except that msg_ctlmni's value stays
           blocked at its current value)
      
      3) echo "1" > /proc/sys/kernel/automatic_msgmni
         . recomputes msgmni's value based on the current available memory size
           and number of ipc namespaces
         . re-activates automatic recomputing for msgmni.
      
      Signed-off-by: default avatarNadia Derbey <Nadia.Derbey@bull.net>
      Cc: Solofo Ramangalahy <Solofo.Ramangalahy@bull.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9eefe520
  31. Apr 29, 2008
  32. Feb 08, 2008
    • Pavel Emelyanov's avatar
      namespaces: move the IPC namespace under IPC_NS option · ae5e1b22
      Pavel Emelyanov authored
      
      Currently the IPC namespace management code is spread over the ipc/*.c files.
      I moved this code into ipc/namespace.c file which is compiled out when needed.
      
      The linux/ipc_namespace.h file is used to store the prototypes of the
      functions in namespace.c and the stubs for NAMESPACES=n case.  This is done
      so, because the stub for copy_ipc_namespace requires the knowledge of the
      CLONE_NEWIPC flag, which is in sched.h.  But the linux/ipc.h file itself in
      included into many many .c files via the sys.h->sem.h sequence so adding the
      sched.h into it will make all these .c depend on sched.h which is not that
      good.  On the other hand the knowledge about the namespaces stuff is required
      in 4 .c files only.
      
      Besides, this patch compiles out some auxiliary functions from ipc/sem.c,
      msg.c and shm.c files.  It turned out that moving these functions into
      namespaces.c is not that easy because they use many other calls and macros
      from the original file.  Moving them would make this patch complicated.  On
      the other hand all these functions can be consolidated, so I will send a
      separate patch doing this a bit later.
      
      Signed-off-by: default avatarPavel Emelyanov <xemul@openvz.org>
      Acked-by: default avatarSerge Hallyn <serue@us.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ae5e1b22
Loading