  1. Dec 12, 2022
  2. Dec 05, 2022
    • ipc/sem: Fix dangling sem_array access in semtimedop race · b52be557
      Jann Horn authored
      
      When __do_semtimedop() goes to sleep because it has to wait for a
      semaphore value becoming zero or becoming bigger than some threshold, it
      links the on-stack sem_queue to the sem_array, then goes to sleep
      without holding a reference on the sem_array.
      
      When __do_semtimedop() comes back out of sleep, one of two things must
      happen:
      
       a) We prove that the on-stack sem_queue has been disconnected from the
          (possibly freed) sem_array, making it safe to return from the stack
          frame that the sem_queue exists in.
      
       b) We stabilize our reference to the sem_array, lock the sem_array, and
          detach the sem_queue from the sem_array ourselves.
      
      sem_array has RCU lifetime, so for case (b), the reference can be
      stabilized inside an RCU read-side critical section by locklessly
      checking whether the sem_queue is still connected to the sem_array.
      
      However, the current code does the lockless check on sem_queue before
      starting an RCU read-side critical section, so the result of the
      lockless check immediately becomes useless.
      
      Fix it by doing rcu_read_lock() before the lockless check.  Now RCU
      ensures that if we observe the object being on our queue, the object
      can't be freed until rcu_read_unlock().
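The ordering argument above can be illustrated with a small single-threaded toy model. This is only a sketch under loud assumptions: `toy_read_lock()`, `toy_remove()` and the other `toy_*` names are invented here, the "grace period" is reduced to a reader counter, and the concurrent removal is simulated by calling it at the worst possible point. It is not the kernel's RCU and not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; the names are illustrative, not kernel types. */
struct toy_array { bool freed; };
struct toy_queue { struct toy_array *sma; bool linked; };

static int toy_readers;                 /* > 0 inside a toy read-side section */
static struct toy_array *toy_deferred;  /* free postponed until readers leave */

static void toy_read_lock(void) { toy_readers++; }

static void toy_read_unlock(void)
{
    if (--toy_readers == 0 && toy_deferred) {
        toy_deferred->freed = true;
        toy_deferred = NULL;
    }
}

/* Models the other task: unlink the waiter, then free the array. */
static void toy_remove(struct toy_queue *q)
{
    q->linked = false;
    if (toy_readers > 0)
        toy_deferred = q->sma;   /* a reader is active: defer the free */
    else
        q->sma->freed = true;    /* nobody is protected: free right away */
}

/* Returns true if the woken task would go on to use a freed array. */
static bool toy_wake_touches_freed(bool lock_before_check)
{
    struct toy_array sma = { .freed = false };
    struct toy_queue q = { .sma = &sma, .linked = true };
    bool still_linked, unsafe;

    toy_readers = 0;
    toy_deferred = NULL;

    if (lock_before_check)
        toy_read_lock();
    still_linked = q.linked;     /* the lockless check */
    toy_remove(&q);              /* worst case: removal happens right here */
    if (!lock_before_check)
        toy_read_lock();

    unsafe = still_linked && sma.freed;  /* we would now lock and use sma */
    toy_read_unlock();
    return unsafe;
}
```

In this model `toy_wake_touches_freed(false)` reports the stale access, while `toy_wake_touches_freed(true)` does not, because the free is held back until the read-side section ends.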
      
      This bug is only hittable on kernel builds with full preemption support
      (either CONFIG_PREEMPT or PREEMPT_DYNAMIC with preempt=full).
      
      Fixes: 370b262c ("ipc/sem: avoid idr tree lookup for interrupted semop")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. Nov 23, 2022
    • ipc/shm: call underlying open/close vm_ops · b6305049
      Mike Kravetz authored
      Shared memory segments can be created that are backed by hugetlb pages. 
      When this happens, the vmas associated with any mappings (shmat) are
      marked VM_HUGETLB, yet the vm_ops for such mappings are provided by
      ipc/shm (shm_vm_ops).  There is a mechanism to call the underlying hugetlb
      vm_ops, and this is done for most operations.  However, it is not done for
      open and close.
      
      This was not an issue until the introduction of the hugetlb vma_lock. 
      This lock structure is pointed to by vm_private_data and the open/close
      vm_ops help maintain this structure.  The special hugetlb routine called
      at fork took care of structure updates at fork time.  However,
      vma_splitting is not properly handled for ipc shared memory mappings
      backed by hugetlb pages.  This can result in a "kernel NULL pointer
      dereference" BUG or use after free as two vmas point to the same lock
      structure.
      
      Update the shm open and close routines to always call the underlying open
      and close routines.
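The delegation idea can be sketched in plain C with function pointers. Everything here is hypothetical (`struct toy_vma`, `toy_wrapper_open()`, the `lock_users` counter); it only shows a wrapper's open/close forwarding to the backing object's open/close so that state owned by the backing layer stays balanced. It does not model vma splitting or the real hugetlb vma_lock.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vma;

struct toy_vm_ops {
    void (*open)(struct toy_vma *vma);
    void (*close)(struct toy_vma *vma);
};

struct toy_vma {
    const struct toy_vm_ops *backing_ops; /* ops of the backing object */
    int attaches;                         /* the wrapper's own bookkeeping */
    int lock_users;                       /* state only the backing ops maintain */
};

static void toy_backing_open(struct toy_vma *vma)  { vma->lock_users++; }
static void toy_backing_close(struct toy_vma *vma) { vma->lock_users--; }

static const struct toy_vm_ops toy_backing_ops = {
    .open  = toy_backing_open,
    .close = toy_backing_close,
};

/* Wrapper open: do our own accounting, then always forward. */
static void toy_wrapper_open(struct toy_vma *vma)
{
    vma->attaches++;
    if (vma->backing_ops && vma->backing_ops->open)
        vma->backing_ops->open(vma);
}

/* Wrapper close: forward first, then drop our own accounting. */
static void toy_wrapper_close(struct toy_vma *vma)
{
    if (vma->backing_ops && vma->backing_ops->close)
        vma->backing_ops->close(vma);
    vma->attaches--;
}
```

If the wrapper skipped the forwarding calls, `lock_users` would never change, which is the toy analogue of the backing layer's per-mapping state going stale.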
      
      Link: https://lkml.kernel.org/r/20221114210018.49346-1-mike.kravetz@oracle.com
      
      
      Fixes: 8d9bfb26 ("hugetlb: add vma based lock for pmd sharing")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Doug Nelson <doug.nelson@intel.com>
      Reported-by: <syzbot+83b4134621b7c326d950@syzkaller.appspotmail.com>
      Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. Oct 28, 2022
  5. Oct 03, 2022
  6. Sep 27, 2022
  7. Sep 12, 2022
  8. Jul 19, 2022
  9. Jul 18, 2022
  10. Jun 22, 2022
    • ipc: Free mq_sysctls if ipc namespace creation failed · db7cfc38
      Alexey Gladkov authored
      
      The problem that Dmitry Vyukov pointed out is that if setup_ipc_sysctls fails,
      mq_sysctls must be freed before return.
      
      executing program
      BUG: memory leak
      unreferenced object 0xffff888112fc9200 (size 512):
        comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s)
        hex dump (first 32 bytes):
          ef d3 60 85 ff ff ff ff 0c 9b d2 12 81 88 ff ff  ..`.............
          04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff814b6eb3>] kmemdup+0x23/0x50 mm/util.c:129
          [<ffffffff82219a9b>] kmemdup include/linux/fortify-string.h:456 [inline]
          [<ffffffff82219a9b>] setup_mq_sysctls+0x4b/0x1c0 ipc/mq_sysctl.c:89
          [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline]
          [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91
          [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90
          [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226
          [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165
          [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline]
          [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline]
          [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234
          [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      BUG: memory leak
      unreferenced object 0xffff888112fd5f00 (size 256):
        comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s)
        hex dump (first 32 bytes):
          00 92 fc 12 81 88 ff ff 00 00 00 00 01 00 00 00  ................
          01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff816fea1b>] kmalloc include/linux/slab.h:605 [inline]
          [<ffffffff816fea1b>] kzalloc include/linux/slab.h:733 [inline]
          [<ffffffff816fea1b>] __register_sysctl_table+0x7b/0x7f0 fs/proc/proc_sysctl.c:1344
          [<ffffffff82219b7a>] setup_mq_sysctls+0x12a/0x1c0 ipc/mq_sysctl.c:112
          [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline]
          [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91
          [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90
          [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226
          [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165
          [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline]
          [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline]
          [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234
          [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      BUG: memory leak
      unreferenced object 0xffff888112fbba00 (size 256):
        comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s)
        hex dump (first 32 bytes):
          78 ba fb 12 81 88 ff ff 00 00 00 00 01 00 00 00  x...............
          01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff816fef49>] kmalloc include/linux/slab.h:605 [inline]
          [<ffffffff816fef49>] kzalloc include/linux/slab.h:733 [inline]
          [<ffffffff816fef49>] new_dir fs/proc/proc_sysctl.c:978 [inline]
          [<ffffffff816fef49>] get_subdir fs/proc/proc_sysctl.c:1022 [inline]
          [<ffffffff816fef49>] __register_sysctl_table+0x5a9/0x7f0 fs/proc/proc_sysctl.c:1373
          [<ffffffff82219b7a>] setup_mq_sysctls+0x12a/0x1c0 ipc/mq_sysctl.c:112
          [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline]
          [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91
          [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90
          [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226
          [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165
          [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline]
          [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline]
          [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234
          [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      BUG: memory leak
      unreferenced object 0xffff888112fbb900 (size 256):
        comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s)
        hex dump (first 32 bytes):
          78 b9 fb 12 81 88 ff ff 00 00 00 00 01 00 00 00  x...............
          01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff816fef49>] kmalloc include/linux/slab.h:605 [inline]
          [<ffffffff816fef49>] kzalloc include/linux/slab.h:733 [inline]
          [<ffffffff816fef49>] new_dir fs/proc/proc_sysctl.c:978 [inline]
          [<ffffffff816fef49>] get_subdir fs/proc/proc_sysctl.c:1022 [inline]
          [<ffffffff816fef49>] __register_sysctl_table+0x5a9/0x7f0 fs/proc/proc_sysctl.c:1373
          [<ffffffff82219b7a>] setup_mq_sysctls+0x12a/0x1c0 ipc/mq_sysctl.c:112
          [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline]
          [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91
          [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90
          [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226
          [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165
          [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline]
          [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline]
          [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234
          [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
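The leak reports above come from an error path that returned without undoing an earlier setup step. A minimal sketch of the usual unwind pattern follows, assuming two hypothetical setup steps (`toy_setup_table()` stands in for both sysctl setups) and a counter in place of a leak detector. None of these names are from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int toy_live_tables;   /* tables allocated and not yet released */

/* Hypothetical setup step; `fail` forces the error path. */
static void *toy_setup_table(bool fail)
{
    void *tbl = fail ? NULL : malloc(16);

    if (tbl)
        toy_live_tables++;
    return tbl;
}

static void toy_retire_table(void *tbl)
{
    if (tbl) {
        toy_live_tables--;
        free(tbl);
    }
}

/* Returns 0 on success, -1 on failure; never leaves a table behind. */
static int toy_create_ns(bool fail_first, bool fail_second)
{
    void *mq_tbl, *ipc_tbl;

    mq_tbl = toy_setup_table(fail_first);
    if (!mq_tbl)
        goto out;

    ipc_tbl = toy_setup_table(fail_second);
    if (!ipc_tbl)
        goto undo_mq;          /* second step failed: undo the first */

    /* A real namespace would keep both tables; the toy releases them. */
    toy_retire_table(ipc_tbl);
    toy_retire_table(mq_tbl);
    return 0;

undo_mq:
    toy_retire_table(mq_tbl);
out:
    return -1;
}
```

With the `undo_mq` label removed, `toy_create_ns(false, true)` would return with `toy_live_tables == 1`, which mirrors the situation the commit describes.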
      
      Reported-by: <syzbot+b4b0d1b35442afbf6fd2@syzkaller.appspotmail.com>
      Signed-off-by: Alexey Gladkov <legion@kernel.org>
      Link: https://lkml.kernel.org/r/000000000000f5004705e1db8bad@google.com
      Link: https://lkml.kernel.org/r/20220622200729.2639663-1-legion@kernel.org
      
      
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  11. May 10, 2022
  12. May 03, 2022
  13. Mar 22, 2022
  14. Mar 08, 2022
  15. Feb 04, 2022
  16. Jan 22, 2022
  17. Nov 20, 2021
    • shm: extend forced shm destroy to support objects from several IPC nses · 85b6d246
      Alexander Mikhalitsyn authored
      Currently, the exit_shm() function is not designed to work properly when
      task->sysvshm.shm_clist holds shm objects from different IPC namespaces.
      
      This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it
      leads to a use-after-free (a reproducer exists).
      
      This is an attempt to fix the problem by extending the exit_shm()
      mechanism to handle destruction of shm objects from several IPC
      namespaces.
      
      To achieve that we do several things:
      
      1. Add a (non-refcounted) namespace pointer to struct shmid_kernel.
      
      2. When a new shm object is created (newseg(), i.e. the shmget syscall),
         initialize this pointer with the current task's IPC namespace.
      
      3. Rework exit_shm() so that it traverses all shp objects in
         task->sysvshm.shm_clist and takes the IPC namespace from the shp
         object itself rather than from the current task, as it did before,
         and then calls shm_destroy(shp, ns).
      
      Note: We need to be really careful here because, as said in (1), our
      pointer to the IPC namespace is not refcounted.  To be on the safe side
      we use the special helper get_ipc_ns_not_zero(), which takes a reference
      on the IPC namespace only if it is not in the "state of destruction".
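The "take a reference only if the count is not already zero" idea can be sketched in userspace with C11 atomics. `toy_get_not_zero()` is a hypothetical helper written for this illustration; it is not the kernel's get_ipc_ns_not_zero(), only the same general pattern.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Try to take a reference.  Succeeds only while the count is non-zero,
 * i.e. while the object has not started being torn down.
 */
static bool toy_get_not_zero(atomic_int *refs)
{
    int old = atomic_load(refs);

    while (old != 0) {
        /* On failure, `old` is reloaded with the current value. */
        if (atomic_compare_exchange_weak(refs, &old, old + 1))
            return true;
    }
    return false;   /* count already hit zero: do not resurrect it */
}
```

A caller that gets `false` has to treat the object as gone and skip it, instead of incrementing a count that has already dropped to zero.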
      
      Q/A
      
      Q: Why can we access shp->ns memory using a non-refcounted pointer?
      A: Because the shp object's lifetime is always shorter than the IPC
         namespace's lifetime, so if we get the shp object from
         task->sysvshm.shm_clist while holding task_lock(task), nobody can
         steal our namespace.
      
      Q: Does this patch change the semantics of the unshare/setns/clone syscalls?
      A: No.  It just fixes an uncovered case where a process may leave an IPC
         namespace without getting its task->sysvshm.shm_clist list cleaned up.
      
      Link: https://lkml.kernel.org/r/67bb03e5-f79c-1815-e2bf-949c67047418@colorfullife.com
      Link: https://lkml.kernel.org/r/20211109151501.4921-1-manfred@colorfullife.com
      
      
      Fixes: ab602f79 ("shm: make exit_shm work proportional to task activity")
      Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc: WARN if trying to remove ipc object which is absent · 126e8bee
      Alexander Mikhalitsyn authored
      Patch series "shm: shm_rmid_forced feature fixes".
      
      Some time ago I hit a kernel crash after a CRIU restore procedure.
      Fortunately, because it was a CRIU restore, I had the dump files, could
      repeat the restore many times, and the crash reproduced easily.  After some
      investigation I constructed a minimal reproducer.  It turned out to be a
      use-after-free that happens only if sysctl kernel.shm_rmid_forced = 1.
      
      The key problem is that the exit_shm() function does not handle
      destruction of shp objects when task->sysvshm.shm_clist contains items
      from different IPC namespaces.  In most cases this list contains only
      items from one IPC namespace.
      
      How can this list contain objects from different namespaces? The
      exit_shm() function is designed to always clean up this list when a
      process leaves an IPC namespace.  But we made a mistake a long time ago and
      did not add an exit_shm() call to the setns() syscall procedures.
      
      The first idea was just to add this call to the setns() syscall, but that
      obviously changes the semantics of setns(), which is a userspace-visible
      change.  So I gave up on this idea.
      
      The first real attempt to address the issue was just to omit the forced
      destroy if we meet an shp object that is not from the current task's IPC
      namespace [1].  But that was not the best idea, because
      task->sysvshm.shm_clist was protected by an rwsem that belongs to the
      current task's IPC namespace.  It means that list corruption may occur.
      
      The second approach is to extend exit_shm() to properly handle shp objects
      from different IPC namespaces [2].  This is a really non-trivial thing.  I
      put a lot of effort into it but did not believe that it was possible to
      make it fully safe, clean and clear.
      
      Thanks to the efforts of Manfred Spraul, an elegant solution was
      designed.  Thanks a lot, Manfred!
      
      Eric also suggested a way to address the issue in ("[RFC][PATCH] shm:
      In shm_exit destroy all created and never attached segments").  Eric's
      idea was to maintain a list of shm_clists, one per IPC namespace, using
      lock-less lists.  But there are some concerns about extra memory
      consumption.
      
      An alternative solution, which I suggested, was implemented in
      ("shm: reset shm_clist on setns but omit forced shm destroy").  The idea
      is pretty simple: we add an exit_shm() call to setns() but DO NOT
      destroy shm segments even if sysctl kernel.shm_rmid_forced = 1; we just
      clean up the task->sysvshm.shm_clist list.
      
      This changes the semantics of the setns() syscall a little bit, but in
      comparison to the "naive" solution, where we just add exit_shm() without
      any special exclusions, this looks like a safer option.
      
      [1] https://lkml.org/lkml/2021/7/6/1108
      [2] https://lkml.org/lkml/2021/7/14/736
      
      This patch (of 2):
      
      Let's produce a warning if we are trying to remove a non-existent IPC
      object from the IPC namespace kht/idr structures.
      
      This allows us to catch possible bugs where the ipc_rmid() function is
      called with inconsistent struct ipc_ids*, struct kern_ipc_perm*
      arguments.
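As a rough illustration of the consistency check, here is a toy id table in userspace C. The types and `toy_rmid()` are hypothetical, a counter stands in for a kernel WARN, and returning early on a mismatch is a choice made for this sketch only; the commit text does not say how the real function proceeds after warning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 8

struct toy_obj { int id; };

struct toy_ids {
    struct toy_obj *slot[TOY_SLOTS];
    int in_use;
    int warnings;   /* stands in for a WARN being emitted */
};

/* Remove obj from ids; complain if ids does not actually hold obj. */
static bool toy_rmid(struct toy_ids *ids, struct toy_obj *obj)
{
    if (obj->id < 0 || obj->id >= TOY_SLOTS || ids->slot[obj->id] != obj) {
        ids->warnings++;   /* inconsistent (ids, obj) pair */
        return false;
    }
    ids->slot[obj->id] = NULL;
    ids->in_use--;
    return true;
}
```

Passing an object together with the wrong table, which is the kind of mix-up that objects from several namespaces make possible, bumps `warnings` instead of silently corrupting the bookkeeping.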
      
      Link: https://lkml.kernel.org/r/20211027224348.611025-1-alexander.mikhalitsyn@virtuozzo.com
      Link: https://lkml.kernel.org/r/20211027224348.611025-2-alexander.mikhalitsyn@virtuozzo.com
      
      
      Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. Nov 09, 2021
  19. Sep 14, 2021
  20. Sep 08, 2021
    • ipc: replace costly bailout check in sysvipc_find_ipc() · 20401d10
      Rafael Aquini authored
      sysvipc_find_ipc() was left with a costly way to check whether the offset
      position fed to it is bigger than the total number of IPC IDs in use.  So
      much so that the time it takes to iterate over /proc/sysvipc/* files grows
      roughly quadratically for a custom benchmark that creates "N" SYSV shm
      segments and then times the read of /proc/sysvipc/shm (milliseconds):
      
          12 msecs to read   1024 segs from /proc/sysvipc/shm
          18 msecs to read   2048 segs from /proc/sysvipc/shm
          65 msecs to read   4096 segs from /proc/sysvipc/shm
         325 msecs to read   8192 segs from /proc/sysvipc/shm
        1303 msecs to read  16384 segs from /proc/sysvipc/shm
        5182 msecs to read  32768 segs from /proc/sysvipc/shm
      
      The root problem lies with the loop that computes the total number of ids
      in use to check whether the "pos" fed to sysvipc_find_ipc() grew bigger than
      "ids->in_use".  That is quite an inefficient way to get to the maximum
      index in the id lookup table, especially when that value is already
      provided by struct ipc_ids.max_idx.
      
      This patch follows up on the optimization introduced via commit
      15df03c8 ("sysvipc: make get_maxid O(1) again") and gets rid of the
      aforementioned costly loop, replacing it with a simpler check based on
      the value returned by ipc_get_maxidx(), which allows for a smooth linear
      increase in run time for the same custom benchmark:
      
           2 msecs to read   1024 segs from /proc/sysvipc/shm
           2 msecs to read   2048 segs from /proc/sysvipc/shm
           4 msecs to read   4096 segs from /proc/sysvipc/shm
           9 msecs to read   8192 segs from /proc/sysvipc/shm
          19 msecs to read  16384 segs from /proc/sysvipc/shm
          39 msecs to read  32768 segs from /proc/sysvipc/shm
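The difference between the two bailout checks can be sketched with a toy id table. This is a model under assumptions, not the kernel code: `struct toy_ids`, `toy_find_counting()`, `toy_find_maxidx()` and `toy_walk()` are invented names, a plain array replaces the idr, and `steps` counts only the work done by the counting-style check.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SLOTS 64

struct toy_ids {
    bool used[TOY_SLOTS];
    int in_use;    /* number of used slots */
    int max_idx;   /* highest used index, -1 if none */
};

/* Counting-style bailout: re-count used slots below pos on every call. */
static int toy_find_counting(const struct toy_ids *ids, int pos, long *steps)
{
    int total = 0;

    for (int id = 0; id < pos && total < ids->in_use; id++) {
        (*steps)++;
        if (ids->used[id])
            total++;
    }
    if (total >= ids->in_use)
        return -1;               /* everything in use lies below pos */
    for (; pos < TOY_SLOTS; pos++)
        if (ids->used[pos])
            return pos;
    return -1;
}

/* max_idx-style bailout: a single comparison. */
static int toy_find_maxidx(const struct toy_ids *ids, int pos)
{
    if (ids->max_idx < 0 || pos > ids->max_idx)
        return -1;
    for (; pos <= ids->max_idx; pos++)
        if (ids->used[pos])
            return pos;
    return -1;
}

/* Walk all entries the way a sequential reader would; returns the count. */
static int toy_walk(const struct toy_ids *ids, bool counting, long *steps)
{
    int found = 0;

    for (int pos = 0; ; ) {
        int id = counting ? toy_find_counting(ids, pos, steps)
                          : toy_find_maxidx(ids, pos);
        if (id < 0)
            break;
        found++;
        pos = id + 1;
    }
    return found;
}
```

For a densely used table with N entries, the counting variant spends 0 + 1 + ... + N steps on bailout checks over a full walk, which is where the superlinear growth in the first table comes from; the `max_idx` variant spends none.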
      
      Link: https://lkml.kernel.org/r/20210809203554.1562989-1-aquini@redhat.com
      
      
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Waiman Long <llong@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. Sep 03, 2021
    • memcg: enable accounting of ipc resources · 18319498
      Vasily Averin authored
      When a user creates IPC objects, the kernel is forced to allocate memory
      for these long-living objects.
      
      It makes sense to account them to restrict the host's memory consumption
      from inside the memcg-limited container.
      
      This patch enables accounting for IPC shared memory segments, messages,
      semaphores and semaphore undo lists.
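The commit text does not show the mechanism, so the following is only a toy model of what "accounting an allocation to a limited group" means. `struct toy_group`, `TOY_ACCOUNT` and `toy_alloc()` are hypothetical names for this sketch; in the kernel, charging is requested through allocation flags rather than a helper like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_ACCOUNT 0x1u   /* hypothetical "charge this to the group" flag */

struct toy_group {
    size_t charged;   /* bytes currently charged */
    size_t limit;     /* maximum bytes the group may hold */
};

/* Allocate size bytes; if TOY_ACCOUNT is set, charge them to grp first. */
static void *toy_alloc(struct toy_group *grp, size_t size, unsigned flags)
{
    void *p;

    if (flags & TOY_ACCOUNT) {
        if (size > grp->limit - grp->charged)
            return NULL;          /* over the group's limit */
        grp->charged += size;
    }
    p = malloc(size);
    if (!p && (flags & TOY_ACCOUNT))
        grp->charged -= size;     /* undo the charge on failure */
    return p;
}
```

Unaccounted allocations bypass the limit entirely, which is the situation for long-lived IPC objects that the commit sets out to change.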
      
      Link: https://lkml.kernel.org/r/d6507b06-4df6-78f8-6c54-3ae86e3b5339@virtuozzo.com
      
      
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Yutian Yang <nglaive@gmail.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: enable accounting for new namesapces and struct nsproxy · 30acd0bd
      Vasily Averin authored
      A container admin can create new namespaces and force the kernel to
      allocate up to several pages of memory for the namespaces and their
      associated structures.
      
      Net and uts namespaces already have accounting enabled for such
      allocations.  It makes sense to account for the remaining ones too, to
      restrict the host's memory consumption from inside the memcg-limited
      container.
      
      Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com
      
      
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Yutian Yang <nglaive@gmail.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. Aug 20, 2021
    • ARM: 9114/1: oabi-compat: rework sys_semtimedop emulation · bdec0145
      Arnd Bergmann authored
      
      sys_oabi_semtimedop() is one of the last users of set_fs() on Arm. To
      remove this one, expose the internal code of the actual implementation
      that operates on a kernel pointer and call it directly after copying.
      
      There should be no measurable impact on the normal execution of this
      function, and it makes the overly long function a little shorter, which
      may help readability.
      
      While reworking the oabi version, make it behave a little more like
      the native one, using kvmalloc_array() and restructure the code
      flow in a similar way.
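The "copy into a native array, then call an inner helper that takes a plain pointer" shape can be sketched in userspace. The struct layouts and both functions below are hypothetical stand-ins chosen for illustration; `calloc()` plays the role of an overflow-checked array allocation, and the inner helper only sums the operations so the result is easy to check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical legacy and native layouts; only the padding differs. */
struct toy_compat_sembuf { unsigned short num; short op; short flg; unsigned short pad; };
struct toy_sembuf        { unsigned short num; short op; short flg; };

/* Stands in for the inner helper that works on an already-copied array. */
static long toy_do_semop(const struct toy_sembuf *sops, size_t nsops)
{
    long sum = 0;

    for (size_t i = 0; i < nsops; i++)
        sum += sops[i].op;
    return sum;
}

/* Compat entry point: convert each element, then call the helper directly. */
static long toy_compat_semop(const struct toy_compat_sembuf *in, size_t nsops)
{
    struct toy_sembuf *sops;
    long ret;

    if (nsops == 0)
        return -1;
    sops = calloc(nsops, sizeof(*sops));   /* rejects nsops * size overflow */
    if (!sops)
        return -1;

    for (size_t i = 0; i < nsops; i++) {
        sops[i].num = in[i].num;
        sops[i].op  = in[i].op;
        sops[i].flg = in[i].flg;
    }

    ret = toy_do_semop(sops, nsops);
    free(sops);
    return ret;
}
```

Because the helper receives an ordinary pointer to converted data, the wrapper needs no trick to make caller-owned memory look like something else, which is the motivation the commit gives for exposing the inner function.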
      
      The naming of __do_semtimedop() is not very good, I hope someone can
      come up with a better name.
      
      One regression was spotted by kernel test robot <rong.a.chen@intel.com>
      and fixed before the first mailing list submission.
      
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
  23. Jul 01, 2021
  24. May 23, 2021
    • ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry · a11ddb37
      Varad Gautam authored
      do_mq_timedreceive calls wq_sleep with a stack local address.  The
      sender (do_mq_timedsend) uses this address to later call pipelined_send.
      
      This leads to a very hard to trigger race where a do_mq_timedreceive
      call might return and leave do_mq_timedsend to rely on an invalid
      address, causing the following crash:
      
        RIP: 0010:wake_q_add_safe+0x13/0x60
        Call Trace:
         __x64_sys_mq_timedsend+0x2a9/0x490
         do_syscall_64+0x80/0x680
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f5928e40343
      
      The race occurs as:
      
      1. do_mq_timedreceive calls wq_sleep with the address of `struct
         ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
         holds a valid `struct ext_wait_queue *` as long as the stack has not
         been overwritten.
      
      2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
         do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
         __pipelined_op.
      
      3. Sender calls __pipelined_op::smp_store_release(&this->state,
         STATE_READY).  Here is where the race window begins.  (`this` is
         `ewq_addr`.)
      
      4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
         will see `state == STATE_READY` and break.
      
      5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
         to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
         stack.  (Although the address may not get overwritten until another
         function happens to touch it, which means it can persist around for an
         indefinite time.)
      
      6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
         `struct ext_wait_queue *`, and uses it to find a task_struct to pass to
         the wake_q_add_safe call.  In the lucky case where nothing has
         overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
         In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
         bogus address as the receiver's task_struct causing the crash.
      
      do_mq_timedsend::__pipelined_op() should not dereference `this` after
      setting STATE_READY, as the receiver counterpart is now free to return.
      Change __pipelined_op to call wake_q_add_safe on the receiver's
      task_struct returned by get_task_struct, instead of dereferencing `this`
      which sits on the receiver's stack.
      
      As Manfred pointed out, the race potentially also exists in
      ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare.  Fix
      those in the same way.
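The rule "read what you need before publishing the ready state" can be sketched with C11 atomics. The types and `toy_pipelined_op()` are hypothetical, and the sketch leaves out the task reference counting that the real fix relies on to keep the task itself alive; it shows only the ordering of the pointer read relative to the release store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { TOY_PENDING = 0, TOY_READY = 1 };

struct toy_task { int woken; };

/* Lives on the waiter's stack in the scenario described above. */
struct toy_waiter {
    struct toy_task *task;
    atomic_int state;
};

/* Waker side: hand over to the waiter, then wake its task. */
static struct toy_task *toy_pipelined_op(struct toy_waiter *w)
{
    /* Read everything we need from *w while it is still guaranteed valid. */
    struct toy_task *task = w->task;

    /* After this store the waiter may return and reuse its stack frame. */
    atomic_store_explicit(&w->state, TOY_READY, memory_order_release);

    /* From here on, only the local copy is used; w is off limits. */
    task->woken = 1;
    return task;
}
```

The buggy ordering would be the same function with `w->task` read after the store, which is harmless in a single-threaded test but races with the waiter's return in the scenario from steps 3 to 6.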
      
      Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
      
      
      Fixes: c5b2cbdb ("ipc/mqueue.c: update/document memory barriers")
      Fixes: 8116b54e ("ipc/sem.c: document and update memory barriers")
      Fixes: 0d97a82b ("ipc/msg.c: update and document memory barriers")
      Signed-off-by: Varad Gautam <varad.gautam@suse.com>
      Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. May 07, 2021