  1. Dec 12, 2023
  2. Sep 19, 2023
  3. Aug 21, 2023
    • memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy · 9876cfe8
      Aleksa Sarai authored
      This sysctl has the very unusual behaviour of not allowing any user (even
      CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were
      to set this sysctl to a more restrictive option in the host pidns you
      would need to reboot your machine in order to reset it.
      
      The justification given in [1] is that this is a security feature and thus
      it should not be possible to disable.  Aside from the fact that we have
      plenty of security-related sysctls that can be disabled after being
      enabled (fs.protected_symlinks for instance), the protection provided by
      the sysctl is to stop users from being able to create a binary and then
      execute it.  A user with CAP_SYS_ADMIN can trivially do this without
      memfd_create(2):
      
        % cat mount-memfd.c
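        #define _GNU_SOURCE	/* for asprintf() */
        #include <assert.h>
        #include <fcntl.h>
        #include <string.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <sys/mount.h>	/* fsopen() and friends; assumes glibc 2.36 or later */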
      
        #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"
      
        int main(void)
        {
        	int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
        	assert(fsfd >= 0);
        	assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
      
        	int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
        	assert(dfd >= 0);
      
        	int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0755);
        	assert(execfd >= 0);
        	assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
        	assert(!close(execfd));
      
        	char *execpath = NULL;
        	char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
        	execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
        	assert(execfd >= 0);
        	assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
        	assert(!execve(execpath, argv, envp));
        }
        % ./mount-memfd
        this file was executed from this totally private tmpfs: /proc/self/fd/5
        %
      
      Given that it is possible for CAP_SYS_ADMIN users to create executable
      binaries without memfd_create(2) and without touching the host filesystem
      (not to mention the many other things a CAP_SYS_ADMIN process would be
      able to do that would be equivalent or worse), it seems strange to cause a
      fair amount of headache to admins when there doesn't appear to be an
      actual security benefit to blocking this.  There appear to be concerns
      about confused-deputy-esque attacks[2] but a confused deputy that can
      write to arbitrary sysctls is a bigger security issue than executable
      memfds.
      
      /* New API */
      
      The primary requirement from the original author appears to be more based
      on the need to be able to restrict an entire system in a hierarchical
      manner[3], such that child namespaces cannot re-enable executable memfds.
      
      So, implement that behaviour explicitly -- the vm.memfd_noexec scope is
      evaluated up the pidns tree to &init_pid_ns and you have the most
      restrictive value applied to you.  The new lower limit to which you can
      set vm.memfd_noexec is whatever limit applies to your parent.
      
      Note that a pidns will inherit a copy of the parent pidns's effective
      vm.memfd_noexec setting at unshare() time.  This matches the existing
      behaviour, and it also ensures that a pidns will never have its
      vm.memfd_noexec setting *lowered* behind its back (but it will be raised
      if the parent raises theirs).
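
      The hierarchical rule can be sketched in userspace as follows.  This is
      only a toy model, not the kernel implementation: struct toy_pidns,
      toy_effective_scope() and toy_set_scope() are hypothetical names, and the
      sketch assumes that a larger number means a more restrictive setting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model of a pid namespace; not the kernel's struct. */
struct toy_pidns {
	const struct toy_pidns *parent;	/* NULL for the initial namespace */
	int memfd_noexec_scope;		/* this namespace's own setting */
};

/* Walk up to the root and return the most restrictive (largest) value. */
int toy_effective_scope(const struct toy_pidns *ns)
{
	int scope = 0;

	for (; ns != NULL; ns = ns->parent)
		if (ns->memfd_noexec_scope > scope)
			scope = ns->memfd_noexec_scope;
	return scope;
}

/*
 * Accept a new value only if it is not below the limit that applies to the
 * parent.  Returns 0 on success and -1 if the write would be refused.
 */
int toy_set_scope(struct toy_pidns *ns, int value)
{
	int floor;

	assert(ns != NULL);
	floor = ns->parent ? toy_effective_scope(ns->parent) : 0;
	if (value < floor)
		return -1;
	ns->memfd_noexec_scope = value;
	return 0;
}
```

      In this model the root namespace can lower its own value again, while a
      child can only move between its parent's effective value and the
      maximum.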
      
      /* Backwards Compatibility */
      
      As the previous version of the sysctl didn't allow you to lower the
      setting at all, there are no backwards compatibility issues with this
      aspect of the change.
      
      However, it should be noted that the setting is now completely
      hierarchical.  Previously, a cloned pidns would just copy the current
      pidns setting, meaning that if the parent's vm.memfd_noexec was changed it
      wouldn't propagate to existing pid namespaces.  Now, the restriction
      applies recursively.  This is a uAPI change, however:
      
       * The sysctl is very new, having been merged in 6.3.
       * Several aspects of the sysctl were broken up until this patchset and
         the other patchset by Jeff Xu last month.
      
      And thus it seems incredibly unlikely that any real users would run into
      this issue. In the worst case, if this causes userspace issues we could
      make it so that modifying the setting follows the hierarchical rules but
      the restriction checking uses the cached copy.
      
      [1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
      [2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/
      [3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20230814-memfd-vm-noexec-uapi-fixes-v2-4-7ff9e3e10ba6@cyphar.com
      
      
      Fixes: 105ff533 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Dominique Martinet <asmadeus@codewreck.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Daniel Verkamp <dverkamp@chromium.org>
      Cc: Jeff Xu <jeffxu@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. Jul 01, 2023
  5. Jun 30, 2023
  6. Apr 03, 2023
    • pid: add pidfd_prepare() · 6ae930d9
      Christian Brauner authored
      
      Add a new helper that reserves a pidfd and allocates a new
      pidfd file that stashes the provided struct pid. This will allow us to
      remove places that either open code this function or that call
      pidfd_create() but then have to call close_fd() because there are still
      failure points after pidfd_create() has been called.
      
      Reviewed-by: Jan Kara <jack@suse.cz>
      Message-Id: <20230327-pidfd-file-api-v1-1-5c0e9a3158e4@kernel.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
  7. Jun 29, 2022
    • gfs2: Add glockfd debugfs file · 4480c27c
      Andreas Gruenbacher authored
      
      When a process has a gfs2 file open, the file is keeping a reference on the
      underlying gfs2 inode, and the inode is keeping the inode's iopen glock held in
      shared mode.  In other words, the process depends on the iopen glock of each
      open gfs2 file.  Expose those dependencies in a new "glockfd" debugfs file.
      
      The new debugfs file contains one line for each gfs2 file descriptor,
      specifying the tgid, file descriptor number, and glock name, e.g.,
      
        1601 6 5/816d
      
      This list is compiled by iterating all tasks on the system using find_ge_pid(),
      and all file descriptors of each task using task_lookup_next_fd_rcu().  To make
      that work from gfs2, export those two functions.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  8. Oct 14, 2021
  9. Aug 10, 2021
  10. Dec 10, 2020
    • exec: Transform exec_update_mutex into a rw_semaphore · f7cfd871
      Eric W. Biederman authored
      Recently syzbot reported[0] that there is a deadlock amongst the users
      of exec_update_mutex.  The problematic lock ordering found by lockdep
      was:
      
         perf_event_open  (exec_update_mutex -> ovl_i_mutex)
         chown            (ovl_i_mutex       -> sb_writes)
         sendfile         (sb_writes         -> p->lock)
           by reading from a proc file and writing to overlayfs
         proc_pid_syscall (p->lock           -> exec_update_mutex)
      
      While looking at possible solutions it occurred to me that all of the
      users and possible users involved only wanted the state of the given
      process to remain the same.  They are all readers.  The only writer is
      exec.
      
      There is no reason for readers to block on each other.  So fix
      this deadlock by transforming exec_update_mutex into a rw_semaphore
      named exec_update_lock that only exec takes for writing.
      
      Cc: Jann Horn <jannh@google.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christopher Yeoh <cyeoh@au1.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Fixes: eea96732 ("exec: Add exec_update_mutex to replace cred_guard_mutex")
      [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
      
      
      Reported-by: <syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com>
      Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
      
      
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  11. Oct 18, 2020
  12. Sep 04, 2020
  13. Aug 19, 2020
    • pid: Use generic ns_common::count · 8eb71d95
      Kirill Tkhai authored
      
      Switch over pid namespaces to use the newly introduced common lifetime
      counter.
      
      Currently every namespace type has its own lifetime counter which is stored
      in the specific namespace struct. The lifetime counters are used
      identically for all namespace types. Namespaces may of course have
      additional unrelated counters and these are not altered.
      
      This introduces a common lifetime counter into struct ns_common. The
      ns_common struct encompasses information that all namespaces share. That
      should include the lifetime counter since it's common to all of them.
      
      It also allows us to unify the type of the counters across all namespaces.
      Most of them use refcount_t but one uses atomic_t and at least one uses
      kref. Especially the last one doesn't make much sense since it's just a
      wrapper around refcount_t since 2016 and actually complicates cleanup
      operations by having to use container_of() to cast the correct namespace
      struct out of struct ns_common.
      
      Having the lifetime counter for the namespaces in one place reduces
      maintenance cost. Not just because after switching all namespaces over we
      will have removed more code than we added but also because the logic is
      more easily understandable and we indicate to the user that the basic
      lifetime requirements for all namespaces are currently identical.
      
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/r/159644979226.604812.7512601754841882036.stgit@localhost.localdomain
      
      
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  14. Jul 19, 2020
  15. Jul 13, 2020
  16. Apr 30, 2020
  17. Apr 28, 2020
    • proc: Ensure we see the exit of each process tid exactly once · 6b03d130
      Eric W. Biederman authored
      When the thread group leader changes during exec and the old leader's
      thread is reaped, proc_flush_pid will flush the dentries for the entire
      process because the leader still has its original pid.
      
      Fix this by exchanging the pids in an rcu safe manner,
      and wrapping the code to do that up in a helper exchange_tids.
      
      When I removed switch_exec_pids and introduced this behavior
      in d73d6529 ("[PATCH] pidhash: kill switch_exec_pids") there
      really was nothing that cared as flushing happened with
      the cached dentry and de_thread flushed both of them on exec.
      
      This lack of fully exchanging pids became a problem a few months later
      when I introduced 48e6484d ("[PATCH] proc: Rewrite the proc dentry
      flush on exit optimization"), which overlooked that the de_thread case
      was no longer swapping pids while I was looking up proc dentries
      by task->pid.
      
      The current behavior isn't properly a bug as everything in proc will
      continue to work correctly just a little bit less efficiently.  Fix
      this just so there are no little surprise corner cases waiting to bite
      people.
      
      -- Oleg points out this could be an issue in next_tgid in proc where
         has_group_leader_pid is called, and reordering some of the assignments
         should fix that.
      
      -- Oleg points out this will break the 10 year old hack in __exit_signal.c
      >	/*
      >	 * This can only happen if the caller is de_thread().
      >	 * FIXME: this is the temporary hack, we should teach
      >	 * posix-cpu-timers to handle this case correctly.
      >	 */
      >	if (unlikely(has_group_leader_pid(tsk)))
      >		posix_cpu_timers_exit_group(tsk);
      
      The code in next_tgid has been changed to use PIDTYPE_TGID,
      and the posix cpu timers code has been fixed so it does not
      need the 10 year old hack, so this should be safe to merge
      now.
      
      Link: https://lore.kernel.org/lkml/87h7x3ajll.fsf_-_@x220.int.ebiederm.org/
      
      
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Fixes: 48e6484d ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  18. Apr 09, 2020
    • proc: Use a dedicated lock in struct pid · 63f818f4
      Eric W. Biederman authored
      
      syzbot wrote:
      > ========================================================
      > WARNING: possible irq lock inversion dependency detected
      > 5.6.0-syzkaller #0 Not tainted
      > --------------------------------------------------------
      > swapper/1/0 just changed the state of lock:
      > ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
      > but this lock took another, SOFTIRQ-unsafe lock in the past:
      >  (&pid->wait_pidfd){+.+.}-{2:2}
      >
      >
      > and interrupts could create inverse lock ordering between them.
      >
      >
      > other info that might help us debug this:
      >  Possible interrupt unsafe locking scenario:
      >
      >        CPU0                    CPU1
      >        ----                    ----
      >   lock(&pid->wait_pidfd);
      >                                local_irq_disable();
      >                                lock(tasklist_lock);
      >                                lock(&pid->wait_pidfd);
      >   <Interrupt>
      >     lock(tasklist_lock);
      >
      >  *** DEADLOCK ***
      >
      > 4 locks held by swapper/1/0:
      
      The problem is that because wait_pidfd.lock is taken under the tasklist
      lock, it must always be taken with irqs disabled, as tasklist_lock can be
      taken from interrupt context and if wait_pidfd.lock was already taken this
      would create a lock order inversion.
      
      Oleg suggested just disabling irqs where I have added extra calls to
      wait_pidfd.lock.  That should be safe and I think the code will eventually
      do that.  It was rightly pointed out by Christian that sharing the
      wait_pidfd.lock was a premature optimization.
      
      It is also true that my pre-merge window testing was insufficient.  So
      remove the premature optimization and give struct pid a dedicated lock of
      its own for struct pid things.  I have verified that lockdep sees all 3
      paths where we take the new pid->lock and lockdep does not complain.
      
      It is my current day dream that one day pid->lock can be used to guard the
      task lists as well and then the tasklist_lock won't need to be held to
      deliver signals.  That will require taking pid->lock with irqs disabled.
      
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
      
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Reported-by: <syzbot+343f75cdeea091340956@syzkaller.appspotmail.com>
      Reported-by: <syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com>
      Reported-by: <syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com>
      Reported-by: <syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com>
      Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  19. Mar 25, 2020
  20. Mar 09, 2020
    • pid: make ENOMEM return value more obvious · 10dab84c
      Christian Brauner authored
      
      The alloc_pid() codepath used to be simpler. With the introduction of the
      ability to choose specific pids in 49cb2fc4 ("fork: extend clone3() to
      support setting a PID") it got more complex. It hasn't been super obvious
      that ENOMEM is returned when the pid namespace init process/child subreaper
      of the pid namespace has died, as can be seen from multiple attempts to
      improve this; see e.g. [1] and most recently [2].
      We regressed returning ENOMEM in [3] and [2] restored it. Let's add a
      comment on top explaining that this is historic and documented behavior and
      cannot easily be changed.
      
      [1]: 35f71bc0 ("fork: report pid reservation failure properly")
      [2]: b26ebfe1 ("pid: Fix error return value in some cases")
      [3]: 49cb2fc4 ("fork: extend clone3() to support setting a PID")
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  21. Mar 08, 2020
  22. Feb 28, 2020
    • proc: Remove the now unnecessary internal mount of proc · 69879c01
      Eric W. Biederman authored
      
      There remains no more code in the kernel using pids_ns->proc_mnt,
      therefore remove it from the kernel.
      
      The big benefit of this change is that one of the most error prone and
      tricky parts of the pid namespace implementation, maintaining kernel
      mounts of proc is removed.
      
      In addition removing the unnecessary complexity of the kernel mount
      fixes a regression that caused the proc mount options to be ignored.
      Now that the initial mount of proc comes from userspace, those mount
      options are again honored.  This fixes Android's usage of the proc
      hidepid option.
      
      Reported-by: Alistair Strachan <astrachan@google.com>
      Fixes: e94591d0 ("proc: Convert proc_mount to use mount_ns.")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  23. Feb 24, 2020
    • proc: Use a list of inodes to flush from proc · 7bc3e6e5
      Eric W. Biederman authored
      
      Rework the flushing of proc to use a list of directory inodes that
      need to be flushed.
      
      The list is kept on struct pid not on struct task_struct, as there is
      a fixed connection between proc inodes and pids but at least for the
      case of de_thread the pid of a task_struct changes.
      
      This removes the dependency on proc_mnt which allows for different
      mounts of proc having different mount options even in the same pid
      namespace and this allows for the removal of proc_mnt which will
      trivially allow the first mount of proc to honor its mount options.
      
      This flushing remains an optimization.  The functions
      pid_delete_dentry and pid_revalidate ensure that ordinary dcache
      management will not attempt to use dentries past the point their
      respective task has died.  When unused the shrinker will
      eventually be able to remove these dentries.
      
      There is a case in de_thread where proc_flush_pid can be
      called early for a given pid.  Which winds up being
      safe (if suboptimal) as this is just an optimization.
      
      Only pid directories are put on the list as the other
      per pid files are children of those directories and
      d_invalidate on the directory will get them as well.
      
      So that the pid can be used during flushing, its reference count is
      taken in release_task and dropped in proc_flush_pid.  Further the call
      of proc_flush_pid is moved after the tasklist_lock is released in
      release_task so that it is certain that the pid has already been
      unhashed when flushing is taking place.  This removes a small race
      where a dentry could be recreated.
      
      As struct pid is supposed to be small and I need a per pid lock
      I reuse the only lock that currently exists in struct pid, the
      wait_pidfd.lock.
      
      The net result is that this adds all of this functionality
      with just a little extra list management overhead and
      a single extra pointer in struct pid.
      
      v2: Initialize pid->inodes.  I somehow failed to get that
          initialization into the initial version of the patch.  A boot
          failure was reported by "kernel test robot <lkp@intel.com>", and
          failure to initialize that pid->inodes matches all of the reported
          symptoms.
      
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  24. Jan 13, 2020
    • pid: Implement pidfd_getfd syscall · 8649c322
      Sargun Dhillon authored
      
      This syscall allows for the retrieval of file descriptors from other
      processes, based on their pidfd. This is already possible using ptrace and
      the injection of parasitic code that leverages SCM_RIGHTS
      to move file descriptors between a tracee and a tracer. Unfortunately,
      ptrace comes with a high cost of requiring the process to be stopped,
      and breaks debuggers. This does not require stopping the process under
      manipulation.
      
      One reason to use this is to allow sandboxers to take actions on file
      descriptors on the behalf of another process. For example, this can be
      combined with seccomp-bpf's user notification to do on-demand fd
      extraction and take privileged actions. One such privileged action
      is binding a socket to a privileged port.
      
      /* prototype */
        /* flags is currently reserved and should be set to 0 */
        int sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
      
      /* testing */
      Ran self-test suite on x86_64
      
      Signed-off-by: Sargun Dhillon <sargun@sargun.me>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20200107175927.4558-3-sargun@sargun.me
      
      
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  25. Nov 15, 2019
    • fork: extend clone3() to support setting a PID · 49cb2fc4
      Adrian Reber authored
      
      The main motivation to add set_tid to clone3() is CRIU.
      
      To restore a process with the same PID/TID CRIU currently uses
      /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
      ns_last_pid and then (quickly) does a clone(). This works most of the
      time, but it is racy. It is also slow as it requires multiple syscalls.
      
      Extending clone3() to support *set_tid makes it possible to restore a
      process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
      race free (as long as the desired PID/TID is available).
      
      This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
      on clone3() with *set_tid as they are currently in place for ns_last_pid.
      
      The original version of this change was using a single value for
      set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
      decided to change set_tid to an array to enable setting the PID of a
      process in multiple PID namespaces at the same time. If a process is
      created in a PID namespace it is possible to influence the PID inside
      and outside of the PID namespace. Details also in the corresponding
      selftest.
      
      To create a process with the following PIDs:
      
            PID NS level         Requested PID
              0 (host)              31496
              1                        42
              2                         1
      
      For that example the two newly introduced parameters to struct
      clone_args (set_tid and set_tid_size) would need to be:
      
        set_tid[0] = 1;
        set_tid[1] = 42;
        set_tid[2] = 31496;
        set_tid_size = 3;
      
      If only the PIDs of the two innermost nested PID namespaces should be
      defined it would look like this:
      
        set_tid[0] = 1;
        set_tid[1] = 42;
        set_tid_size = 2;
      
      The PID of the newly created process would then be the next available
      free PID in the PID namespace level 0 (host) and 42 in the PID namespace
      at level 1 and the PID of the process in the innermost PID namespace
      would be 1.
      
      The set_tid array is used to specify the PID of a process starting
      from the innermost nested PID namespaces up to set_tid_size PID namespaces.
      
      set_tid_size cannot be larger than the current PID namespace level.
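
      The innermost-first ordering is easy to get backwards, so here is a small
      sketch of filling the array from a table written outermost first, as in
      the example above.  toy_fill_set_tid() is a hypothetical helper for
      illustration; it only builds the array and does not call clone3().

```c
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper: pids_outer_first[] lists the requested PID per
 * namespace level, starting at the outermost level that should be set.
 * set_tid[] is filled innermost first, as described above.
 * Returns the value to use for set_tid_size.
 */
size_t toy_fill_set_tid(const pid_t *pids_outer_first, size_t levels,
			pid_t *set_tid)
{
	for (size_t i = 0; i < levels; i++)
		set_tid[i] = pids_outer_first[levels - 1 - i];
	return levels;
}
```

      Passing { 31496, 42, 1 } with three levels yields the first example
      above, and passing { 42, 1 } with two levels yields the second.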
      
      Signed-off-by: Adrian Reber <areber@redhat.com>
      Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
      Acked-by: Andrei Vagin <avagin@gmail.com>
      Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
      
      
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  26. Oct 17, 2019
  27. Jul 17, 2019
    • kernel/pid.c: convert struct pid count to refcount_t · f57e515a
      Joel Fernandes (Google) authored
      struct pid's count is an atomic_t field used as a refcount.  Use
      refcount_t for it which is basically atomic_t but does additional
      checking to prevent use-after-free bugs.
      
      For memory ordering, the only change is with the following:
      
       -	if ((atomic_read(&pid->count) == 1) ||
       -	     atomic_dec_and_test(&pid->count)) {
       +	if (refcount_dec_and_test(&pid->count)) {
       		kmem_cache_free(ns->pid_cachep, pid);
      
      Here the change is from: Fully ordered --> RELEASE + ACQUIRE (as per
      refcount-vs-atomic.rst) This ACQUIRE should take care of making sure the
      free happens after the refcount_dec_and_test().
      
      The above hunk also removes atomic_read() since it is not needed for the
      code to work and it is unclear how beneficial it is.  The removal lets
      refcount_dec_and_test() check for cases where get_pid() happened before
      the object was freed.
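
      The RELEASE + ACQUIRE pattern can be sketched in userspace with C11
      atomics.  This is only an illustration: struct toy_ref, toy_ref_get() and
      toy_ref_put() are hypothetical names, and the sketch omits the extra
      checking that refcount_t performs.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pid's reference count. */
struct toy_ref {
	atomic_int count;
};

void toy_ref_get(struct toy_ref *r)
{
	/* Taking a reference needs no ordering of its own. */
	atomic_fetch_add_explicit(&r->count, 1, memory_order_relaxed);
}

/* Returns true when the caller dropped the last reference and may free. */
bool toy_ref_put(struct toy_ref *r)
{
	/* RELEASE: our earlier accesses to the object happen before the drop. */
	if (atomic_fetch_sub_explicit(&r->count, 1, memory_order_release) != 1)
		return false;
	/* ACQUIRE: the free is ordered after every other holder's drop. */
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```

      Only the caller that sees toy_ref_put() return true should free the
      object, which mirrors the single refcount_dec_and_test() check in the
      hunk above.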
      
      Link: http://lkml.kernel.org/r/20190701183826.191936-1-joel@joelfernandes.org
      
      
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: KJ Tsanaktsidis <ktsanaktsidis@zendesk.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  28. Jun 28, 2019
    • pid: add pidfd_open() · 32fcb426
      Christian Brauner authored
      
      This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
      pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
      process that is created via traditional fork()/clone() calls that is only
      referenced by a PID:
      
      int pidfd = pidfd_open(1234, 0);
      ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
      
      With the introduction of pidfds through CLONE_PIDFD it is possible to
      create pidfds at process creation time.
      However, a lot of processes get created with traditional PID-based calls
      such as fork() or clone() (without CLONE_PIDFD). For these processes a
      caller can currently not create a pollable pidfd. This is a problem for
      Android's low memory killer (LMK) and service managers such as systemd.
      Both are examples of tools that want to make use of pidfds to get reliable
      notification of process exit for non-parents (pidfd polling) and race-free
      signal sending (pidfd_send_signal()). They intend to switch to this API for
      process supervision/management as soon as possible. Having no way to get
      pollable pidfds from PID-only processes is one of the biggest blockers for
      them in adopting this api. With pidfd_open() making it possible to retrieve
      pidfds for PID-based processes we enable them to adopt this api.
      
      In line with Arnd's recent changes to consolidate syscall numbers across
      architectures, I have added the pidfd_open() syscall to all architectures
      at the same time.
      
      Signed-off-by: Christian Brauner <christian@brauner.io>
      Reviewed-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
    • pidfd: add polling support · b53b0b9d
      Joel Fernandes (Google) authored
      
      This patch adds polling support to pidfd.
      
      Android low memory killer (LMK) needs to know when a process dies once
      it is sent the kill signal. It does so by checking for the existence of
      /proc/pid which is both racy and slow. For example, if a PID is reused
      between when LMK sends a kill signal and when it checks for the existence
      of the PID, the wrong process may be checked for existence.
      Using the polling support, LMK will be able to get notified when a process
      exits in a race-free and fast way, and can do other things (such as
      polling on other fds) while waiting for the process being killed to die.
      
      For notification to polling processes, we follow the same existing
      mechanism in the kernel used when the parent of the task group is to be
      notified of a child's death (do_notify_parent). This is precisely when the
      tasks waiting on a poll of pidfd are also awakened in this patch.
      
      We have decided to include the waitqueue in struct pid for the following
      reasons:
      1. The wait queue has to survive for the lifetime of the poll. Including
         it in task_struct would not be an option in this case because the task can
         be reaped and destroyed before the poll returns.
      
      2. Including the waitqueue in struct pid means that during
         de_thread(), the new thread group leader automatically gets the new
         waitqueue/pid even though its task_struct is different.
      
      Appropriate test cases are added in the second patch to provide coverage of
      all the cases the patch is handling.
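
From userspace, the polling support described above might be exercised roughly as follows. The sketch assumes the pidfd is obtained with pidfd_open() (a pidfd from clone() with CLONE_PIDFD would work the same way); the helper names wait_for_exit and demo_poll_child_exit and the fallback syscall number are hypothetical choices for illustration, not part of the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallback number (assumption: unified syscall numbering). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/*
 * Block until the process behind pidfd exits or the timeout expires.
 * Returns 1 if the process exited, 0 on timeout, -1 on error.
 */
static int wait_for_exit(int pidfd, int timeout_ms)
{
	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
	int n;

	assert(pidfd >= 0);
	do {
		n = poll(&pfd, 1, timeout_ms);
	} while (n < 0 && errno == EINTR);

	if (n < 0)
		return -1;
	if (n == 0)
		return 0;
	/* A pidfd becomes readable once the process has terminated. */
	return (pfd.revents & POLLIN) ? 1 : -1;
}

/*
 * Returns 0 if the exit notification arrived, 1 if pidfd_open() is not
 * available in this environment, and -1 on any other error.
 */
static int demo_poll_child_exit(void)
{
	pid_t child = fork();

	if (child < 0)
		return -1;
	if (child == 0) {
		usleep(50 * 1000);	/* pretend to do a little work */
		_exit(0);
	}

	int pidfd = (int)syscall(SYS_pidfd_open, child, 0u);
	if (pidfd < 0) {
		int unsupported = (errno == ENOSYS || errno == EPERM);

		waitpid(child, NULL, 0);
		return unsupported ? 1 : -1;
	}

	int exited = wait_for_exit(pidfd, 5000);

	close(pidfd);
	waitpid(child, NULL, 0);	/* reap the zombie */
	return exited == 1 ? 0 : -1;
}
```

Because the pidfd is an ordinary file descriptor, it can sit in the same poll() set as other fds, which is the property the LMK use case relies on.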
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: kernel-team@android.com
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Co-developed-by: Daniel Colascione <dancol@google.com>
      Signed-off-by: Daniel Colascione <dancol@google.com>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Christian Brauner <christian@brauner.io>
  29. May 21, 2019
  30. May 15, 2019
  31. Dec 28, 2018
  32. Oct 31, 2018
  33. Sep 20, 2018
    • fork: report pid exhaustion correctly · f83606f5
      KJ Tsanaktsidis authored
      Make the clone and fork syscalls return EAGAIN when the limit on the
      number of pids /proc/sys/kernel/pid_max is exceeded.
      
      Currently, when the pid_max limit is exceeded, the kernel will return
      ENOSPC from the fork and clone syscalls.  This is contrary to the
      documented behaviour, which explicitly calls out the pid_max case as one
      where EAGAIN should be returned.  It also leads to really confusing error
      messages in userspace programs which will complain about a lack of disk
      space when they fail to create processes/threads for this reason.
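
With EAGAIN reported as documented, a userspace caller can treat pid exhaustion as a transient condition. The following is a minimal sketch of such handling; the helper name fork_with_retry, the attempt count, and the 10 ms back-off are all hypothetical choices for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Call fork(), retrying a bounded number of times when it fails with
 * EAGAIN (for example because the pid_max limit was reached).
 * Returns what fork() returns; on final failure errno is preserved.
 */
static pid_t fork_with_retry(int max_attempts)
{
	assert(max_attempts > 0);

	for (int attempt = 0; attempt < max_attempts; attempt++) {
		pid_t pid = fork();

		/* Success (parent or child) or a non-transient error. */
		if (pid >= 0 || errno != EAGAIN)
			return pid;

		/* Give exiting processes a moment to free their pids. */
		struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
		nanosleep(&delay, NULL);
	}

	errno = EAGAIN;	/* nanosleep() may have overwritten errno */
	return -1;
}
```

A caller checks the result exactly as it would for fork(): 0 in the child, a positive pid in the parent, and -1 with errno set on failure.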
      
      This error is being returned because alloc_pid() uses the idr api to find
      a new pid; when there are none available, idr_alloc_cyclic() returns
      -ENOSPC, and this is being propagated back to userspace.
      
      This behaviour has been broken before, and was explicitly fixed in
      commit 35f71bc0 ("fork: report pid reservation failure properly"),
      so I think -EAGAIN is definitely the right thing to return in this case.
      The current behaviour dates from commit 95846ecf ("pid:
      replace pid bitmap implementation with IDR API") and was, I believe,
      unintentional.
      
      This patch has no impact on the case where allocating a pid fails because
      the child reaper for the namespace is dead; that case will still return
      -ENOMEM.
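
The error translation described above can be pictured with a toy function. This is an illustration of the idea only, assuming a negative errno-style return value from the allocator; it is not the kernel's actual code, and the name translate_pid_alloc_error is hypothetical.

```c
#include <assert.h>
#include <errno.h>

/*
 * Map the allocator's "no free id" result to the error that fork() and
 * clone() are documented to return, and pass every other error through
 * unchanged (for example -ENOMEM when the child reaper is dead).
 */
static int translate_pid_alloc_error(int err)
{
	assert(err < 0);
	return (err == -ENOSPC) ? -EAGAIN : err;
}
```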
      
      Link: http://lkml.kernel.org/r/20180903111016.46461-1-ktsanaktsidis@zendesk.com
      
      Fixes: 95846ecf ("pid: replace pid bitmap implementation with IDR API")
      Signed-off-by: KJ Tsanaktsidis <ktsanaktsidis@zendesk.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Gargi Sharma <gs051095@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  34. Jul 21, 2018
    • pid: Implement PIDTYPE_TGID · 6883f81a
      Eric W. Biederman authored
      
      Everywhere except in the pid array we distinguish between a task's pid and
      a task's tgid (thread group id).  Even in the enumeration we want that
      distinction sometimes so we have added __PIDTYPE_TGID.  With leader_pid
      we almost have an implementation of PIDTYPE_TGID in struct signal_struct.
      
      Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
      into the pids array.  Then remove the __PIDTYPE_TGID special case and the
      leader_pid in signal_struct.
      
      The net size increase is just an extra pointer added to struct pid and
      an extra pair of pointers of an hlist_node added to task_struct.
      
      The effect on code maintenance is the removal of a number of special
      cases today and the potential to remove many more special cases as
      PIDTYPE_TGID gets used to its fullest.  The long term potential
      is allowing zombie thread group leaders to exit, which will remove
      a lot more special cases in the code.
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • pids: Move the pgrp and session pid pointers from task_struct to signal_struct · 2c470475
      Eric W. Biederman authored
      
      To access these fields the code always has to go to group leader so
      going to signal struct is no loss and is actually a fundamental simplification.
      
      This saves a little bit of memory by only allocating the pid pointer array
      once instead of once for every thread, and even better this removes a
      few potential races caused by the fact that group_leader can be changed
      by de_thread, while signal_struct can not.
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • pids: Compute task_tgid using signal->leader_pid · 7a36094d
      Eric W. Biederman authored
      
      The cost is the same and this removes the need
      to worry about complications that come from de_thread
      and group_leader changing.
      
      __task_pid_nr_ns has been updated to take advantage of this change.
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>