  1. Aug 21, 2023
    • nsproxy: Convert nsproxy.count to refcount_t · 2ddd3cac
      Elena Reshetova authored
      
      atomic_t variables are currently used to implement reference counters
      with the following properties:
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows and
      underflows. This is important since overflows and underflows can lead
      to use-after-free situations and be exploitable.
      
      The variable nsproxy.count is used as a pure reference counter. Convert it
      to refcount_t and fix up the operations.
      
      Important note for maintainers:
      
      Some functions from refcount_t API defined in refcount.h have different
      memory ordering guarantees than their atomic counterparts. Please check
      Documentation/core-api/refcount-vs-atomic.rst for more information.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in some
      rare cases it might matter. Please double check that you don't have
      some undocumented memory guarantees for this variable usage.
      
      For nsproxy.count it might make a difference in the following places:
       - put_nsproxy() and switch_task_namespaces(): the decrement in
         refcount_dec_and_test() only provides RELEASE ordering, plus ACQUIRE
         ordering on success, vs. the fully ordered atomic counterpart
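
The counter properties listed above can be illustrated with a small userspace sketch. This is not the kernel's refcount_t implementation: the sketch_ref type, its helper names, and the choice of UINT_MAX as the saturation value are all assumptions made for illustration. It only shows the idea that a counter refuses increments from zero, never wraps, and uses release ordering on the decrement with an acquire fence on the final put.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace illustration only; this is NOT the kernel's refcount_t. */
typedef struct { atomic_uint refs; } sketch_ref;

#define SKETCH_SATURATED UINT_MAX   /* hypothetical "stuck" value */

void sketch_ref_set(sketch_ref *r, unsigned int n)
{
    assert(n != 0);                 /* counters start at 1 or higher */
    atomic_store(&r->refs, n);
}

/* Take a reference unless the counter already reached zero. */
bool sketch_ref_inc_not_zero(sketch_ref *r)
{
    unsigned int old = atomic_load(&r->refs);
    do {
        if (old == 0)
            return false;           /* object is being freed: refuse */
        if (old == SKETCH_SATURATED)
            return true;            /* pinned forever, never wraps to 0 */
    } while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
    return true;
}

/* Drop a reference; returns true when the caller must free the object. */
bool sketch_ref_dec_and_test(sketch_ref *r)
{
    unsigned int old = atomic_load(&r->refs);
    do {
        if (old == 0 || old == SKETCH_SATURATED)
            return false;           /* underflow or saturated: leak, never free */
    } while (!atomic_compare_exchange_weak_explicit(&r->refs, &old, old - 1,
                 memory_order_release, memory_order_relaxed));
    if (old != 1)
        return false;
    /* Only the final put needs to observe earlier writes to the object. */
    atomic_thread_fence(memory_order_acquire);
    return true;
}
```

A caller would initialize the counter with sketch_ref_set(&r, 1) and free the object only when sketch_ref_dec_and_test() returns true. Leaking a saturated object is the deliberate trade-off: a leak is preferable to a use-after-free.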
      
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Link: https://lore.kernel.org/r/20230818041327.gonna.210-kees@kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
  2. Apr 21, 2023
  3. Oct 25, 2022
    • fs/exec: switch timens when a task gets a new mm · 2b5f9dad
      Andrei Vagin authored
      
      Changing a time namespace requires remapping a vvar page, so we don't want
      to allow doing that if any other tasks can use the same mm.
      
      Currently, we install a time namespace when a task is created with a new
      vm. exec() is another case when a task gets a new mm and so it can switch
      a time namespace safely, but it isn't handled now.
      
      One more issue of the current interface is that clone() with CLONE_VM isn't
      allowed if the current task has unshared a time namespace
      (timens_for_children doesn't match the current timens).
      
      Both of these issues inconvenience users. For example, Alexey
      and Florian reported that posix_spawn() uses vfork+exec and this pattern
      doesn't work with time namespaces because of both issues described above.
      LXC needed to work around the exec() issue by calling setns().
      
      In commit 133e2d3e ("fs/exec: allow to unshare a time namespace on
      vfork+exec"), we tried to fix these issues with minimal impact on UAPI. But
      it adds extra complexity and some undesirable side effects. Eric suggested
      fixing the issues properly because there is every reason to suppose that
      there are no users that depend on the old behavior.
      
      Cc: Alexey Izbyshev <izbyshev@ispras.ru>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Origin-author: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20220921003120.209637-1-avagin@google.com
  4. Sep 13, 2022
    • Revert "fs/exec: allow to unshare a time namespace on vfork+exec" · 33a2d6bc
      Andrei Vagin authored
      
      This reverts commit 133e2d3e.
      
      Alexey pointed out a few undesirable side effects of the reverted change.
      First, it doesn't take into account that CLONE_VFORK can be used with
      CLONE_THREAD. Second, a child process doesn't enter a target time namespace
      if its parent dies before the child calls exec. This happens because the parent
      clears vfork_done.

      Eric W. Biederman suggests installing a time namespace as a task gets a new mm.
      This includes all new processes cloned without CLONE_VM and all tasks that call
      exec(). This is a user API change, but we think there aren't users that depend
      on the old behavior.
      
      It is too late to make such changes in this release, so let's roll back
      this patch and introduce the right one in the next release.
      
      Cc: Alexey Izbyshev <izbyshev@ispras.ru>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20220913102551.1121611-3-avagin@google.com
  5. Jun 15, 2022
  6. Sep 03, 2021
  7. Nov 19, 2020
  8. Nov 18, 2020
  9. Jul 08, 2020
  10. Jun 16, 2020
  11. May 13, 2020
    • nsproxy: attach to namespaces via pidfds · 303cc571
      Christian Brauner authored
      
      For quite a while we have been thinking about using pidfds to attach to
      namespaces. This patchset has existed for about a year already but we've
      wanted to wait to see how the general api would be received and adopted.
      Now that more and more programs in userspace have started using pidfds
      for process management it's time to send this one out.
      
      This patch makes it possible to use pidfds to attach to the namespaces
      of another process, i.e. they can be passed as the first argument to the
      setns() syscall. When only a single namespace type is specified the
      semantics are equivalent to passing an nsfd. That means
      setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
      when a pidfd is passed, multiple namespace flags can be specified in the
      second setns() argument and setns() will attach the caller to all the
      specified namespaces all at once or to none of them. Specifying 0 is not
      valid together with a pidfd.
      
      Here are just two obvious examples:
      setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
      setns(pidfd, CLONE_NEWUSER);
      Allowing callers to attach to subsets of namespaces as well supports various
      use-cases where callers setns to a subset of namespaces to retain privilege,
      perform an action and then re-attach another subset of namespaces.
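
A minimal userspace sketch of the calling pattern described above, assuming a Linux system whose kernel supports pidfd_open() and setns() with a pidfd. The helper name attach_namespaces() is hypothetical, and the fallback syscall number 434 and flag values are assumptions taken from the Linux UAPI headers for use when the libc headers are older. Raw syscall() is used so the sketch does not depend on libc wrappers being present.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older headers; values assumed from the Linux UAPI. */
#ifndef CLONE_NEWNS
#define CLONE_NEWNS   0x00020000
#endif
#ifndef CLONE_NEWPID
#define CLONE_NEWPID  0x20000000
#endif
#ifndef CLONE_NEWNET
#define CLONE_NEWNET  0x40000000
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/*
 * Attach the caller to the namespaces of `pid` selected by `flags`.
 * Returns 0 on success or a negative errno value.
 */
int attach_namespaces(pid_t pid, int flags)
{
    if (flags == 0)
        return -EINVAL;   /* the commit states 0 is not valid with a pidfd */

    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -errno;

    /* Per the commit: either every requested namespace switches or none. */
    int ret = (int)syscall(SYS_setns, pidfd, flags) < 0 ? -errno : 0;
    close(pidfd);
    return ret;
}
```

A container manager holding the pid of a container's init process could call attach_namespaces(pid, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET), which corresponds to the first example in the commit message. The call typically needs CAP_SYS_ADMIN in the relevant user namespaces, so an unprivileged caller should expect a negative errno.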
      
      If the need arises, as Eric suggested, we can extend this patchset to
      assume even more context than just attaching all namespaces. His suggestion
      specifically was about assuming the process' root directory when
      setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just
      keep it flexible in terms of supporting subsets of namespaces but let's
      wait until we have users asking for even more context to be assumed. At
      that point we can add an extension.
      
      The obvious example where this is useful is a standard container
      manager interacting with a running container: pushing and pulling files
      or directories, injecting mounts, attaching/execing any kind of process,
      and managing network devices. All these operations require attaching to all
      or at least multiple namespaces at the same time. Given that nowadays
      most containers are spawned with all namespaces enabled we're currently
      looking at a minimum of 14 syscalls: 7 to open the /proc/<pid>/ns/<ns>
      nsfds and another 7 to actually perform the namespace switch. With time
      namespaces we're looking at about 16 syscalls.
      (We could amortize the first 7 or 8 syscalls for opening the nsfds by
       stashing them in each container's monitor process but that would mean
       we need to send around those file descriptors through unix sockets
       every time we want to interact with the container or keep on-disk
       state. Even in scenarios where a caller wants to join a particular
       namespace in a particular order callers still profit from batching
       other namespaces. That mostly applies to the user namespace but
       all container runtimes I found join the user namespace first no matter
       if it privileges or deprivileges the container similar to how unshare
       behaves.)
      With pidfds this becomes a single syscall no matter how many namespaces
      are supposed to be attached to.
      
      A decently designed, large-scale container manager usually isn't the
      parent of any of the containers it spawns so the containers don't die
      when it crashes or needs to update or reinitialize. This means that
      for the manager to interact with containers through pids is inherently
      racy especially on systems where the maximum pid number is not
      significantly bumped. This is even more problematic since we often spawn
      and manage thousands or tens of thousands of containers. Interacting with a
      container through a pid thus can become risky quite quickly. Especially
      since we allow for an administrator to enable advanced features such as
      syscall interception where we're performing syscalls in lieu of the
      container. In all of those cases we use pidfds if they are available and
      we pass them around as stable references. Using them to setns() to the
      target process' namespaces is as reliable as using nsfds. Either the
      target process is already dead and we get ESRCH or we manage to attach
      to its namespaces but we can't accidentally attach to another process'
      namespaces. So pidfds lend themselves to be used with this api.
      The other main advantage is that with this change the pidfd becomes the
      only relevant token for most container interactions and it's the only
      token we need to create and send around.
      
      Apart from significantly reducing the number of syscalls from double
      digit to single digit, which is a decent reason post-spectre/meltdown,
      this also allows switching to a set of namespaces atomically, i.e.
      either attaching to all the specified namespaces succeeds or we fail. If
      we fail we haven't changed a single namespace. There are currently three
      namespaces that can fail (other than for ENOMEM which really is not
      very interesting since we then have other problems anyway) for
      non-trivial reasons, user, mount, and pid namespaces. We can fail to
      attach to a pid namespace if it is not our current active pid namespace
      or a descendant of it. We can fail to attach to a user namespace because
      we are multi-threaded or because our current mount namespace shares
      filesystem state with other tasks, or because we're trying to setns()
      to the same user namespace, i.e. the target task has the same user
      namespace as we do. We can fail to attach to a mount namespace because
      it shares filesystem state with other tasks or because we fail to lookup
      the new root for the new mount namespace. In most non-pathological
      scenarios these issues can be somewhat mitigated. But there are cases where
      we're half-attached to some namespace and failing to attach to another one.
      I've talked about some of these problems during the hallway track (something
      only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles
      in 2018(?). Even if all these issues could be avoided with super careful
      userspace coding it would be nicer to have this done in-kernel. Pidfds seem
      to lend themselves nicely for this.
      
      The other neat thing about this is that setns() becomes an actual
      counterpart to the namespace bits of unshare().
      
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Serge Hallyn <serge@hallyn.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Link: https://lore.kernel.org/r/20200505140432.181565-3-christian.brauner@ubuntu.com
  12. May 09, 2020
    • nsproxy: add struct nsset · f2a8d52e
      Christian Brauner authored
      
      Add a simple struct nsset. It holds all necessary pieces to switch to a new
      set of namespaces without leaving a task in a half-switched state which we
      will make use of in the next patch. This patch switches the existing setns
      logic over without causing a change in setns() behavior. This brings
      setns() closer to how unshare() works. The prepare_ns() function is
      responsible for preparing all necessary information. There are two reasons
      for this. First, it minimizes dependencies between individual namespaces,
      i.e. all install handlers can expect that all fields are properly
      initialized independent of the order in which they are called. Second,
      this makes the code easier to maintain and easier to follow if it needs
      to be changed.
      
      The prepare_ns() helper will only be switched over to use a flags argument
      in the next patch. Here it will still use nstype as a simple integer
      argument, which was argued to be clearer. I have no strong opinion on
      whether this really helps. The struct nsset itself
      already contains the flags field since its name already indicates that it
      can contain information required by different namespaces. None of this
      should have functional consequences.
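
The prepare, install, commit flow described above can be sketched in plain C. This is an illustrative model under stated assumptions, not the kernel's struct nsset: the sk_ prefixed types and functions are hypothetical, namespaces are reduced to integer ids, and a negative target id stands in for any check an install handler might fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { SK_NS_MNT, SK_NS_PID, SK_NS_NET, SK_NS_COUNT };

struct sk_task  { int ns[SK_NS_COUNT]; };                  /* live state */
struct sk_nsset { unsigned flags; int ns[SK_NS_COUNT]; };  /* staged copy */

/* Phase 1: copy the task's current state into a private staging area. */
void sk_prepare(struct sk_nsset *set, const struct sk_task *t, unsigned flags)
{
    set->flags = flags;
    for (size_t i = 0; i < SK_NS_COUNT; i++)
        set->ns[i] = t->ns[i];
}

/* Phase 2: an install handler only touches the staging area and may fail. */
bool sk_install(struct sk_nsset *set, int type, int target)
{
    if (target < 0)
        return false;       /* stand-in for a permission or validity check */
    set->ns[type] = target;
    return true;
}

/* Phase 3: publish the staged state; nothing in here can fail. */
void sk_commit(struct sk_task *t, const struct sk_nsset *set)
{
    for (size_t i = 0; i < SK_NS_COUNT; i++)
        t->ns[i] = set->ns[i];
}

/* Switch several namespaces at once: all succeed or the task is untouched. */
bool sk_switch(struct sk_task *t, const int targets[SK_NS_COUNT], unsigned flags)
{
    struct sk_nsset set;

    sk_prepare(&set, t, flags);
    for (int i = 0; i < SK_NS_COUNT; i++)
        if ((flags & (1u << i)) && !sk_install(&set, i, targets[i]))
            return false;   /* staging area is discarded, task unchanged */
    sk_commit(t, &set);
    return true;
}
```

Because every handler works on the staged copy, a failure in any one of them leaves the task exactly as it was, which is the "no half-switched state" property the commit message is after.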
      
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Serge Hallyn <serge@hallyn.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com
  13. Jan 14, 2020
    • ns: Introduce Time Namespace · 769071ac
      Andrei Vagin authored
      
      Time Namespace isolates clock values.
      
      The kernel provides access to several clocks: CLOCK_REALTIME,
      CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
      
      CLOCK_REALTIME
            System-wide clock that measures real (i.e., wall-clock) time.
      
      CLOCK_MONOTONIC
            Clock that cannot be set and represents monotonic time since
            some unspecified starting point.
      
      CLOCK_BOOTTIME
            Identical to CLOCK_MONOTONIC, except it also includes any time
            that the system is suspended.
      
      For many users, the time namespace means the ability to change the date and
      time in a container (CLOCK_REALTIME). Providing per-namespace notions of
      CLOCK_REALTIME would be complex with a massive overhead, and is of dubious
      value.
      
      But in the context of checkpoint/restore functionality, monotonic and
      boottime clocks become interesting. Both clocks are monotonic with
      unspecified starting points. These clocks are widely used to measure time
      slices and set timers. After restoring or migrating processes, it has to be
      guaranteed that they never go backward. In an ideal case, the behavior of
      these clocks should be the same as for a case when a whole system is
      suspended. All this means that it is required to set CLOCK_MONOTONIC and
      CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
      offsets for clocks.
      
      A time namespace is similar to a pid namespace in how it is
      created: the unshare(CLONE_NEWTIME) system call creates a new time namespace,
      but doesn't set it for the current process. Then all children of the process
      will be born in the new time namespace, or a process can use the setns()
      system call to join a namespace.
      
      This scheme allows setting clock offsets for a namespace, before any
      processes appear in it.
      
      All available clone flags have been used, so CLONE_NEWTIME uses the highest
      bit of CSIGNAL. It means that it can be used only with the unshare() and
      the clone3() system calls.
      
      [ tglx: Adjusted paragraph about clone3() to reality and massaged the
        changelog a bit. ]
      
      Co-developed-by: Dmitry Safonov <dima@arista.com>
      Signed-off-by: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://criu.org/Time_namespace
      Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
      Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com
  14. Jun 05, 2019
  15. Mar 13, 2017
    • perf: Add PERF_RECORD_NAMESPACES to include namespaces related info · e4222673
      Hari Bathini authored
      
      With the advent of container technologies like Docker, which depend on
      namespaces for isolation, there is a need for tracing support for
      namespaces. This patch introduces a new PERF_RECORD_NAMESPACES event for
      recording namespace-related info. By recording info for every
      namespace, it is left to userspace to decide on the definition of a
      container and trace containers by updating the perf tool accordingly.
      
      Each namespace has a combination of device and inode numbers. Though
      every namespace currently has the same device number, that may change in
      the future to avoid the need for a namespace of namespaces. Considering that
      possibility, record both device and inode numbers separately for each
      namespace.
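
From userspace, the same device and inode pair can be observed by calling stat() on the /proc/self/ns/<name> entries. The sketch below assumes a Linux system with /proc mounted; struct sk_ns_id and both helpers are hypothetical names used for illustration and are unrelated to the perf record layout.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Identity of a namespace as seen through /proc: device plus inode. */
struct sk_ns_id {
    unsigned long long dev;
    unsigned long long ino;
};

/* Fill `out` for the caller's namespace `ns_name` ("net", "pid", ...).
 * Returns 0 on success, -1 if the entry cannot be stat()ed. */
int sk_read_ns_id(const char *ns_name, struct sk_ns_id *out)
{
    char path[64];
    struct stat st;

    snprintf(path, sizeof(path), "/proc/self/ns/%s", ns_name);
    if (stat(path, &st) != 0)
        return -1;
    out->dev = (unsigned long long)st.st_dev;
    out->ino = (unsigned long long)st.st_ino;
    return 0;
}

/* Two ids name the same namespace only if both numbers match. */
int sk_same_ns(const struct sk_ns_id *a, const struct sk_ns_id *b)
{
    return a->dev == b->dev && a->ino == b->ino;
}
```

Comparing both fields rather than the inode alone follows the reasoning in the commit message: the device number is the same for every namespace today, but a tool that ignores it would break if that ever changed.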
      
      Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  16. Feb 16, 2016
    • cgroup: introduce cgroup namespaces · a79a908f
      Aditya Kali authored
      
      Introduce the ability to create a new cgroup namespace. The newly created
      cgroup namespace remembers the cgroup of the process at the point
      of creation of the cgroup namespace (referred to as cgroupns-root).
      The main purpose of the cgroup namespace is to virtualize the contents
      of the /proc/self/cgroup file. Processes inside a cgroup namespace
      are only able to see paths relative to their namespace root
      (unless they are moved outside of their cgroupns-root, at which point
       they will see a relative path from their cgroupns-root).
      For a correctly set up container this enables container tools
      (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
      containers without leaking the system-level cgroup hierarchy to the task.
      This patch only implements the 'unshare' part of the cgroupns.
      
      Signed-off-by: Aditya Kali <adityakali@google.com>
      Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  17. Dec 04, 2014
  18. Jul 30, 2014
    • namespaces: Use task_lock and not rcu to protect nsproxy · 728dba3a
      Eric W. Biederman authored
      
      The synchronous synchronize_rcu in switch_task_namespaces makes setns
      a sufficiently expensive system call that people have complained.

      Upon inspection nsproxy no longer needs rcu protection for remote reads.
      Remote reads are rare.  So optimize for same-process reads and writes
      by switching to using task_lock instead.
      
      This yields a simpler to understand lock, and a faster setns system call.
      
      In particular this fixes a performance regression observed
      by Rafael David Tinoco <rafael.tinoco@canonical.com>.
      
      This is effectively a revert of Pavel Emelyanov's commit
      cf7b708c Make access to task's nsproxy lighter
      from 2007.  The race this originally fixed no longer exists as
      do_notify_parent uses task_active_pid_ns(parent) instead of
      parent->nsproxy.
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  19. Aug 31, 2013
  20. Aug 27, 2013
  21. May 01, 2013
  22. Feb 23, 2013
  23. Feb 22, 2013
  24. Nov 20, 2012
  25. Nov 19, 2012
    • vfs: Add a user namespace reference from struct mnt_namespace · 771b1371
      Eric W. Biederman authored
      
      This will allow for support for unprivileged mounts in a new user namespace.
      
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • pidns: Support unsharing the pid namespace. · 50804fe3
      Eric W. Biederman authored
      
      Unsharing of the pid namespace, unlike unsharing of other namespaces,
      does not take effect immediately.  Instead it affects the children
      created with fork and clone.  The first of these children becomes the init
      process of the new pid namespace, the rest become oddball children
      of pid 0.  From the point of view of the new pid namespace the process
      that created it is pid 0, as its pid does not map.
      
      A couple of different semantics were considered but this one was
      settled on because it is easy to implement and it is usable from
      pam modules, the core reason for the existence of unshare.
      
      I took a survey of the callers of pam modules and the following
      appears to be a representative sample of their logic.
      {
              setup stuff include pam
              child = fork();
              if (!child) {
                      setuid()
                      exec /bin/bash
              }
              waitpid(child);

              pam and other cleanup
      }
      
      As you can see there is a fork to create the unprivileged user
      space process.  Which means that the unprivileged user space
      process will appear as pid 1 in the new pid namespace.  Further
      most login processes do not cope with extraneous children which
      means shifting the duty of reaping extraneous child processes to
      the creator of those extraneous children makes the system more
      comprehensible.
      
      The practical reason for this set of pid namespace semantics is
      that it is simple to implement and to verify that they work correctly.
      Whereas an implementation that requires changing the struct
      pid on a process comes with a lot more races and pain.  Not
      the least of which is that glibc caches getpid().
      
      These semantics are implemented by having two notions
      of the pid namespace of a process.  There is task_active_pid_ns,
      which is the pid namespace the process was created with
      and the pid namespace that all pids are presented to
      that process in.  The task_active_pid_ns is stored
      in the struct pid of the task.

      Then there is the pid namespace that will be used for children;
      that pid namespace is stored in task->nsproxy->pid_ns.
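
The semantics above can be observed from userspace with a short sketch, assuming a Linux system and a caller that is single-threaded and has the privilege to create a pid namespace (otherwise unshare() fails and the helper reports that). The function name sk_first_child_is_init() is hypothetical. Raw syscalls are used for unshare and getpid so that the result does not depend on libc wrappers or on any getpid() caching.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_NEWPID
#define CLONE_NEWPID 0x20000000     /* assumed UAPI value for older headers */
#endif

/*
 * Returns 1 if the first child forked after unshare(CLONE_NEWPID) saw
 * itself as pid 1, 0 if it did not, and -1 if unshare(), fork() or
 * waitpid() failed (for example with EPERM when unprivileged).
 */
int sk_first_child_is_init(void)
{
    /* The caller's own pid does not change; only future children move. */
    if (syscall(SYS_unshare, CLONE_NEWPID) != 0)
        return -1;

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        _exit(syscall(SYS_getpid) == 1 ? 0 : 1);

    int status;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Once that first child exits, the new pid namespace has lost its init process, so further fork() calls from the same parent are expected to fail. That is one reason this pattern fits the fork-then-wait structure of the PAM callers shown in the commit message.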
      
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • pidns: Use task_active_pid_ns where appropriate · 17cf22c3
      Eric W. Biederman authored
      
      The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
      aka ns_of_pid(task_pid(tsk)) should have the same number of
      cache line misses with the practical difference that
      ns_of_pid(task_pid(tsk)) is released later in a process's life.

      Furthermore by using task_active_pid_ns it becomes trivial
      to write an unshare implementation for the pid namespace.
      
      So I have used task_active_pid_ns everywhere I can.
      
      In fork since the pid has not yet been attached to the
      process I use ns_of_pid, to achieve the same effect.
      
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • pidns: Capture the user namespace and filter ns_last_pid · 49f4d8b9
      Eric W. Biederman authored
      
      - Capture the user namespace that creates the pid namespace
      - Use that user namespace to test if it is ok to write to
        /proc/sys/kernel/ns_last_pid.

      Zhao Hongjiang <zhaohongjiang@huawei.com> noticed I was missing a put_user_ns
      when destroying a pid_ns.  I have folded his patch into this one
      so that bisects will work properly.
      
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • userns: make each net (net_ns) belong to a user_ns · 038e7332
      Eric W. Biederman authored
      
      The user namespace which creates a new network namespace owns that
      namespace and all resources created in it.  This way we can target
      capability checks for privileged operations against network resources to
      the user_ns which created the network namespace in which the resource
      lives.  Privilege to the user namespace which owns the network
      namespace, or any parent user namespace thereof, provides the same
      privilege to the network resource.
      
      This patch is reworked from a version originally by
      Serge E. Hallyn <serge.hallyn@canonical.com>
      
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • userns: make each net (net_ns) belong to a user_ns · d328b836
      Eric W. Biederman authored
      
      The user namespace which creates a new network namespace owns that
      namespace and all resources created in it.  This way we can target
      capability checks for privileged operations against network resources to
      the user_ns which created the network namespace in which the resource
      lives.  Privilege to the user namespace which owns the network
      namespace, or any parent user namespace thereof, provides the same
      privilege to the network resource.
      
      This patch is reworked from a version originally by
      Serge E. Hallyn <serge.hallyn@canonical.com>
      
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. Oct 31, 2011
  27. Jul 20, 2011
  28. May 27, 2011
  29. May 10, 2011
    • ns: Introduce the setns syscall · 0663c6f8
      Eric W. Biederman authored
      
      With the networking stack today there is demand to handle
      multiple network stacks at a time.  Not in the context
      of containers but in the context of people doing interesting
      things with routing.
      
      There is also demand in the context of containers to have
      an efficient way to execute some code in the container itself.
      If nothing else it is very useful as a debugging technique.
      
      Both problems can be solved by starting some form of login
      daemon in the namespaces people want access to, or you
      can play games by ptracing a process and getting the
      traced process to do things you want it to do. However
      it turns out that a login daemon or a ptrace puppet
      controller are more code, they are more prone to
      failure, and generally they are less efficient than
      simply changing the namespace of a process to a
      specified one.
      
      Pieces of this puzzle can also be solved by, instead of
      coming up with a general purpose system call, coming up
      with targeted system calls, perhaps socketat, that solve
      a subset of the larger problem.  Overall that appears
      to be more work for less reward.
      
      int setns(int fd, int nstype);
      
      The fd argument is a file descriptor referring to a proc
      file of the namespace you want to switch the process to.
      
      In the setns system call the nstype is 0, or it specifies
      the clone flag of the namespace you intend to change,
      to prevent changing a namespace unintentionally.
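
A minimal sketch of how a caller might use this interface, assuming a Linux system with /proc mounted and the namespace files exposed as /proc/<pid>/ns/<name>. The helper names sk_ns_path() and sk_enter_ns() are hypothetical, and the raw syscall is used so the sketch does not rely on a libc setns() wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/proc/<pid>/ns/<ns>" into buf; returns 0, or -1 if it does not fit. */
int sk_ns_path(char *buf, size_t len, pid_t pid, const char *ns)
{
    int n = snprintf(buf, len, "/proc/%ld/ns/%s", (long)pid, ns);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Move the calling thread into namespace `ns` of process `pid`.
 * `nstype` is 0 (accept any namespace type) or the matching CLONE_NEW*
 * flag as a safety check. Returns 0 or a negative errno value.
 */
int sk_enter_ns(pid_t pid, const char *ns, int nstype)
{
    char path[64];

    if (sk_ns_path(path, sizeof(path), pid, ns) != 0)
        return -ENAMETOOLONG;

    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -errno;

    int ret = syscall(SYS_setns, fd, nstype) < 0 ? -errno : 0;
    close(fd);
    return ret;
}
```

For example, sk_enter_ns(pid, "net", CLONE_NEWNET) would fail rather than switch anything if the opened file turned out not to refer to a network namespace, which is the accident the nstype argument is meant to prevent.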
      
      v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
      v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
      v4: Moved wiring up of the system call to another patch
      v5: Cleaned up the system call arguments
          - Changed the order.
          - Modified nstype to take the standard clone flags.
      v6: Added missing error handling as pointed out by Matt Helsley <matthltc@us.ibm.com>
      
      Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  30. Mar 24, 2011