1. 18 Dec, 2018 1 commit
    • Arnd Bergmann's avatar
      y2038: socket: Add compat_sys_recvmmsg_time64 · e11d4284
      Arnd Bergmann authored
      recvmmsg() takes two arguments to pointers of structures that differ
      between 32-bit and 64-bit architectures: mmsghdr and timespec.
      
      For y2038 compatbility, we are changing the native system call from
      timespec to __kernel_timespec with a 64-bit time_t (in another patch),
      and use the existing compat system call on both 32-bit and 64-bit
      architectures for compatibility with traditional 32-bit user space.
      
      As we now have two variants of recvmmsg() for 32-bit tasks that are both
      different from the variant that we use on 64-bit tasks, this means we
      also require two compat system calls!
      
      The solution I picked is to flip things around: The existing
      compat_sys_recvmmsg() call gets moved from net/compat.c into net/socket.c
      and now handles the case for old user space on all architectures that
      have set CONFIG_COMPAT_32BIT_TIME.  A new compat_sys_recvmmsg_time64()
      call gets added in the old place for 64-bit architectures only, this
      one handles the case of a compat mmsghdr structure combined with
      __kernel_timespec.
      
      In the indirect sys_socketcall(), we now need to call either
      do_sys_recvmmsg() or __compat_sys_recvmmsg(), depending on what kind of
      architecture we are on. For compat_sys_socketcall(), no such change is
      needed, we always call __compat_sys_recvmmsg().
      
      I decided to not add a new SYS_RECVMMSG_TIME64 socketcall: Any libc
      implementation for 64-bit time_t will need significant changes including
      an updated asm/unistd.h, and it seems better to consistently use the
      separate syscalls that configuration, leaving the socketcall only for
      backward compatibility with 32-bit time_t based libc.
      
      The naming is asymmetric for the moment, so both existing syscalls
      entry points keep their names, while the new ones are recvmmsg_time32
      and compat_recvmmsg_time64 respectively. I expect that we will rename
      the compat syscalls later as we start using generated syscall tables
      everywhere and add these entry points.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      e11d4284
  2. 06 Jun, 2018 1 commit
    • Mathieu Desnoyers's avatar
      rseq: Introduce restartable sequences system call · d7822b1e
      Mathieu Desnoyers authored
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartables sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response-time has 1-2% gain avg. latency, and
      the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up do date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over current mechanisms available to read
      the current CPU number, which has the following benefits over
      alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdfSigned-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
      d7822b1e
  3. 10 May, 2018 1 commit
  4. 02 May, 2018 1 commit
    • Christoph Hellwig's avatar
      aio: implement io_pgetevents · 7a074e96
      Christoph Hellwig authored
      This is the io_getevents equivalent of ppoll/pselect and allows to
      properly mix signals and aio completions (especially with IOCB_CMD_POLL)
      and atomically executes the following sequence:
      
      	sigset_t origmask;
      
      	pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
      	ret = io_getevents(ctx, min_nr, nr, events, timeout);
      	pthread_sigmask(SIG_SETMASK, &origmask, NULL);
      
      Note that unlike many other signal related calls we do not pass a sigmask
      size, as that would get us to 7 arguments, which aren't easily supported
      by the syscall infrastructure.  It seems a lot less painful to just add a
      new syscall variant in the unlikely case we're going to increase the
      sigset size.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      7a074e96
  5. 05 Apr, 2018 1 commit
    • Dominik Brodowski's avatar
      syscalls/core: Prepare CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y for compat syscalls · 7303e30e
      Dominik Brodowski authored
      It may be useful for an architecture to override the definitions of the
      COMPAT_SYSCALL_DEFINE0() and __COMPAT_SYSCALL_DEFINEx() macros in
      <linux/compat.h>, in particular to use a different calling convention
      for syscalls. This patch provides a mechanism to do so, based on the
      previously introduced CONFIG_ARCH_HAS_SYSCALL_WRAPPER. If it is enabled,
      <asm/sycall_wrapper.h> is included in <linux/compat.h> and may be used
      to define the macros mentioned above. Moreover, as the syscall calling
      convention may be different if CONFIG_ARCH_HAS_SYSCALL_WRAPPER is set,
      the compat syscall function prototypes in <linux/compat.h> are #ifndef'd
      out in that case.
      
      As some of the syscalls and/or compat syscalls may not be present,
      the COND_SYSCALL() and COND_SYSCALL_COMPAT() macros in kernel/sys_ni.c
      as well as the SYS_NI() and COMPAT_SYS_NI() macros in
      kernel/time/posix-stubs.c can be re-defined in <asm/syscall_wrapper.h> iff
      CONFIG_ARCH_HAS_SYSCALL_WRAPPER is enabled.
      Signed-off-by: default avatarDominik Brodowski <linux@dominikbrodowski.net>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180405095307.3730-5-linux@dominikbrodowski.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      7303e30e
  6. 02 Apr, 2018 3 commits
  7. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: default avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: default avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  8. 23 Dec, 2016 1 commit
    • Al Viro's avatar
      move aio compat to fs/aio.c · c00d2c7e
      Al Viro authored
      ... and fix the minor buglet in compat io_submit() - native one
      kills ioctx as cleanup when put_user() fails.  Get rid of
      bogus compat_... in !CONFIG_AIO case, while we are at it - they
      should simply fail with ENOSYS, same as for native counterparts.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      c00d2c7e
  9. 13 Sep, 2016 1 commit
  10. 01 Dec, 2015 1 commit
    • Zach Brown's avatar
      vfs: add copy_file_range syscall and vfs helper · 29732938
      Zach Brown authored
      Add a copy_file_range() system call for offloading copies between
      regular files.
      
      This gives an interface to underlying layers of the storage stack which
      can copy without reading and writing all the data.  There are a few
      candidates that should support copy offloading in the nearer term:
      
      - btrfs shares extent references with its clone ioctl
      - NFS has patches to add a COPY command which copies on the server
      - SCSI has a family of XCOPY commands which copy in the device
      
      This system call avoids the complexity of also accelerating the creation
      of the destination file by operating on an existing destination file
      descriptor, not a path.
      
      Currently the high level vfs entry point limits copy offloading to files
      on the same mount and super (and not in the same file).  This can be
      relaxed if we get implementations which can copy between file systems
      safely.
      Signed-off-by: default avatarZach Brown <zab@redhat.com>
      [Anna Schumaker: Change -EINVAL to -EBADF during file verification,
                       Change flags parameter from int to unsigned int,
                       Add function to include/linux/syscalls.h,
                       Check copy len after file open mode,
                       Don't forbid ranges inside the same file,
                       Use rw_verify_area() to veriy ranges,
                       Use file_out rather than file_in,
                       Add COPY_FR_REFLINK flag]
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      29732938
  11. 06 Nov, 2015 1 commit
  12. 11 Sep, 2015 1 commit
    • Mathieu Desnoyers's avatar
      sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
      Mathieu Desnoyers authored
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      
      To explain the benefit of this scheme, let's introduce two example threads:
      
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      rcu_read_lock()/rcu_read_unlock())
      
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      
      Before the change, we had, for each smp_mb() pairs:
      
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      
      After the change, these pairs become:
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      
      1) Non-concurrent Thread A vs Thread B accesses:
      
      Thread A                    Thread B
      prev mem accesses
      sys_membarrier()
      follow mem accesses
                                  prev mem accesses
                                  barrier()
                                  follow mem accesses
      
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      
      2) Concurrent Thread A vs Thread B accesses
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
      
      * Benchmarks
      
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      looping)
      
      1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.
      
      * User-space user of this system call: Userspace RCU library
      
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
      synchronization rather than dumbly sending a signal to every process
      threads (as we currently do), we diminish the number of unnecessary wake
      ups and only issue the memory barriers on active threads. Non-running
      threads do not need to execute such barrier anyway, because these are
      implied by the scheduler context switches.
      
      Results in liburcu:
      
      Operations in 10s, 6 readers, 2 writers:
      
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal. This plays much more nicely with libraries,
      and with processes injected into for tracing purposes, for which we
      cannot expect that signals will be unused by the application.
      
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      
      This patch adds the system call to x86 and to asm-generic.
      
      [1] http://urcu.so
      
      membarrier(2) man page:
      
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
      
      NAME
             membarrier - issue memory barriers on a set of threads
      
      SYNOPSIS
             #include <linux/membarrier.h>
      
             int membarrier(int cmd, int flags);
      
      DESCRIPTION
             The cmd argument is one of the following:
      
             MEMBARRIER_CMD_QUERY
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
      
             MEMBARRIER_CMD_SHARED
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
                    are de facto in such a state). This covers threads from all pro=E2=80=90
                    cesses running on the system.  This command returns 0.
      
             The flags argument needs to be 0. For future extensions.
      
             All memory accesses performed  in  program  order  from  each  targeted
             thread is guaranteed to be ordered with respect to sys_membarrier(). If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
      
             The pair ordering is detailed as (O: ordered, X: not ordered):
      
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
      
      RETURN VALUE
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
      
      ERRORS
             ENOSYS System call is not implemented.
      
             EINVAL Invalid arguments.
      
      Linux                             2015-04-15                     MEMBARRIER(2)
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett's avatarJosh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b25b13a
  13. 04 Sep, 2015 1 commit
  14. 31 Jul, 2015 1 commit
    • Andy Lutomirski's avatar
      x86/ldt: Make modify_ldt() optional · a5b9e5a2
      Andy Lutomirski authored
      The modify_ldt syscall exposes a large attack surface and is
      unnecessary for modern userspace.  Make it optional.
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: security@kernel.org <security@kernel.org>
      Cc: xen-devel <xen-devel@lists.xen.org>
      Link: http://lkml.kernel.org/r/a605166a771c343fd64802dece77a903507333bd.1438291540.git.luto@kernel.org
      [ Made MATH_EMULATION dependent on MODIFY_LDT_SYSCALL. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      a5b9e5a2
  15. 15 Apr, 2015 1 commit
    • Iulia Manda's avatar
      kernel: conditionally support non-root users, groups and capabilities · 2813893f
      Iulia Manda authored
      There are a lot of embedded systems that run most or all of their
      functionality in init, running as root:root.  For these systems,
      supporting multiple users is not necessary.
      
      This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
      non-root users, non-root groups, and capabilities optional.  It is enabled
      under CONFIG_EXPERT menu.
      
      When this symbol is not defined, UID and GID are zero in any possible case
      and processes always have all capabilities.
      
      The following syscalls are compiled out: setuid, setregid, setgid,
      setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
      getgroups, setfsuid, setfsgid, capget, capset.
      
      Also, groups.c is compiled out completely.
      
      In kernel/capability.c, capable function was moved in order to avoid
      adding two ifdef blocks.
      
      This change saves about 25 KB on a defconfig build.  The most minimal
      kernels have total text sizes in the high hundreds of kB rather than
      low MB.  (The 25k goes down a bit with allnoconfig, but not that much.
      
      The kernel was booted in Qemu.  All the common functionalities work.
      Adding users/groups is not possible, failing with -ENOSYS.
      
      Bloat-o-meter output:
      add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarIulia Manda <iulia.manda21@gmail.com>
      Reviewed-by: Josh Triplett's avatarJosh Triplett <josh@joshtriplett.org>
      Acked-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2813893f
  16. 13 Dec, 2014 1 commit
    • David Drysdale's avatar
      syscalls: implement execveat() system call · 51f39a1f
      David Drysdale authored
      This patchset adds execveat(2) for x86, and is derived from Meredydd
      Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
      
      The primary aim of adding an execveat syscall is to allow an
      implementation of fexecve(3) that does not rely on the /proc filesystem,
      at least for executables (rather than scripts).  The current glibc version
      of fexecve(3) is implemented via /proc, which causes problems in sandboxed
      or otherwise restricted environments.
      
      Given the desire for a /proc-free fexecve() implementation, HPA suggested
      (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
      an appropriate generalization.
      
      Also, having a new syscall means that it can take a flags argument without
      back-compatibility concerns.  The current implementation just defines the
      AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
      added in future -- for example, flags for new namespaces (as suggested at
      https://lkml.org/lkml/2006/7/11/474).
      
      Related history:
       - https://lkml.org/lkml/2006/12/27/123 is an example of someone
         realizing that fexecve() is likely to fail in a chroot environment.
       - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
         documenting the /proc requirement of fexecve(3) in its manpage, to
         "prevent other people from wasting their time".
       - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
         problem where a process that did setuid() could not fexecve()
         because it no longer had access to /proc/self/fd; this has since
         been fixed.
      
      This patch (of 4):
      
      Add a new execveat(2) system call.  execveat() is to execve() as openat()
      is to open(): it takes a file descriptor that refers to a directory, and
      resolves the filename relative to that.
      
      In addition, if the filename is empty and AT_EMPTY_PATH is specified,
      execveat() executes the file to which the file descriptor refers.  This
      replicates the functionality of fexecve(), which is a system call in other
      UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
      so relies on /proc being mounted).
      
      The filename fed to the executed program as argv[0] (or the name of the
      script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
      (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
      reflecting how the executable was found.  This does however mean that
      execution of a script in a /proc-less environment won't work; also, script
      execution via an O_CLOEXEC file descriptor fails (as the file will not be
      accessible after exec).
      
      Based on patches by Meredydd Luff.
      Signed-off-by: default avatarDavid Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      51f39a1f
  17. 19 Nov, 2014 1 commit
  18. 26 Sep, 2014 1 commit
  19. 18 Aug, 2014 1 commit
    • Josh Triplett's avatar
      mm: Support compiling out madvise and fadvise · d3ac21ca
      Josh Triplett authored
      Many embedded systems will not need these syscalls, and omitting them
      saves space.  Add a new EXPERT config option CONFIG_ADVISE_SYSCALLS
      (default y) to support compiling them out.
      
      bloat-o-meter:
      add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-2250 (-2250)
      function                                     old     new   delta
      sys_fadvise64                                 57       -     -57
      sys_fadvise64_64                             691       -    -691
      sys_madvise                                 1502       -   -1502
      Signed-off-by: Josh Triplett's avatarJosh Triplett <josh@joshtriplett.org>
      d3ac21ca
  20. 08 Aug, 2014 2 commits
    • Vivek Goyal's avatar
      kexec: new syscall kexec_file_load() declaration · f0895685
      Vivek Goyal authored
      This is the new syscall kexec_file_load() declaration/interface.  I have
      reserved the syscall number only for x86_64 so far.  Other architectures
      (including i386) can reserve syscall number when they enable the support
      for this new syscall.
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f0895685
    • David Rheinsberg's avatar
      shm: add memfd_create() syscall · 9183df25
      David Rheinsberg authored
      memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
      that you can pass to mmap().  It can support sealing and avoids any
      connection to user-visible mount-points.  Thus, it's not subject to quotas
      on mounted file-systems, but can be used like malloc()'ed memory, but with
      a file-descriptor to it.
      
      memfd_create() returns the raw shmem file, so calls like ftruncate() can
      be used to modify the underlying inode.  Also calls like fstat() will
      return proper information and mark the file as regular file.  If you want
      sealing, you can specify MFD_ALLOW_SEALING.  Otherwise, sealing is not
      supported (like on all other regular files).
      
      Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
      subject to a filesystem size limit.  It is still properly accounted to
      memcg limits, though, and to the same overcommit or no-overcommit
      accounting as all user memory.
      Signed-off-by: David Rheinsberg's avatarDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9183df25
  21. 18 Jul, 2014 1 commit
    • Kees Cook's avatar
      seccomp: add "seccomp" syscall · 48dc92b9
      Kees Cook authored
      This adds the new "seccomp" syscall with both an "operation" and "flags"
      parameter for future expansion. The third argument is a pointer value,
      used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
      be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
      
      In addition to the TSYNC flag later in this patch series, there is a
      non-zero chance that this syscall could be used for configuring a fixed
      argument area for seccomp-tracer-aware processes to pass syscall arguments
      in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
      for this syscall. Additionally, this syscall uses operation, flags,
      and user pointer for arguments because strictly passing arguments via
      a user pointer would mean seccomp itself would be unable to trivially
      filter the seccomp syscall itself.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: default avatarAndy Lutomirski <luto@amacapital.net>
      48dc92b9
  22. 04 Jun, 2014 1 commit
    • Fabian Frederick's avatar
      sys_sgetmask/sys_ssetmask: add CONFIG_SGETMASK_SYSCALL · f6187769
      Fabian Frederick authored
      sys_sgetmask and sys_ssetmask are obsolete system calls no longer
      supported in libc.
      
      This patch replaces architecture related __ARCH_WANT_SYS_SGETMAX by expert
      mode configuration.That option is enabled by default for those
      architectures.
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f6187769
  23. 03 Apr, 2014 2 commits
  24. 09 May, 2013 1 commit
  25. 04 Mar, 2013 2 commits
  26. 14 Dec, 2012 1 commit
    • Kees Cook's avatar
      module: add syscall to load module from fd · 34e1169d
      Kees Cook authored
      As part of the effort to create a stronger boundary between root and
      kernel, Chrome OS wants to be able to enforce that kernel modules are
      being loaded only from our read-only crypto-hash verified (dm_verity)
      root filesystem. Since the init_module syscall hands the kernel a module
      as a memory blob, no reasoning about the origin of the blob can be made.
      
      Earlier proposals for appending signatures to kernel modules would not be
      useful in Chrome OS, since it would involve adding an additional set of
      keys to our kernel and builds for no good reason: we already trust the
      contents of our root filesystem. We don't need to verify those kernel
      modules a second time. Having to do signature checking on module loading
      would slow us down and be redundant. All we need to know is where a
      module is coming from so we can say yes/no to loading it.
      
      If a file descriptor is used as the source of a kernel module, many more
      things can be reasoned about. In Chrome OS's case, we could enforce that
      the module lives on the filesystem we expect it to live on.  In the case
      of IMA (or other LSMs), it would be possible, for example, to examine
      extended attributes that may contain signatures over the contents of
      the module.
      
      This introduces a new syscall (on x86), similar to init_module, that has
      only two arguments. The first argument is used as a file descriptor to
      the module and the second argument is a pointer to the NULL terminated
      string of module arguments.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (merge fixes)
      34e1169d
  27. 01 Jun, 2012 1 commit
    • Cyrill Gorcunov's avatar
      syscalls, x86: add __NR_kcmp syscall · d97b46a6
      Cyrill Gorcunov authored
      While doing the checkpoint-restore in the user space one need to determine
      whether various kernel objects (like mm_struct-s of file_struct-s) are
      shared between tasks and restore this state.
      
      The 2nd step can be solved by using appropriate CLONE_ flags and the
      unshare syscall, while there's currently no ways for solving the 1st one.
      
      One of the ways for checking whether two tasks share e.g.  mm_struct is to
      provide some mm_struct ID of a task to its proc file, but showing such
      info considered to be not that good for security reasons.
      
      Thus after some debates we end up in conclusion that using that named
      'comparison' syscall might be the best candidate.  So here is it --
      __NR_kcmp.
      
      It takes up to 5 arguments - the pids of the two tasks (which
      characteristics should be compared), the comparison type and (in case of
      comparison of files) two file descriptors.
      
      Lookups for pids are done in the caller's PID namespace only.
      
      At moment only x86 is supported and tested.
      
      [akpm@linux-foundation.org: fix up selftests, warnings]
      [akpm@linux-foundation.org: include errno.h]
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Valdis.Kletnieks@vt.edu
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d97b46a6
  28. 01 Nov, 2011 1 commit
    • Christopher Yeoh's avatar
      Cross Memory Attach · fcf63409
      Christopher Yeoh authored
      The basic idea behind cross memory attach is to allow MPI programs doing
      intra-node communication to do a single copy of the message rather than a
      double copy of the message via shared memory.
      
      The following patch attempts to achieve this by allowing a destination
      process, given an address and size from a source process, to copy memory
      directly from the source process into its own address space via a system
      call.  There is also a symmetrical ability to copy from the current
      process's address space into a destination process's address space.
      
      - Use of /proc/pid/mem has been considered, but there are issues with
        using it:
        - Does not allow for specifying iovecs for both src and dest, assuming
          preadv or pwritev was implemented either the area read from or
        written to would need to be contiguous.
        - Currently mem_read allows only processes who are currently
        ptrace'ing the target and are still able to ptrace the target to read
        from the target. This check could possibly be moved to the open call,
        but its not clear exactly what race this restriction is stopping
        (reason  appears to have been lost)
        - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
        domain socket is a bit ugly from a userspace point of view,
        especially when you may have hundreds if not (eventually) thousands
        of processes  that all need to do this with each other
        - Doesn't allow for some future use of the interface we would like to
        consider adding in the future (see below)
        - Interestingly reading from /proc/pid/mem currently actually
        involves two copies! (But this could be fixed pretty easily)
      
      As mentioned previously use of vmsplice instead was considered, but has
      problems.  Since you need the reader and writer working co-operatively if
      the pipe is not drained then you block.  Which requires some wrapping to
      do non blocking on the send side or polling on the receive.  In all to all
      communication it requires ordering otherwise you can deadlock.  And in the
      example of many MPI tasks writing to one MPI task vmsplice serialises the
      copying.
      
      There are some cases of MPI collectives where even a single copy interface
      does not get us the performance gain we could.  For example in an
      MPI_Reduce rather than copy the data from the source we would like to
      instead use it directly in a mathops (say the reduce is doing a sum) as
      this would save us doing a copy.  We don't need to keep a copy of the data
      from the source.  I haven't implemented this, but I think this interface
      could in the future do all this through the use of the flags - eg could
      specify the math operation and type and the kernel rather than just
      copying the data would apply the specified operation between the source
      and destination and store it in the destination.
      
      Although we don't have a "second user" of the interface (though I've had
      some nibbles from people who may be interested in using it for intra
      process messaging which is not MPI).  This interface is something which
      hardware vendors are already doing for their custom drivers to implement
      fast local communication.  And so in addition to this being useful for
      OpenMPI it would mean the driver maintainers don't have to fix things up
      when the mm changes.
      
      There was some discussion about how much faster a true zero copy would
      go. Here's a link back to the email with some testing I did on that:
      
      http://marc.info/?l=linux-mm&m=130105930902915&w=2
      
      There is a basic man page for the proposed interface here:
      
      http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
      
      This has been implemented for x86 and powerpc, other architecture should
      mainly (I think) just need to add syscall numbers for the process_vm_readv
      and process_vm_writev. There are 32 bit compatibility versions for
      64-bit kernels.
      
      For arch maintainers there are some simple tests to be able to quickly
      verify that the syscalls are working correctly here:
      
      http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgzSigned-off-by: default avatarChris Yeoh <yeohc@au1.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <linux-man@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fcf63409
  29. 26 Aug, 2011 1 commit
  30. 20 May, 2011 1 commit
  31. 05 May, 2011 1 commit
    • Anton Blanchard's avatar
      net: Add sendmmsg socket system call · 228e548e
      Anton Blanchard authored
      This patch adds a multiple message send syscall and is the send
      version of the existing recvmmsg syscall. This is heavily
      based on the patch by Arnaldo that added recvmmsg.
      
      I wrote a microbenchmark to test the performance gains of using
      this new syscall:
      
      http://ozlabs.org/~anton/junkcode/sendmmsg_test.c
      
      The test was run on a ppc64 box with a 10 Gbit network card. The
      benchmark can send both UDP and RAW ethernet packets.
      
      64B UDP
      
      batch   pkts/sec
      1       804570
      2       872800 (+ 8 %)
      4       916556 (+14 %)
      8       939712 (+17 %)
      16      952688 (+18 %)
      32      956448 (+19 %)
      64      964800 (+20 %)
      
      64B raw socket
      
      batch   pkts/sec
      1       1201449
      2       1350028 (+12 %)
      4       1461416 (+22 %)
      8       1513080 (+26 %)
      16      1541216 (+28 %)
      32      1553440 (+29 %)
      64      1557888 (+30 %)
      
      We see a 20% improvement in throughput on UDP send and 30%
      on raw socket send.
      
      [ Add sparc syscall entries. -DaveM ]
      Signed-off-by: default avatarAnton Blanchard <anton@samba.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      228e548e
  32. 15 Mar, 2011 2 commits
  33. 23 Sep, 2010 1 commit
  34. 28 Jul, 2010 1 commit