  1. Sep 22, 2021
  2. Sep 14, 2021
    • memblock: introduce saner 'memblock_free_ptr()' interface · 77e02cf5
      Linus Torvalds authored
      The boot-time allocation interface for memblock is a mess, with
      'memblock_alloc()' returning a virtual pointer, but then you are
      supposed to free it with 'memblock_free()' that takes a _physical_
      address.
      
      Not only is that all kinds of strange and illogical, but it actually
      causes bugs, when people then use it like a normal allocation function,
      and it fails spectacularly on a NULL pointer:
      
         https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/
      
      or just random memory corruption if the debug checks don't catch it:
      
         https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/
      
      
      
      I really don't want to apply patches that treat the symptoms, when the
      fundamental cause is this horribly confusing interface.
      
      I started out looking at just automating a sane replacement sequence,
      but because of this mix of virtual and physical addresses, and because
      people have used the "__pa()" macro that can take either a regular
      kernel pointer, or just the raw "unsigned long" address, it's all quite
      messy.
      
      So this just introduces a new saner interface for freeing a virtual
      address that was allocated using 'memblock_alloc()', and that was kept
      as a regular kernel pointer.  And then it converts a couple of users
      that are obvious and easy to test, including the 'xbc_nodes' case in
      lib/bootconfig.c that caused problems.
      
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Fixes: 40caa127 ("init: bootconfig: Remove all bootconfig data when the init memory is removed")
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
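      The asymmetry described above, and the NULL-tolerant wrapper that
      removes it, can be sketched in plain userspace C. Everything prefixed
      model_ is hypothetical and only imitates the shape of the kernel
      helpers; the offset-based "physical address" is an assumption made for
      illustration, not how __pa() works on any real architecture.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
typedef uint64_t model_phys_addr_t;

#define MODEL_PHYS_BASE 0x1000u

static char model_ram[4096];               /* stand-in for directly mapped memory */
static model_phys_addr_t last_freed_base;  /* records the most recent free */
static size_t last_freed_size;
static int free_calls;

/* Imitates __pa(): translate a "virtual" pointer to a "physical" address. */
static model_phys_addr_t model_pa(const void *virt)
{
	return (model_phys_addr_t)((const char *)virt - model_ram) + MODEL_PHYS_BASE;
}

/* Imitates the old-style free that expects a physical address. */
static void model_memblock_free(model_phys_addr_t base, size_t size)
{
	last_freed_base = base;
	last_freed_size = size;
	free_calls++;
}

/* Pointer-based wrapper: accepts what the allocator returned, ignores NULL. */
static void model_memblock_free_ptr(void *ptr, size_t size)
{
	if (ptr)
		model_memblock_free(model_pa(ptr), size);
}
```

      With a wrapper of this shape, a caller can hand back exactly the
      pointer it got from the allocator, and freeing NULL becomes a no-op
      instead of a translation of address zero.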
  3. Sep 08, 2021
    • init/bootconfig: Reorder init parameter from bootconfig and cmdline · b66fbbe8
      Masami Hiramatsu authored
      Reorder the init parameters from bootconfig and kernel cmdline
      so that the kernel cmdline params are always the last part of the
      parameters, as below.
      
       " -- "[bootconfig init params][cmdline init params]
      
      This change helps prevent bootconfig init params from overwriting
      the init params that the user gives on the command line.
      
      Link: https://lkml.kernel.org/r/163077085675.222577.5665176468023636160.stgit@devnote2
      
      
      
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
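      The ordering above can be illustrated with a small string builder.
      This is a sketch only: build_init_args() is a hypothetical helper, not
      the kernel's code, and the single space inserted between the two parts
      is an assumption. It also assumes the consumer lets a later occurrence
      of a parameter override an earlier one, which is why the cmdline part
      goes last.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write " -- " followed by the bootconfig init params
 * and then the cmdline init params into out. Returns the string length,
 * or -1 if the result did not fit.
 */
static int build_init_args(char *out, size_t outlen,
			   const char *bootconfig_params,
			   const char *cmdline_params)
{
	/* Only add a separator when both parts are non-empty. */
	const char *sep = (*bootconfig_params && *cmdline_params) ? " " : "";
	int n = snprintf(out, outlen, " -- %s%s%s",
			 bootconfig_params, sep, cmdline_params);

	return (n < 0 || (size_t)n >= outlen) ? -1 : n;
}
```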
    • init: bootconfig: Remove all bootconfig data when the init memory is removed · 40caa127
      Masami Hiramatsu authored
      Since the bootconfig is used only in the init functions,
      it doesn't need to keep the data after boot. Free it when
      the init memory is removed.
      
      Link: https://lkml.kernel.org/r/163077084958.222577.5924961258513004428.stgit@devnote2
      
      
      
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • trap: cleanup trap_init() · 8b097881
      Kefeng Wang authored
      There are some empty trap_init() definitions in different ARCHs. Introduce
      a new weak trap_init() function to clean them up.
      
      Link: https://lkml.kernel.org/r/20210812123602.76356-1-wangkefeng.wang@huawei.com
      
      
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>	[arm32]
      Acked-by: Vineet Gupta						[arc]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>			[powerpc]
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <palmerdabbelt@google.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
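      A weak default of this kind can be sketched as follows. This is a
      minimal illustration using the GCC/Clang attribute that the kernel's
      __weak macro expands to; it is not claimed to be the exact definition
      that was merged. An architecture that needs real work defines an
      ordinary (strong) trap_init() in its own file, and the linker then
      picks that definition over this one.

```c
/*
 * Weak fallback: used only when no other object file provides a strong
 * definition of trap_init().
 */
__attribute__((weak)) void trap_init(void)
{
	/* Nothing to do by default. */
}
```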
    • init: move usermodehelper_enable() to populate_rootfs() · b234ed6d
      Rasmus Villemoes authored
      Currently, usermodehelper is enabled right before PID1 starts going
      through the initcalls. However, any call of a usermodehelper from a
      pure_, core_, postcore_, arch_, subsys_ or fs_ initcall is futile, as
      there is no filesystem contents yet.
      
      Up until commit e7cb072e ("init/initramfs.c: do unpacking
      asynchronously"), such calls, whether via some request_module(), a
      legacy uevent "/sbin/hotplug" notification or something else, would
      just fail silently with (presumably) -ENOENT from
      kernel_execve(). However, that commit introduced the
      wait_for_initramfs() synchronization hook which must be called from
      the usermodehelper exec path right before the kernel_execve, in order
      that request_module() et al done from *after* rootfs_initcall()
      time (i.e. device_ and late_ initcalls) would continue to find a
      populated initramfs as they used to.
      
      Any call of wait_for_initramfs() done before the unpacking has been
      scheduled (i.e. before rootfs_initcall time) must just return
      immediately [and let the caller find an empty file system] in order
      not to deadlock the machine. I mistakenly thought, and my limited
      testing confirmed, that there were no such calls, so I added a
      pr_warn_once() in wait_for_initramfs(). It turns out that one can
      indeed hit request_module() as well as kobject_uevent_env() during
      those early init calls, leading to a user-visible warning in the
      kernel log emitted consistently for certain configurations.
      
      We could just remove the pr_warn_once(), but I think it's better to
      postpone enabling the usermodehelper framework until there is at least
      some chance of finding the executable. That is also a little more
      efficient in that a lot of work done in umh.c will be elided. However,
      it does change the error seen by those early callers from -ENOENT to
      -EBUSY, so there is a risk of a regression if any caller cares about
      the exact error value.
      
      Link: https://lkml.kernel.org/r/20210728134638.329060-1-linux@rasmusvillemoes.dk
      
      
      Fixes: e7cb072e ("init/initramfs.c: do unpacking asynchronously")
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
      Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
      Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
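      The change in the error seen by early callers can be modelled with a
      toy state machine. All model_ names are hypothetical and the real
      umh.c logic is considerably more involved; the sketch only shows why
      a caller that runs before the framework is enabled now gets -EBUSY
      where it previously got -ENOENT.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model state, not kernel variables. */
static bool model_umh_enabled;       /* set once rootfs population is scheduled */
static bool model_rootfs_has_files;  /* set once the initramfs is unpacked */

static void model_usermodehelper_enable(void)
{
	model_umh_enabled = true;
}

static int model_call_usermodehelper(const char *path)
{
	(void)path;

	/* Framework still off: bail out before doing any setup work. */
	if (!model_umh_enabled)
		return -EBUSY;

	/* Framework on, but there is nothing to execute yet. */
	if (!model_rootfs_has_files)
		return -ENOENT;

	return 0;
}
```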
    • kbuild: Only default to -Werror if COMPILE_TEST · b339ec9c
      Marco Elver authored
      
      The cross-product of the kernel's supported toolchains, architectures,
      and configuration options is large. So large, that it's generally
      accepted to be infeasible to enumerate and build+test them all
      (many compile-testers rely on randomly generated configs).
      
      Without the possibility to enumerate all possible combinations of
      toolchains, architectures, and configuration options, it is inevitable
      that compiler warnings in this space exist.
      
      With -Werror, this means that an innumerable set of kernels are now
      broken, yet had been perfectly usable before (confused compilers, code
      with warnings unused, or luck).
      
      Distributors will necessarily pick a point in the toolchain X arch X
      config space, and if unlucky, will have a broken build. Granted, those
      will likely disable CONFIG_WERROR and move on.
      
      The kernel's default configuration is unlikely to be suitable for all
      users, but it's inappropriate to force many users to set CONFIG_WERROR=n.
      
      This also holds for CI systems which are focused on runtime testing,
      where the odd warning in some subsystem will disrupt testing of the rest
      of the kernel. Many of those runtime-focused CI systems run tests or
      fuzz the kernel using runtime debugging tools. Runtime testing of
      different subsystems can proceed in parallel, and potentially uncover
      serious bugs; halting runtime testing of the entire kernel because of
      the odd warning (now error) in a subsystem or driver is simply
      inappropriate.
      
      Therefore, runtime-focused CI systems will likely choose CONFIG_WERROR=n
      as well.
      
      The appropriate usecase for -Werror is therefore compile-test focused
      builds (often done by developers or CI systems).
      
      Reflect this in the Kconfig option by making the default value of WERROR
      match COMPILE_TEST.
      
      Signed-off-by: Marco Elver <elver@google.com>
      Acked-by: Guenter Roeck <linux@roeck-us.net>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: Mark Brown <broonie@kernel.org>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. Sep 05, 2021
    • Enable '-Werror' by default for all kernel builds · 3fe617cc
      Linus Torvalds authored
      
      ... but make it a config option so that broken environments can disable
      it when required.
      
      We really should always have a clean build, and will disable specific
      over-eager warnings as required, if we can't fix them.  But while I
      fairly religiously enforce that in my own tree, it doesn't get enforced
      by various build robots that don't necessarily report warnings.
      
      So this just makes '-Werror' a default compiler flag, but allows people
      to disable it for their configuration if they have some particular
      issues.
      
      Occasionally, new compiler versions end up enabling new warnings, and it
      can take a while before we have them fixed (or the warnings disabled if
      that is what it takes), so the config option allows for that situation.
      
      Hopefully this will mean that I get fewer pull requests that have new
      warnings that were not noticed by various automation we have in place.
      
      Knock wood.
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. Aug 24, 2021
  6. Aug 23, 2021
  7. Aug 20, 2021
  8. Aug 12, 2021
  9. Aug 03, 2021
  10. Jul 26, 2021
    • printk: remove NMI tracking · 85e3e7fb
      John Ogness authored
      
      All NMI contexts are handled the same as the safe context: store the
      message and defer printing. There is no need to have special NMI
      context tracking for this. Using in_nmi() is enough.
      
      There are several parts of the kernel that are manually calling into
      the printk NMI context tracking in order to cause general printk
      deferred printing:
      
          arch/arm/kernel/smp.c
          arch/powerpc/kexec/crash.c
          kernel/trace/trace.c
      
      For arm/kernel/smp.c and powerpc/kexec/crash.c, provide a new
      function pair printk_deferred_enter/exit that explicitly achieves the
      same objective.
      
      For ftrace, remove the printk context manipulation completely. It was
      added in commit 03fc7f9c ("printk/nmi: Prevent deadlock when
      accessing the main log buffer in NMI"). The purpose was to enforce
      storing messages directly into the ring buffer even in NMI context.
      It really should have only modified the behavior in NMI context.
      There is no need for a special behavior any longer. All messages are
      always stored directly now. The console deferring is handled
      transparently in vprintk().
      
      Signed-off-by: John Ogness <john.ogness@linutronix.de>
      [pmladek@suse.com: Remove special handling in ftrace.c completely.]
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Link: https://lore.kernel.org/r/20210715193359.25946-5-john.ogness@linutronix.de
  11. Jul 19, 2021
    • printk: Userspace format indexing support · 33701557
      Chris Down authored
      
      We have a number of systems industry-wide that have a subset of their
      functionality that works as follows:
      
      1. Receive a message from local kmsg, serial console, or netconsole;
      2. Apply a set of rules to classify the message;
      3. Do something based on this classification (like scheduling a
         remediation for the machine), rinse, and repeat.
      
      As a couple of examples of places we have this implemented just inside
      Facebook, although this isn't a Facebook-specific problem, we have this
      inside our netconsole processing (for alarm classification), and as part
      of our machine health checking. We use these messages to determine
      fairly important metrics around production health, and it's important
      that we get them right.
      
      While for some kinds of issues we have counters, tracepoints, or metrics
      with a stable interface which can reliably indicate the issue, in order
      to react to production issues quickly we need to work with the interface
      which most kernel developers naturally use when developing: printk.
      
      Most production issues come from unexpected phenomena, and as such
      usually the code in question doesn't have easily usable tracepoints or
      other counters available for the specific problem being mitigated. We
      have a number of lines of monitoring defence against problems in
      production (host metrics, process metrics, service metrics, etc), and
      where it's not feasible to reliably monitor at another level, this kind
      of pragmatic netconsole monitoring is essential.
      
      As one would expect, monitoring using printk is rather brittle for a
      number of reasons -- most notably that the message might disappear
      entirely in a new version of the kernel, or that the message may change
      in some way that the regex or other classification methods start to
      silently fail.
      
      One factor that makes this even harder is that, under normal operation,
      many of these messages are never expected to be hit. For example, there
      may be a rare hardware bug which one wants to detect if it was to ever
      happen again, but its recurrence is not likely or anticipated. This
      precludes using something like checking whether the printk in question
      was printed somewhere fleetwide recently to determine whether the
      message in question is still present or not, since we don't anticipate
      that it should be printed anywhere, but still need to monitor for its
      future presence in the long-term.
      
      This class of issue has happened on a number of occasions, causing
      unhealthy machines with hardware issues to remain in production for
      longer than ideal. As a recent example, some monitoring around
      blk_update_request fell out of date and caused semi-broken machines to
      remain in production for longer than would be desirable.
      
      Searching through the codebase to find the message is also extremely
      fragile, because many of the messages are further constructed beyond
      their callsite (eg. btrfs_printk and other module-specific wrappers,
      each with their own functionality). Even if they aren't, guessing the
      format and formulation of the underlying message based on the aesthetics
      of the message emitted is not a recipe for success at scale, and our
      previous issues with fleetwide machine health checking demonstrate as
      much.
      
      This provides a solution to the issue of silently changed or deleted
      printks: we record pointers to all printk format strings known at
      compile time into a new .printk_index section, both in vmlinux and
      modules. At runtime, this can then be iterated by looking at
      <debugfs>/printk/index/<module>, which emits the following format, both
      readable by humans and able to be parsed by machines:
      
          $ head -1 vmlinux; shuf -n 5 vmlinux
          # <level[,flags]> filename:line function "format"
          <5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n"
          <4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n"
          <6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n"
          <6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n"
          <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n"
      
      This mitigates the majority of cases where we have a highly-specific
      printk which we want to match on, as we can now enumerate and check
      whether the format changed or the printk callsite disappeared entirely
      in userspace. This allows us to catch changes to printks we monitor
      earlier and decide what to do about it before it becomes problematic.
      
      There is no additional runtime cost for printk callers or printk itself,
      and the assembly generated is exactly the same.
      
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: John Ogness <john.ogness@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Tested-by: Petr Mladek <pmladek@suse.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
      Acked-by: Jessica Yu <jeyu@kernel.org> # for module.{c,h}
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Link: https://lore.kernel.org/r/e42070983637ac5e384f17fbdbe86d19c7b212a5.1623775748.git.chris@chrisdown.name
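      A userspace consumer of the index format shown above could split each
      line into its fields roughly like this. This is a sketch under stated
      assumptions: struct index_entry and parse_index_line() are hypothetical
      names, the buffer sizes are arbitrary, and the parser assumes the file
      name contains no ':' and that the format string ends at the last double
      quote on the line. The format text is kept in its escaped form.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one index line. */
struct index_entry {
	char level[16];     /* text between '<' and '>', e.g. "5" */
	char file[128];
	int line;
	char function[64];
	char format[256];   /* still escaped, without the surrounding quotes */
};

/* Returns 0 on success, -1 for header lines or lines that do not match. */
static int parse_index_line(const char *s, struct index_entry *e)
{
	int consumed = 0;
	const char *end;
	size_t len;

	if (s[0] == '#')
		return -1;

	/* %n records how far we got, and is only set if the opening quote matched. */
	if (sscanf(s, "<%15[^>]> %127[^:]:%d %63s \"%n",
		   e->level, e->file, &e->line, e->function, &consumed) != 4 ||
	    consumed == 0)
		return -1;

	s += consumed;
	end = strrchr(s, '"');          /* closing quote of the format string */
	if (!end)
		return -1;

	len = (size_t)(end - s);
	if (len >= sizeof(e->format))
		return -1;
	memcpy(e->format, s, len);
	e->format[len] = '\0';
	return 0;
}
```

      A monitoring tool could then compare the (file, function, format)
      triples it depends on against the entries parsed from a new kernel
      before that kernel is rolled out.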
  12. Jul 17, 2021
  13. Jul 08, 2021
  14. Jul 01, 2021
    • init: print out unknown kernel parameters · 86d1919a
      Andrew Halaney authored
      It is easy to foobar setting a kernel parameter on the command line
      without realizing it, and by default there is not much output that you
      can use to assess what the kernel did with that parameter.
      
      Make it a little more explicit which parameters on the command line
      _looked_ like a valid parameter for the kernel, but did not match anything
      and ultimately got tossed to init.  This is very similar to the unknown
      parameter message received when loading a module.
      
      This assumes the parameters are processed in a normal fashion. Some
      parameters (dyndbg= for example) don't register their parameter with the
      rest of the kernel's parameters, and therefore always show up in this list
      (and are also given to init - like the rest of this list).
      
      Another example is BOOT_IMAGE=, which is highlighted as an offender. It
      technically is one, but it is passed by LILO and GRUB, so most systems
      will see that complaint.
      
      An example output where "foobared" and "unrecognized" are intentionally
      invalid parameters:
      
        Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.12-dirty debug log_buf_len=4M foobared unrecognized=foo
        Unknown command line parameters: foobared BOOT_IMAGE=/boot/vmlinuz-5.12-dirty unrecognized=foo
      
      Link: https://lkml.kernel.org/r/20210511211009.42259-1-ahalaney@redhat.com
      
      
      Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Suggested-by: Borislav Petkov <bp@suse.de>
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. Jun 22, 2021
  16. Jun 18, 2021
  17. Jun 10, 2021
  18. Jun 05, 2021
    • pid: take a reference when initializing `cad_pid` · 0711f0d7
      Mark Rutland authored
      During boot, kernel_init_freeable() initializes `cad_pid` to the init
      task's struct pid.  Later on, we may change `cad_pid` via a sysctl, and
      when this happens proc_do_cad_pid() will increment the refcount on the
      new pid via get_pid(), and will decrement the refcount on the old pid
      via put_pid().  As we never called get_pid() when we initialized
      `cad_pid`, we decrement a reference we never incremented, and can therefore
      free the init task's struct pid early.  As there can be dangling
      references to the struct pid, we can later encounter a use-after-free
      (e.g.  when delivering signals).
      
      This was spotted when fuzzing v5.13-rc3 with Syzkaller, but seems to
      have been around since the conversion of `cad_pid` to struct pid in
      commit 9ec52099 ("[PATCH] replace cad_pid by a struct pid") from the
      pre-KASAN stone age of v2.6.19.
      
      Fix this by getting a reference to the init task's struct pid when we
      assign it to `cad_pid`.
      
      Full KASAN splat below.
      
         ==================================================================
         BUG: KASAN: use-after-free in ns_of_pid include/linux/pid.h:153 [inline]
         BUG: KASAN: use-after-free in task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
         Read of size 4 at addr ffff23794dda0004 by task syz-executor.0/273
      
         CPU: 1 PID: 273 Comm: syz-executor.0 Not tainted 5.12.0-00001-g9aef892b2d15 #1
         Hardware name: linux,dummy-virt (DT)
         Call trace:
          ns_of_pid include/linux/pid.h:153 [inline]
          task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
          do_notify_parent+0x308/0xe60 kernel/signal.c:1950
          exit_notify kernel/exit.c:682 [inline]
          do_exit+0x2334/0x2bd0 kernel/exit.c:845
          do_group_exit+0x108/0x2c8 kernel/exit.c:922
          get_signal+0x4e4/0x2a88 kernel/signal.c:2781
          do_signal arch/arm64/kernel/signal.c:882 [inline]
          do_notify_resume+0x300/0x970 arch/arm64/kernel/signal.c:936
          work_pending+0xc/0x2dc
      
         Allocated by task 0:
          slab_post_alloc_hook+0x50/0x5c0 mm/slab.h:516
          slab_alloc_node mm/slub.c:2907 [inline]
          slab_alloc mm/slub.c:2915 [inline]
          kmem_cache_alloc+0x1f4/0x4c0 mm/slub.c:2920
          alloc_pid+0xdc/0xc00 kernel/pid.c:180
          copy_process+0x2794/0x5e18 kernel/fork.c:2129
          kernel_clone+0x194/0x13c8 kernel/fork.c:2500
          kernel_thread+0xd4/0x110 kernel/fork.c:2552
          rest_init+0x44/0x4a0 init/main.c:687
          arch_call_rest_init+0x1c/0x28
          start_kernel+0x520/0x554 init/main.c:1064
          0x0
      
         Freed by task 270:
          slab_free_hook mm/slub.c:1562 [inline]
          slab_free_freelist_hook+0x98/0x260 mm/slub.c:1600
          slab_free mm/slub.c:3161 [inline]
          kmem_cache_free+0x224/0x8e0 mm/slub.c:3177
          put_pid.part.4+0xe0/0x1a8 kernel/pid.c:114
          put_pid+0x30/0x48 kernel/pid.c:109
          proc_do_cad_pid+0x190/0x1b0 kernel/sysctl.c:1401
          proc_sys_call_handler+0x338/0x4b0 fs/proc/proc_sysctl.c:591
          proc_sys_write+0x34/0x48 fs/proc/proc_sysctl.c:617
          call_write_iter include/linux/fs.h:1977 [inline]
          new_sync_write+0x3ac/0x510 fs/read_write.c:518
          vfs_write fs/read_write.c:605 [inline]
          vfs_write+0x9c4/0x1018 fs/read_write.c:585
          ksys_write+0x124/0x240 fs/read_write.c:658
          __do_sys_write fs/read_write.c:670 [inline]
          __se_sys_write fs/read_write.c:667 [inline]
          __arm64_sys_write+0x78/0xb0 fs/read_write.c:667
          __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
          invoke_syscall arch/arm64/kernel/syscall.c:49 [inline]
          el0_svc_common.constprop.1+0x16c/0x388 arch/arm64/kernel/syscall.c:129
          do_el0_svc+0xf8/0x150 arch/arm64/kernel/syscall.c:168
          el0_svc+0x28/0x38 arch/arm64/kernel/entry-common.c:416
          el0_sync_handler+0x134/0x180 arch/arm64/kernel/entry-common.c:432
          el0_sync+0x154/0x180 arch/arm64/kernel/entry.S:701
      
         The buggy address belongs to the object at ffff23794dda0000
          which belongs to the cache pid of size 224
         The buggy address is located 4 bytes inside of
          224-byte region [ffff23794dda0000, ffff23794dda00e0)
         The buggy address belongs to the page:
         page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dda0
         head:(____ptrval____) order:1 compound_mapcount:0
         flags: 0x3fffc0000010200(slab|head)
         raw: 03fffc0000010200 dead000000000100 dead000000000122 ffff23794d40d080
         raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000
         page dumped because: kasan: bad access detected
      
         Memory state around the buggy address:
          ffff23794dd9ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
          ffff23794dd9ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
         >ffff23794dda0000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                            ^
          ffff23794dda0080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
          ffff23794dda0100: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
         ==================================================================
      
      Link: https://lkml.kernel.org/r/20210524172230.38715-1-mark.rutland@arm.com
      
      
      Fixes: 9ec52099 ("[PATCH] replace cad_pid by a struct pid")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
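      The refcount imbalance behind this splat can be reproduced with a toy
      model. The model_ types and functions below are hypothetical stand-ins
      for struct pid, get_pid(), put_pid() and the sysctl handler; the
      take_ref flag exists only to contrast the buggy and the fixed
      initialization in one sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pid: a counter plus a "freed" marker. */
struct model_pid {
	int count;
	bool freed;
};

static struct model_pid *model_get_pid(struct model_pid *p)
{
	if (p)
		p->count++;
	return p;
}

static void model_put_pid(struct model_pid *p)
{
	if (p && --p->count == 0)
		p->freed = true;   /* the real code would free the object here */
}

static struct model_pid *model_cad_pid;

/* take_ref == false models the old initialization, true the fixed one. */
static void model_init_cad_pid(struct model_pid *init_pid, bool take_ref)
{
	model_cad_pid = take_ref ? model_get_pid(init_pid) : init_pid;
}

/* Models the sysctl path: reference the new pid, drop the old one. */
static void model_set_cad_pid(struct model_pid *new_pid)
{
	struct model_pid *old = model_cad_pid;

	model_cad_pid = model_get_pid(new_pid);
	model_put_pid(old);
}
```

      Starting from a count of 1 (the reference held on behalf of the init
      task), the old initialization lets a single sysctl write drop the count
      to zero while the task still uses the object. With the extra reference
      taken at initialization, the same write leaves the count at 1.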
  19. Jun 01, 2021
  20. May 26, 2021
  21. May 12, 2021
    • sched/core: Initialize the idle task with preemption disabled · f1a0a376
      Valentin Schneider authored and Ingo Molnar committed
      
      As pointed out by commit
      
        de9b8f5d ("sched: Fix crash trying to dequeue/enqueue the idle thread")
      
      init_idle() can and will be invoked more than once on the same idle
      task. At boot time, it is invoked for the boot CPU thread by
      sched_init(). Then smp_init() creates the threads for all the secondary
      CPUs and invokes init_idle() on them.
      
      As the hotplug machinery brings the secondaries to life, it will issue
      calls to idle_thread_get(), which itself invokes init_idle() yet again.
      In this case it's invoked twice more per secondary: at _cpu_up(), and at
      bringup_cpu().
      
      Given smp_init() already initializes the idle tasks for all *possible*
      CPUs, no further initialization should be required. Now, removing
      init_idle() from idle_thread_get() exposes some interesting expectations
      with regards to the idle task's preempt_count: the secondary startup always
      issues a preempt_disable(), requiring some reset of the preempt count to 0
      between hot-unplug and hotplug, which is currently served by
      idle_thread_get() -> init_idle().
      
      Given the idle task is supposed to have preemption disabled once and never
      see it re-enabled, it seems that what we actually want is to initialize its
      preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
      init_idle() from idle_thread_get().
      
      Secondary startups were patched via coccinelle:
      
        @begone@
        @@
      
        -preempt_disable();
        ...
        cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
      
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
  22. May 11, 2021
  23. May 10, 2021
    • srcu: Initialize SRCU after timers · 8e9c01c7
      Frederic Weisbecker authored
      
      Once srcu_init() is called, the SRCU core will make use of delayed
      workqueues, which rely on timers.  However init_timers() is called
      several steps after rcu_init().  This means that a call_srcu() after
      rcu_init() but before init_timers() would find itself within a dangerously
      uninitialized timer core.
      
      This commit therefore creates a separate call to srcu_init() after
      init_timers() completes, which ensures that we stay in early SRCU mode
      until timers are safe(r).
      
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  24. May 07, 2021
    • modules: add CONFIG_MODPROBE_PATH · 17652f42
      Rasmus Villemoes authored
      Allow the developer to specify the initial value of the modprobe_path[]
      string.  This can be used to set it to the empty string initially, thus
      effectively disabling request_module() during early boot until userspace
      writes a new value via the /proc/sys/kernel/modprobe interface.  [1]
      
      When building a custom kernel (often for an embedded target), it's normal
      to build everything into the kernel that is needed for booting, and indeed
      the initramfs often contains no modules at all, so every such
      request_module() done before userspace init has mounted the real rootfs is
      a waste of time.
      
      This is particularly useful when combined with the previous patch, which
      made the initramfs unpacking asynchronous - for that to work, it had to
      make any usermodehelper call wait for the unpacking to finish before
      attempting to invoke the userspace helper.  By eliminating all such
      (known-to-be-futile) calls of usermodehelper, the initramfs unpacking and
      the {device,late}_initcalls can proceed in parallel for much longer.
      
      For a relatively slow ppc board I'm working on, the two patches combined
      lead to 0.2s faster boot - but more importantly, the fact that the
      initramfs unpacking proceeds completely in the background while devices
      get probed means I get to handle the gpio watchdog in time without getting
      reset.
      
      [1] __request_module() already has an early -ENOENT return when
      modprobe_path is the empty string.
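
      The early-return behaviour that footnote [1] relies on can be sketched
      in userspace as follows.  The function names model_request_module() and
      model_set_modprobe_path() and the helper_spawns counter are invented for
      illustration; only CONFIG_MODPROBE_PATH, modprobe_path[] and the -ENOENT
      result come from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Build-time default; override with -DCONFIG_MODPROBE_PATH='""' to model
 * a kernel configured with an empty path. */
#ifndef CONFIG_MODPROBE_PATH
#define CONFIG_MODPROBE_PATH "/sbin/modprobe"
#endif

char modprobe_path[256] = CONFIG_MODPROBE_PATH;
int helper_spawns; /* counts how often a (pretend) usermode helper is started */

/* Hypothetical stand-in for a write to /proc/sys/kernel/modprobe. */
void model_set_modprobe_path(const char *value)
{
    snprintf(modprobe_path, sizeof(modprobe_path), "%s", value);
}

/* Hypothetical stand-in for the module request path. */
int model_request_module(const char *name)
{
    (void)name;
    if (!modprobe_path[0])
        return -ENOENT;   /* empty path: bail out before spawning anything */
    helper_spawns++;      /* otherwise a helper would be invoked here */
    return 0;
}
```

      With an empty path no helper is counted, so nothing would have to wait
      for the initramfs; once userspace writes a real path, requests go
      through as before.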
      
      Link: https://lkml.kernel.org/r/20210313212528.2956377-3-linux@rasmusvillemoes.dk
      
      
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Jessica Yu <jeyu@kernel.org>
      Acked-by: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • init/initramfs.c: do unpacking asynchronously · e7cb072e
      Rasmus Villemoes authored
      Patch series "background initramfs unpacking, and CONFIG_MODPROBE_PATH", v3.
      
      These two patches are independent, but better-together.
      
      The second is a rather trivial patch that simply allows the developer to
      change "/sbin/modprobe" to something else - e.g.  the empty string, so
      that all request_module() during early boot return -ENOENT early, without
      even spawning a usermode helper, needlessly synchronizing with the
      initramfs unpacking.
      
      The first patch delegates decompressing the initramfs to a worker thread,
      allowing do_initcalls() in main.c to proceed to the device_ and late_
      initcalls without waiting for that decompression (and populating of
      rootfs) to finish.  Obviously, some of those later calls may rely on the
      initramfs being available, so I've added synchronization points in the
      firmware loader and usermodehelper paths - there might be other places
      that would need this, but so far no one has been able to think of any
      places I have missed.
      
      There's not much to win if most of the functionality needed during boot is
      only available as modules.  But systems with a custom-made .config and
      initramfs can boot faster, partly due to utilizing more than one cpu
      earlier, partly by avoiding known-futile modprobe calls (which would still
      trigger synchronization with the initramfs unpacking, thus eliminating
      most of the first benefit).
      
      This patch (of 2):
      
      Most of the boot process doesn't actually need anything from the
      initramfs, until of course PID1 is to be executed.  So instead of doing
      the decompressing and populating of the initramfs synchronously in
      populate_rootfs() itself, push that off to a worker thread.
      
      This is primarily motivated by an embedded ppc target, where unpacking
      even the rather modest sized initramfs takes 0.6 seconds, which is long
      enough that the external watchdog becomes unhappy that it doesn't get
      attention soon enough.  By doing the initramfs decompression in a worker
      thread, we get to do the device_initcalls and hence start petting the
      watchdog much sooner.
      
      Normal desktops might benefit as well.  On my mostly stock Ubuntu kernel,
      my initramfs is a 26M xz-compressed blob, decompressing to around 126M.
      That takes almost two seconds:
      
      [    0.201454] Trying to unpack rootfs image as initramfs...
      [    1.976633] Freeing initrd memory: 29416K
      
      Before this patch, these lines occur consecutively in dmesg.  With this
      patch, the timestamps on these two lines are roughly the same as above, but
      with 172 lines in between - so more than one cpu has been kept busy doing
      work that would otherwise only happen after populate_rootfs()
      finished.
      
      Should one of the initcalls done after rootfs_initcall time (i.e., device_
      and late_ initcalls) need something from the initramfs (say, a kernel
      module or a firmware blob), it will simply wait for the initramfs
      unpacking to be done before proceeding, which should in theory make this
      completely safe.
      
      But if some driver pokes around in the filesystem directly and not via one
      of the official kernel interfaces (i.e.  request_firmware*(),
      call_usermodehelper*) that theory may not hold - also, I certainly might
      have missed a spot when sprinkling wait_for_initramfs().  So there is an
      escape hatch in the form of an initramfs_async= command line parameter.
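
      The synchronization contract described above can be modelled in a few
      lines.  This is a deliberately single-threaded sketch with hypothetical
      model_* names: the real change uses a worker thread, whereas here the
      deferred unpacking simply runs the first time somebody waits for it.
      What it illustrates is the rule that initcalls which do not need the
      rootfs proceed immediately, while those that do must pass through a
      wait point first, and that the async flag (standing in for
      initramfs_async=) restores the old synchronous behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model; not kernel code. */
bool unpack_scheduled;
bool unpack_done;
int initcalls_before_unpack_done; /* work that overlapped with unpacking */

/* Stands in for decompressing and populating the rootfs. */
void model_do_populate_rootfs(void) { unpack_done = true; }

/* Wait point: returns only once the unpacking has completed. */
void model_wait_for_initramfs(void)
{
    if (unpack_scheduled && !unpack_done)
        model_do_populate_rootfs();
}

/* Only schedules the unpacking; async == false models the escape hatch. */
void model_populate_rootfs(bool async)
{
    unpack_scheduled = true;
    if (!async)
        model_wait_for_initramfs();
}

/* A device_ or late_ initcall that may or may not need the rootfs. */
void model_initcall(bool needs_rootfs)
{
    if (needs_rootfs)
        model_wait_for_initramfs();
    if (!unpack_done)
        initcalls_before_unpack_done++;
}
```

      A driver that reads the filesystem without going through such a wait
      point is exactly the case the model cannot protect, which is why the
      command line parameter exists.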
      
      Link: https://lkml.kernel.org/r/20210313212528.2956377-1-linux@rasmusvillemoes.dk
      Link: https://lkml.kernel.org/r/20210313212528.2956377-2-linux@rasmusvillemoes.dk
      
      
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. May 05, 2021
    • userfaultfd: add minor fault registration mode · 7677f7fd
      Axel Rasmussen authored
      Patch series "userfaultfd: add minor fault handling", v9.
      
      Overview
      ========
      
      This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
      When enabled (via the UFFDIO_API ioctl), this feature means that any
      hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
      get events for "minor" faults.  By "minor" fault, I mean the following
      situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
      memory).  One of the mappings is registered with userfaultfd (in minor
      mode), and the other is not.  Via the non-UFFD mapping, the underlying
      pages have already been allocated & filled with some contents.  The UFFD
      mapping has not yet been faulted in; when it is touched for the first
      time, this results in what I'm calling a "minor" fault.  As a concrete
      example, when working with hugetlbfs, we have huge_pte_none(), but
      find_lock_page() finds an existing page.
      
      We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
      is, userspace resolves the fault by either a) doing nothing if the
      contents are already correct, or b) updating the underlying contents using
      the second, non-UFFD mapping (via memcpy/memset or similar, or something
      fancier like RDMA, etc.).  In either case, userspace issues
      UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
      correct, carry on setting up the mapping".
      
      Use Case
      ========
      
      Consider the use case of VM live migration (e.g. under QEMU/KVM):
      
      1. While a VM is still running, we copy the contents of its memory to a
         target machine. The pages are populated on the target by writing to the
         non-UFFD mapping, using the setup described above. The VM is still running
         (and therefore its memory is likely changing), so this may be repeated
         several times, until we decide the target is "up to date enough".
      
      2. We pause the VM on the source, and start executing on the target machine.
         During this gap, the VM's user(s) will *see* a pause, so it is desirable to
         minimize this window.
      
      3. Between the last time any page was copied from the source to the target, and
         when the VM was paused, the contents of that page may have changed - and
         therefore the copy we have on the target machine is out of date. Although we
         can keep track of which pages are out of date, for VMs with large amounts of
         memory, it is "slow" to transfer this information to the target machine. We
         want to resume execution before such a transfer would complete.
      
      4. So, the guest begins executing on the target machine. The first time it
         touches its memory (via the UFFD-registered mapping), userspace wants to
         intercept this fault. Userspace checks whether or not the page is up to date,
         and if not, copies the updated page from the source machine, via the non-UFFD
         mapping. Finally, whether a copy was performed or not, userspace issues a
         UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
         are correct, carry on setting up the mapping".
      
      We don't have to do all of the final updates on-demand. The userfaultfd manager
      can, in the background, also copy over updated pages once it receives the map of
      which pages are up-to-date or not.
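
      Step 4 of the use case can be sketched as a plain C model.  Nothing
      here touches the real userfaultfd API: struct target_state, the stale[]
      bitmap and resolve_minor_fault() are hypothetical, the page size is
      shrunk to 16 bytes, and setting mapped[] merely stands in for issuing
      UFFDIO_CONTINUE.  The point is the decision flow: copy only if the page
      is out of date, then always tell the kernel to carry on.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of the target machine; sizes are arbitrary. */
#define PAGE_SZ 16
#define NPAGES 4

struct target_state {
    unsigned char mem[NPAGES][PAGE_SZ]; /* written via the non-UFFD mapping */
    bool stale[NPAGES];                 /* page changed on the source after last copy */
    bool mapped[NPAGES];                /* stands in for "UFFDIO_CONTINUE was issued" */
};

/*
 * Resolve a minor fault on one page.  Returns 1 if a copy from the source
 * was needed, 0 if the contents were already correct, -1 on a bad index.
 */
int resolve_minor_fault(struct target_state *t,
                        unsigned char (*source)[PAGE_SZ], int page)
{
    int copied = 0;

    if (page < 0 || page >= NPAGES)
        return -1;

    if (t->stale[page]) {
        /* Update the contents through the second, non-UFFD mapping. */
        memcpy(t->mem[page], source[page], PAGE_SZ);
        t->stale[page] = false;
        copied = 1;
    }

    /* Whether or not a copy happened, let the mapping be set up. */
    t->mapped[page] = true;
    return copied;
}
```

      A background thread in the userfaultfd manager could call the same
      function for pages that have not faulted yet, matching the note that
      not all updates have to happen on demand.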
      
      Interaction with Existing APIs
      ==============================
      
      Because this is a feature, a registered VMA could potentially receive both
      missing and minor faults.  I spent some time thinking through how the
      existing API interacts with the new feature:
      
      UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
      allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:
      
      - For non-shared memory or shmem, -EINVAL is returned.
      - For hugetlb, -EFAULT is returned.
      
      UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
      Without modifications, the existing codepath assumes a new page needs to
      be allocated.  This is okay, since userspace must have a second
      non-UFFD-registered mapping anyway, thus there isn't much reason to want
      to use these in any case (just memcpy or memset or similar).
      
      - If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
      - If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
        in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
      - UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
        -ENOENT in that case (regardless of the kind of fault).
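
      The error cases listed above can be condensed into one lookup
      function.  This is a reading aid under stated assumptions, not kernel
      code: the enums and expected_errno() are hypothetical, and a return
      value of 0 only means "the text above does not list an error for this
      combination", not that the operation is guaranteed to succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical encoding of the cases listed in the commit message. */
enum uffd_op { OP_CONTINUE, OP_COPY, OP_ZEROPAGE, OP_WRITEPROTECT };
enum fault_kind { FAULT_MISSING, FAULT_MINOR };

int expected_errno(enum uffd_op op, enum fault_kind fault,
                   bool hugetlb, bool shared)
{
    switch (op) {
    case OP_CONTINUE:
        /* Cannot resolve non-minor faults: no new page is allocated. */
        if (fault != FAULT_MINOR)
            return hugetlb ? -EFAULT : -EINVAL;
        return 0;
    case OP_COPY:
        /* The existing path assumes a new page must be allocated. */
        return fault == FAULT_MINOR ? -EEXIST : 0;
    case OP_ZEROPAGE:
        if (fault == FAULT_MINOR)
            return hugetlb ? -EINVAL : -EEXIST;
        return 0;
    case OP_WRITEPROTECT:
        /* Does not work with shared memory, regardless of fault kind. */
        return shared ? -ENOENT : 0;
    }
    return 0;
}
```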
      
      Future Work
      ===========
      
      This series only supports hugetlbfs.  I have a second series in flight to
      support shmem as well, extending the functionality.  This series is more
      mature than the shmem support at this point, and the functionality works
      fully on hugetlbfs, so this series can be merged first and then shmem
      support will follow.
      
      This patch (of 6):
      
      This feature allows userspace to intercept "minor" faults.  By "minor"
      faults, I mean the following situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
      mappings is registered with userfaultfd (in minor mode), and the other is
      not.  Via the non-UFFD mapping, the underlying pages have already been
      allocated & filled with some contents.  The UFFD mapping has not yet been
      faulted in; when it is touched for the first time, this results in what
      I'm calling a "minor" fault.  As a concrete example, when working with
      hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
      page.
      
      This commit adds the new registration mode, and sets the relevant flag on
      the VMAs being registered.  In the hugetlb fault path, if we find that we
      have huge_pte_none(), but find_lock_page() does indeed find an existing
      page, then we have a "minor" fault, and if the VMA has the userfaultfd
      registration flag, we call into userfaultfd to handle it.
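
      The classification made in the fault path can be written down as a
      small decision function.  This is a sketch with invented names
      (enum fault_action, classify_fault()); the two boolean inputs stand in
      for the results of huge_pte_none() and find_lock_page(), and the two
      mode flags stand in for the VMA's userfaultfd registration flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of looking at one fault. */
enum fault_action {
    HANDLE_NORMALLY,  /* userfaultfd is not involved */
    DELIVER_MISSING,  /* no page exists yet, VMA registered in missing mode */
    DELIVER_MINOR     /* page exists but is not mapped, VMA in minor mode */
};

enum fault_action classify_fault(bool pte_none, bool page_in_cache,
                                 bool vma_missing_mode, bool vma_minor_mode)
{
    if (!pte_none)
        return HANDLE_NORMALLY;      /* mapping already present */

    if (page_in_cache)               /* contents exist, only the mapping is absent */
        return vma_minor_mode ? DELIVER_MINOR : HANDLE_NORMALLY;

    /* No page at all: the pre-existing "missing" case. */
    return vma_missing_mode ? DELIVER_MISSING : HANDLE_NORMALLY;
}
```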
      
      This is implemented as a new registration mode, instead of an API feature.
      This is because the alternative implementation has significant drawbacks
      [1].
      
      However, doing it this way requires we allocate a VM_* flag for the new
      registration mode.  On 32-bit systems, there are no unused bits, so this
      feature is only supported on architectures with
      CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
      MINOR mode on 32-bit architectures, we return -EINVAL.
      
      [1] https://lore.kernel.org/patchwork/patch/1380226/
      
      [peterx@redhat.com: fix minor fault page leak]
        Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
      
      
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutný" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  26. Apr 30, 2021
  27. Apr 27, 2021
    • bpf: Implement formatted output helpers with bstr_printf · 48cac3f4
      Florent Revest authored
      
      BPF has three formatted output helpers: bpf_trace_printk, bpf_seq_printf
      and bpf_snprintf. Their signatures specify that all arguments are
      provided from the BPF world as u64s (in an array or as registers). All
      of these helpers are currently implemented by calling functions such as
      snprintf() whose signatures take a variable number of arguments, then
      placed in a va_list by the compiler to call vsnprintf().
      
      "d9c9e4db bpf: Factorize bpf_trace_printk and bpf_seq_printf" introduced
      a bpf_printf_prepare function that fills an array of u64 sanitized
      arguments with an array of "modifiers" which indicate what the "real"
      size of each argument should be (given by the format specifier). The
      BPF_CAST_FMT_ARG macro consumes these arrays and casts each argument to
      its real size. However, the C promotion rules implicitly cast them all
      back to u64s. Therefore, the arguments given to snprintf are u64s and
      the va_list constructed by the compiler will use 64 bits for each
      argument. On 64 bit machines, this happens to work well because 32 bit
      arguments in va_lists need to occupy 64 bits anyway, but on 32 bit
      architectures this breaks the layout of the va_list expected by the
      called function and mangles values.
      
      In "88a5c690 bpf: fix bpf_trace_printk on 32 bit archs", this problem
      had been solved for bpf_trace_printk only with a "horrid workaround"
      that emitted multiple calls to trace_printk where each call had
      different argument types and generated different va_list layouts. One of
      the calls would be dynamically chosen at runtime. This was ok with the 3
      arguments that bpf_trace_printk takes but bpf_seq_printf and
      bpf_snprintf accept up to 12 arguments. Because this approach scales
      code exponentially, it is not a viable option anymore.
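
      To put a number on "exponentially": assuming each argument can take
      one of two layouts (an assumption made for this illustration, matching
      a 32-bit versus 64-bit choice per argument), the count of distinct call
      sites needed is two to the power of the argument count.  The helper
      name call_variants() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct call sites needed when every argument can
 * independently take one of layouts_per_arg layouts. */
uint64_t call_variants(unsigned int nargs, unsigned int layouts_per_arg)
{
    uint64_t n = 1;

    while (nargs--)
        n *= layouts_per_arg;
    return n;
}
```

      Under that assumption, call_variants(3, 2) is 8 while
      call_variants(12, 2) is 4096.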
      
      Because the promotion rules are part of the language and because the
      construction of a va_list is an arch-specific ABI, it's best to just
      avoid variadic arguments and va_lists altogether. Thankfully the
      kernel's snprintf() has an alternative in the form of bstr_printf() that
      accepts arguments in a "binary buffer representation". These binary
      buffers are currently created by vbin_printf and used in the tracing
      subsystem to split the cost of printing into two parts: a fast one that
      only dereferences and remembers values, and a slower one, called later,
      that does the pretty-printing.
      
      This patch refactors bpf_printf_prepare to construct binary buffers of
      arguments consumable by bstr_printf() instead of arrays of arguments and
      modifiers. This gets rid of BPF_CAST_FMT_ARG and greatly simplifies the
      bpf_printf_prepare usage but there are a few gotchas that change how
      bpf_printf_prepare needs to do things.
      
      Currently, bpf_printf_prepare uses a per cpu temporary buffer as a
      generic storage for strings and IP addresses. With this refactoring, the
      temporary buffers now hold all the arguments in a structured binary
      format.
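
      The idea of a binary argument buffer can be illustrated in userspace.
      This is a sketch only: struct bin_buf, bin_put() and bin_get() are
      hypothetical, and the natural-alignment rule used here is a
      simplification that should not be taken as the exact layout produced by
      vbin_printf() or consumed by bstr_printf().  It shows each u64 coming
      from BPF being stored at the width its format specifier calls for, so
      no variadic call and no va_list layout is involved.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size argument buffer. */
struct bin_buf {
    unsigned char data[64];
    size_t len; /* bytes used so far */
};

static size_t align_up(size_t off, size_t align)
{
    return (off + align - 1) & ~(align - 1);
}

/* Store one argument at its "real" width (4 or 8 bytes). */
int bin_put(struct bin_buf *b, uint64_t raw, size_t size)
{
    size_t off;

    if (size != 4 && size != 8)
        return -1;
    off = align_up(b->len, size);
    if (off + size > sizeof(b->data))
        return -1;

    memset(b->data + b->len, 0, off - b->len); /* zero the padding */
    if (size == 4) {
        uint32_t v = (uint32_t)raw;            /* keep only the low 32 bits */
        memcpy(b->data + off, &v, sizeof(v));
    } else {
        memcpy(b->data + off, &raw, sizeof(raw));
    }
    b->len = off + size;
    return 0;
}

/* Read the next argument back, using the same width and alignment. */
int bin_get(const struct bin_buf *b, size_t *pos, size_t size, uint64_t *out)
{
    size_t off;

    if (size != 4 && size != 8)
        return -1;
    off = align_up(*pos, size);
    if (off + size > b->len)
        return -1;

    if (size == 4) {
        uint32_t v;
        memcpy(&v, b->data + off, sizeof(v));
        *out = v;
    } else {
        memcpy(out, b->data + off, sizeof(*out));
    }
    *pos = off + size;
    return 0;
}
```

      The writer and the reader agree on the layout purely through the
      sequence of sizes, which in the real code is derived from the format
      string rather than passed in explicitly.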
      
      To comply with the format expected by bstr_printf, certain format
      specifiers also need to be pre-formatted: %pB and %pi6/%pi4/%pI4/%pI6.
      Because vsnprintf subroutines for these specifiers are hard to expose,
      we pre-format these arguments with calls to snprintf().
      
      Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Florent Revest <revest@chromium.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210427174313.860948-3-revest@chromium.org
  28. Apr 24, 2021