• Kan Liang's avatar
    perf/x86/intel: Fix PEBS warning by only restoring active PMU in pmi · c3d266c8
    Kan Liang authored
    This patch tries to fix a PEBS warning found in my stress test. The
    following perf command can easily trigger the pebs warning or spurious
    NMI error on Skylake/Broadwell/Haswell platforms:
    
      sudo perf record -e 'cpu/umask=0x04,event=0xc4/pp,cycles,branches,ref-cycles,cache-misses,cache-references' --call-graph fp -b -c1000 -a
    
    Also the NMI watchdog must be enabled.
    
    For this case, the events number is larger than counter number. So
    perf has to do multiplexing.
    
    In perf_mux_hrtimer_handler, it does perf_pmu_disable(), schedule out
    old events, rotate_ctx, schedule in new events and finally
    perf_pmu_enable().
    
    If the old events include precise event, the MSR_IA32_PEBS_ENABLE
    should be cleared when perf_pmu_disable().  The MSR_IA32_PEBS_ENABLE
    should keep 0 until the perf_pmu_enable() is called and the new event is
    precise event.
    
    However, there is a corner case which could restore PEBS_ENABLE to
    stale value during the above period. In perf_pmu_disable(), GLOBAL_CTRL
    will be set to 0 to stop overflow and followed PMI. But there may be
    pending PMI from an earlier overflow, which cannot be stopped. So even
    GLOBAL_CTRL is cleared, the kernel still be possible to get PMI. At
    the end of the PMI handler, __intel_pmu_enable_all() will be called,
    which will restore the stale values if old events haven't scheduled
    out.
    
    Once the stale pebs value is set, it's impossible to be corrected if
    the new events are non-precise. Because the pebs_enabled will be set
    to 0. x86_pmu.enable_all() will ignore the MSR_IA32_PEBS_ENABLE
    setting. As a result, the following NMI with stale PEBS_ENABLE
    trigger pebs warning.
    
    The pending PMI after enabled=0 will become harmless if the NMI handler
    does not change the state. This patch checks cpuc->enabled in pmi and
    only restore the state when PMU is active.
    
    Here is the dump:
    
      Call Trace:
       <NMI>  [<ffffffff813c3a2e>] dump_stack+0x63/0x85
       [<ffffffff810a46f2>] warn_slowpath_common+0x82/0xc0
       [<ffffffff810a483a>] warn_slowpath_null+0x1a/0x20
       [<ffffffff8100fe2e>] intel_pmu_drain_pebs_nhm+0x2be/0x320
       [<ffffffff8100caa9>] intel_pmu_handle_irq+0x279/0x460
       [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
       [<ffffffff811f290d>] ? vunmap_page_range+0x20d/0x330
       [<ffffffff811f2f11>] ?  unmap_kernel_range_noflush+0x11/0x20
       [<ffffffff8148379f>] ? ghes_copy_tofrom_phys+0x10f/0x2a0
       [<ffffffff814839c8>] ? ghes_read_estatus+0x98/0x170
       [<ffffffff81005a7d>] perf_event_nmi_handler+0x2d/0x50
       [<ffffffff810310b9>] nmi_handle+0x69/0x120
       [<ffffffff810316f6>] default_do_nmi+0xe6/0x100
       [<ffffffff810317f2>] do_nmi+0xe2/0x130
       [<ffffffff817aea71>] end_repeat_nmi+0x1a/0x1e
       [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
       [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
       [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
       <<EOE>>  <IRQ>  [<ffffffff81006df8>] ?  x86_perf_event_set_period+0xd8/0x180
       [<ffffffff81006eec>] x86_pmu_start+0x4c/0x100
       [<ffffffff8100722d>] x86_pmu_enable+0x28d/0x300
       [<ffffffff811994d7>] perf_pmu_enable.part.81+0x7/0x10
       [<ffffffff8119cb70>] perf_mux_hrtimer_handler+0x200/0x280
       [<ffffffff8119c970>] ?  __perf_install_in_context+0xc0/0xc0
       [<ffffffff8110f92d>] __hrtimer_run_queues+0xfd/0x280
       [<ffffffff811100d8>] hrtimer_interrupt+0xa8/0x190
       [<ffffffff81199080>] ?  __perf_read_group_add.part.61+0x1a0/0x1a0
       [<ffffffff81051bd8>] local_apic_timer_interrupt+0x38/0x60
       [<ffffffff817af01d>] smp_apic_timer_interrupt+0x3d/0x50
       [<ffffffff817ad15c>] apic_timer_interrupt+0x8c/0xa0
       <EOI>  [<ffffffff81199080>] ?  __perf_read_group_add.part.61+0x1a0/0x1a0
       [<ffffffff81123de5>] ?  smp_call_function_single+0xd5/0x130
       [<ffffffff81123ddb>] ?  smp_call_function_single+0xcb/0x130
       [<ffffffff81199080>] ?  __perf_read_group_add.part.61+0x1a0/0x1a0
       [<ffffffff8119765a>] event_function_call+0x10a/0x120
       [<ffffffff8119c660>] ? ctx_resched+0x90/0x90
       [<ffffffff811971e0>] ? cpu_clock_event_read+0x30/0x30
       [<ffffffff811976d0>] ? _perf_event_disable+0x60/0x60
       [<ffffffff8119772b>] _perf_event_enable+0x5b/0x70
       [<ffffffff81197388>] perf_event_for_each_child+0x38/0xa0
       [<ffffffff811976d0>] ? _perf_event_disable+0x60/0x60
       [<ffffffff811a0ffd>] perf_ioctl+0x12d/0x3c0
       [<ffffffff8134d855>] ? selinux_file_ioctl+0x95/0x1e0
       [<ffffffff8124a3a1>] do_vfs_ioctl+0xa1/0x5a0
       [<ffffffff81036d29>] ? sched_clock+0x9/0x10
       [<ffffffff8124a919>] SyS_ioctl+0x79/0x90
       [<ffffffff817ac4b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
      ---[ end trace aef202839fe9a71d ]---
      Uhhuh. NMI received for unknown reason 2d on CPU 2.
      Do you have a strange power saving mode enabled?
    Signed-off-by: default avatarKan Liang <kan.liang@intel.com>
    Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
    Cc: <stable@vger.kernel.org>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephane Eranian <eranian@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vince Weaver <vincent.weaver@maine.edu>
    Link: http://lkml.kernel.org/r/1457046448-6184-1-git-send-email-kan.liang@intel.com
    [ Fixed various typos and other small details. ]
    Signed-off-by: Ingo Molnar's avatarIngo Molnar <mingo@kernel.org>
    c3d266c8
core.c 105 KB