Forked from gfx-ci / linux
    hung_task: add detect count for hung tasks · 03ecb24d
    Lance Yang authored
    Patch series "add detect count for hung tasks", v2.
    
    This patchset adds a counter, hung_task_detect_count, to track the number
    of times hung tasks are detected.  
    
    IMHO, hung tasks are a critical metric.  Currently, we detect them by
    periodically parsing dmesg.  However, this method isn't as user-friendly
    as using a counter.
    
    Sometimes, a short-lived issue with a NIC or hard drive can quickly
    decrease hung_task_warnings to zero.  Without warnings, we must directly
    access the node to ensure that there are no more hung tasks and that the
    system has recovered.  After all, load average alone cannot provide a
    clear picture.
    
    Once this counter is in place, in a high-density deployment pattern, we
    plan to set hung_task_timeout_secs to a lower number to improve stability,
    even though this might result in false positives.  And then we can set a
    time-based threshold: if hung tasks last beyond this duration, we will
    automatically migrate containers to other nodes.  Based on past
    experience, this approach could help avoid many production disruptions.
    
    Moreover, just like other important events such as OOM that already have
    counters, having a dedicated counter for hung tasks makes sense ;)
    
    
    This patch (of 2):
    
    This commit adds a counter, hung_task_detect_count, to track the number of
    times hung tasks are detected.
    
    IMHO, hung tasks are a critical metric. Currently, we detect them by
    periodically parsing dmesg. However, this method isn't as user-friendly as
    using a counter.
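
As an illustration of why a counter is easier to consume than dmesg output, here is a minimal userspace sketch that reads it. The commit text names the counter but not where it is exposed, so the path below is an assumption, and `parse_detect_count` and `read_detect_count` are hypothetical helper names.

```python
from pathlib import Path
from typing import Optional

# Assumption: the counter is exposed as a sysctl file at this path.
# The commit message only gives the name hung_task_detect_count.
COUNTER_PATH = Path("/proc/sys/kernel/hung_task_detect_count")


def parse_detect_count(text: str) -> int:
    """Parse the file contents (a single decimal number) into an int."""
    return int(text.strip())


def read_detect_count(path: Path = COUNTER_PATH) -> Optional[int]:
    """Return the current counter, or None if the file is not available."""
    try:
        return parse_detect_count(path.read_text())
    except (FileNotFoundError, PermissionError):
        # Older kernels, or kernels built without the hung task detector,
        # will not have the file at all.
        return None


if __name__ == "__main__":
    print(read_detect_count())
```

Compared with scanning dmesg, a single integer read like this keeps working after hung_task_warnings has reached zero and needs no log parsing.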
    
    Sometimes, a short-lived issue with a NIC or hard drive can quickly
    decrease hung_task_warnings to zero. Without warnings, we must directly
    access the node to ensure that there are no more hung tasks and that the
    system has recovered. After all, load average alone cannot provide a clear
    picture.
    
    Once this counter is in place, in a high-density deployment pattern, we
    plan to set hung_task_timeout_secs to a lower number to improve stability,
    even though this might result in false positives. And then we can set a
    time-based threshold: if hung tasks last beyond this duration, we will
    automatically migrate containers to other nodes. Based on past experience,
    this approach could help avoid many production disruptions.
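
The time-based threshold described above could be sketched roughly as follows. This is one possible reading of the policy, not the authors' implementation: `HungTaskWatch`, `threshold_secs`, and `quiet_secs` are hypothetical names, and treating a flat counter for `quiet_secs` as "recovered" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HungTaskWatch:
    """Tracks counter samples and decides when an episode has lasted too long."""

    threshold_secs: float  # hypothetical knob: how long hangs may persist
    quiet_secs: float      # hypothetical knob: flat this long => recovered
    last_count: Optional[int] = None
    episode_start: Optional[float] = None
    last_increase: Optional[float] = None

    def observe(self, now: float, count: int) -> bool:
        """Feed one sample; return True when migration should be triggered."""
        if self.last_count is not None and count > self.last_count:
            # New detections since the previous sample: open or extend an episode.
            if self.episode_start is None:
                self.episode_start = now
            self.last_increase = now
        elif (self.last_increase is not None
              and now - self.last_increase >= self.quiet_secs):
            # Counter stayed flat long enough: treat the node as recovered.
            self.episode_start = None
            self.last_increase = None
        self.last_count = count
        return (self.episode_start is not None
                and now - self.episode_start >= self.threshold_secs)


if __name__ == "__main__":
    watch = HungTaskWatch(threshold_secs=60, quiet_secs=30)
    for now, count in [(0, 5), (10, 6), (40, 7), (70, 8)]:
        print(now, count, watch.observe(now, count))
```

The first sample only establishes a baseline, so detections from before the monitor started do not open an episode. In practice `quiet_secs` would presumably need to exceed the interval at which the kernel checks for hung tasks, otherwise a still-hung node could be mistaken for a recovered one.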
    
    Moreover, just like other important events such as OOM that already have
    counters, having a dedicated counter for hung tasks makes sense.
    
    [ioworker0@gmail.com: proc_doulongvec_minmax instead of proc_dointvec]
      Link: https://lkml.kernel.org/r/20241101114833.8377-1-ioworker0@gmail.com
    Link: https://lkml.kernel.org/r/20241027120747.42833-1-ioworker0@gmail.com
    Link: https://lkml.kernel.org/r/20241027120747.42833-2-ioworker0@gmail.com
    
    
    Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
    Signed-off-by: Lance Yang <ioworker0@gmail.com>
    Cc: Bang Li <libang.li@antgroup.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Huang Cun <cunhuang@tencent.com>
    Cc: Joel Granados <j.granados@samsung.com>
    Cc: Joel Granados <joel.granados@kernel.org>
    Cc: John Siddle <jsiddle@redhat.com>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Thomas Weißschuh <linux@weissschuh.net>
    Cc: Yongliang Gao <leonylgao@tencent.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>