How can stack monitoring help developers troubleshoot Java?


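1 Preface

A program on my production line stopped working frequently, and the programmers could not find the cause. I happened to learn that Java ships the jstack tool for analyzing a program's thread stacks, so I wrote a monitoring script around it. The script helped the programmers locate the problem in their own code and resolve the fault.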

2 Best practice

2.1 Create the monitoring script

mkdir -p ~/scripts/
vim ~/scripts/iapi-jstack.sh

Add the following content:

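#!/bin/bash

# Timestamp shared by the status log entry and the dump file name
setTime=$(date +"%Y-%m-%d %H:%M:%S")
jstackStatusLog=/var/log/iapi/jstackStatus.log
jastackLog="$(dirname "$jstackStatusLog")/jstack/jstack.$setTime.log"
javaUser="iapi"
javaHome="/usr/java/jdk1.8.0_65"

# Create the dump directory if it does not exist yet
mkdir -p "$(dirname "$jastackLog")"

# jstack takes a single PID; if the user runs several java processes, only the first is dumped
javaPid=$(pgrep -u "$javaUser" java | head -n 1)
if [ -z "$javaPid" ]; then
        echo "no java process found for user $javaUser" >&2
        exit 1
fi
sudo -u "$javaUser" "$javaHome/bin/jstack" -l "$javaPid" > "$jastackLog"

# Append one "STATE: count" line per distinct thread state found in the dump
echo "$setTime" >> "$jstackStatusLog"
grep "java.lang.Thread.State:" "$jastackLog" | sed 's/^.*State:[[:space:]]*//' | sort -u |
while IFS= read -r state; do
        # Match the full "State: ..." text so WAITING does not also count TIMED_WAITING
        echo "$state: $(grep -c -F "java.lang.Thread.State: $state" "$jastackLog")" >> "$jstackStatusLog"
done
echo >> "$jstackStatusLog"

# Delete dump files older than seven days (the commented line lists the recent ones instead)
#find "$(dirname "$jastackLog")" -type f -ctime -7 -name "jstack.*.log" -exec ls -l {} \;
find "$(dirname "$jastackLog")" -type f -ctime +7 -name "jstack.*.log" -exec rm -f {} \;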

The script above is built on the following core command:

sudo -u "$javaUser" "$javaHome/bin/jstack" -l "$(pgrep -u "$javaUser" java)" > "$jastackLog"

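The script works as follows:
– The core command "jstack" dumps the threads of the process whose PID is returned by pgrep -u "$javaUser" java, and writes the result to the log file that "$jastackLog" points to
– The echo lines after the core statement append to the log file that "$jstackStatusLog" points to; they count the threads in each state
– The last line deletes dump files older than seven days
In addition, if you maintain a fixed list of states in the script, the following command filters out the states already listed and shows any that remain, so that you can add them by hand: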

grep -h "java.lang.Thread.State" /var/log/iapi/jstack/jstack.*.log | grep -v -F \
        -e "Thread.State: WAITING (parking)" \
        -e "Thread.State: WAITING (on object monitor)" \
        -e "Thread.State: TIMED_WAITING (parking)" \
        -e "Thread.State: TIMED_WAITING (sleeping)" \
        -e "Thread.State: TIMED_WAITING (on object monitor)" \
        -e "Thread.State: RUNNABLE" \
        -e "Thread.State: BLOCKED (on object monitor)"
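
As an alternative to keeping a list of known states, the counting can be done in one pass with sort and uniq, which reports every state that actually occurs. The following is only a sketch: the function name count_thread_states is hypothetical and is not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: count threads per state across one or more jstack dumps.
# Prints one "STATE: count" line per distinct state, most frequent first.
count_thread_states() {
    grep -h "java.lang.Thread.State:" "$@" |
        sed 's/^.*java\.lang\.Thread\.State:[ ]*//' |   # keep only the state text
        sort | uniq -c | sort -rn |                      # count identical states
        awk '{ n = $1; sub(/^[ \t]*[0-9]+[ \t]+/, ""); print $0 ": " n }'
}

# Example call (path taken from the article):
# count_thread_states /var/log/iapi/jstack/jstack.*.log
```

Because each state line is counted as a whole, "WAITING (parking)" and "TIMED_WAITING (parking)" are tallied separately, and a state that has never been seen before shows up without any change to the code.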

2.2 Run the monitoring script

bash ~/scripts/iapi-jstack.sh
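
The sample output in section 2.3 is spaced five minutes apart, which suggests the script ran on a schedule, although I did not record how. One possible setup is a cron job. The sketch below only builds the crontab line; the helper name cron_entry and the five-minute interval are assumptions, and the job would need to run as a user who is allowed to sudo to the Java user.

```shell
#!/bin/bash
# Hypothetical helper: build a crontab line that runs a script every N minutes.
cron_entry() {
    local minutes="$1" script="$2"
    printf '*/%s * * * * /bin/bash %s\n' "$minutes" "$script"
}

# Print the entry for a five-minute interval
cron_entry 5 "$HOME/scripts/iapi-jstack.sh"

# To install it, append the line to the current user's crontab:
# ( crontab -l 2>/dev/null; cron_entry 5 "$HOME/scripts/iapi-jstack.sh" ) | crontab -
```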

2.3 View the statistics

tail -f /var/log/iapi/jstackStatus.log

The output looks like this:

[...]
2020-10-29 02:00:01
WAITING (parking): 241
WAITING (on object monitor): 8
TIMED_WAITING (parking): 24
TIMED_WAITING (sleeping): 8
RUNNABLE: 18
BLOCKED (on object monitor): 0

2020-10-29 02:05:01
WAITING (parking): 241
WAITING (on object monitor): 8
TIMED_WAITING (parking): 25
TIMED_WAITING (sleeping): 7
RUNNABLE: 17
BLOCKED (on object monitor): 0
[...]
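
To follow a single state over time, the status log above can be reduced to one line per snapshot. This is a minimal sketch that assumes the log keeps the layout shown (a timestamp line followed by "STATE: count" lines); the function name state_history is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print "timestamp<TAB>count" for one thread state
# from a status log laid out as a timestamp line followed by "STATE: count" lines.
state_history() {
    local state="$1" log="$2"
    awk -v s="$state" '
        # Remember the most recent timestamp line
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / { ts = $0; next }
        # The line must start with the state, so WAITING does not match TIMED_WAITING
        index($0, s ": ") == 1 { print ts "\t" substr($0, length(s) + 3) }
    ' "$log"
}

# Example call:
# state_history "BLOCKED (on object monitor)" /var/log/iapi/jstackStatus.log
```

A count that keeps climbing from one snapshot to the next is usually a better hint than a single large number.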

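Based on the statistics above:
– 241 threads are in the "WAITING (parking)" state, far more than in any other state, so these threads deserve a closer look. The count alone does not prove a fault, because idle thread-pool workers also park
– The actual fault has to be pinned down by the programmers, who analyze the dumps written to the "/var/log/iapi/jstack/" directory alongside the application code
An example command for viewing a dump: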

cat /var/log/iapi/jstack/jstack.2020-10-29\ 02\:00\:01.log

A typical excerpt of the dump looks like this:

[...]
"Catalina-utility-1" #12 prio=1 os_prio=0 tid=0x00007f0e8084d800 nid=0x738e waiting on condition [0x00007f0e54133000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for   (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1088)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:745)
[...]
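
When hundreds of threads share one state, reading the dump thread by thread is slow. One way to narrow it down is to group the threads of a given state by their top stack frame. The sketch below assumes the dump layout shown above (a "java.lang.Thread.State:" line followed by "at ..." frames); the function name top_frames is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: for all threads in one state, count how many share
# the same top stack frame, most common frame first.
top_frames() {
    local state="$1" dump="$2"
    awk -v s="java.lang.Thread.State: $state" '
        index($0, s) { want = 1; next }          # state line of a matching thread
        /^$/         { want = 0 }                # blank line ends the thread entry
        want && /^[ \t]*at / {                   # first frame after the state line
            sub(/^[ \t]*at /, ""); print; want = 0
        }
    ' "$dump" | sort | uniq -c | sort -rn
}

# Example call (file name taken from the article):
# top_frames "WAITING (parking)" "/var/log/iapi/jstack/jstack.2020-10-29 02:00:01.log"
```

The top frame is often a generic park or wait call, so treat the result as a starting point and read a few frames further down in the dump to reach application code.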