Nagios Plugin Development: Monitoring the Resources Used by a Program

Posted by 65yth on 2014-5-21 09:03:45

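Normally it is enough to monitor whether a program's process is still running. This time, however, we ran into something different: a program developed in-house still had its process running, but it had deadlocked. The impact was widespread, and worse, we had no idea where the problem was. It was a colleague from the testing department who found it, which was a real embarrassment for operations... To avoid a repeat, I analyzed what the deadlock looked like and found that the deadlocked process used 100% of the CPU, whereas it normally stays within 10%. So I decided to write a Nagios plugin that monitors the resources a program uses, including CPU and memory.
1. Requirements for the shell script: it must accept thresholds for CPU and memory and raise an alert when resource usage exceeds a threshold.
It must also check that every process exists and raise an alert if any one of them is missing.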
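
The threshold requirement can be sketched as a small helper. This is only an illustration under my own assumptions, not the script from this post: the function names `exceeds` and `status_for` are made up here, and the comparison uses awk because the shell's built-in integer tests cannot compare decimal values such as 9.4.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original script.

# exceeds USAGE THRESHOLD
# Succeeds (status 0) when USAGE is strictly greater than THRESHOLD.
# awk does the comparison so that decimals like 9.4 work.
exceeds() {
    awk -v use="$1" -v crit="$2" 'BEGIN { exit !(use + 0 > crit + 0) }'
}

# status_for USAGE THRESHOLD
# Prints the exit status a Nagios plugin would use: 0 = OK, 2 = CRITICAL.
status_for() {
    if exceeds "$1" "$2"; then
        echo 2
    else
        echo 0
    fi
}

echo "9.4 against threshold 5  -> $(status_for 9.4 5)"
echo "5.6 against threshold 50 -> $(status_for 5.6 50)"
```

A value equal to the threshold is treated as OK here, which matches the "exceeds the threshold" wording of the requirement.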

2. The shell script behaves as follows:
(1) If the arguments are malformed, it prints a help message.
# sh component_resource.sh
Usage parament:
 component_resource.sh [--cpu] [--mem]

Example:
 component_resource.sh --cpu 50 --mem 50
(2) If no threshold is exceeded, it prints the resource usage and exits with 0.
# sh component_resource.sh --cpu 50 --mem 50
VueSERVER_cpu_use=5.6% VueCache_cpu_use=1.9% VueAgent_cpu_use=0.0% VueCenter_cpu_use=0.0% VueDaemon_cpu_use=0.0%;VueSERVER_mem_use=0.2% VueCache_mem_use=7.4% VueAgent_mem_use=0.5% VueCenter_mem_use=0.1% VueDaemon_mem_use=0.0%
# echo $?
0
(3) If a threshold is exceeded, it prints the resource usage and exits with 2.
# sh component_resource.sh --cpu 5 --mem 5
VueSERVER_cpu_use=9.4% VueCache_cpu_use=0.0% VueAgent_cpu_use=0.0% VueCenter_cpu_use=0.0% VueDaemon_cpu_use=0.0%;VueSERVER_mem_use=0.2% VueCache_mem_use=7.4% VueAgent_mem_use=0.5% VueCenter_mem_use=0.1% VueDaemon_mem_use=0.0%
# echo $?
2
(4) If a process does not exist, it prints the processes that are down together with the resource usage of the processes that are still running, and exits with 2.
# sh component_resource.sh --cpu 50 --mem 50
Current VueDaemon VueCenter VueAgent VueCache VueSERVER is down.
# echo $?
2

3. The shell script code is as follows:
# cat component_resource.sh
#!/bin/bash
#author:yangrong
#date:2014-05-20
#mail:10286460@qq.com

#pragrom_list=(VueDaemon VueCenter VueAgent VueCache VueSERVER VUEConnector Myswitch Slirpvde)
# Processes to monitor (the array syntax needs bash, not a plain POSIX sh)
pragrom_list=(VueDaemon VueCenter VueAgent VueCache VueSERVER)

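#### Read the CPU threshold and the memory threshold #######
# The two options may come in either order, so check positions 1 and 3.
case "$1" in
--cpu)
   cpu_crit=$2
   ;;
--mem)
   mem_crit=$2
   ;;
esac

case "$3" in
--cpu)
   cpu_crit=$4
   ;;
--mem)
   mem_crit=$4
   ;;
esac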

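### Validate the arguments: var=1 means invalid, var=0 means valid ####
if [[ "$1" == "$3" ]];then
   # The same option was given twice
   var=1
elif [ $# -ne 4 ];then
   # There must be exactly 4 arguments
   var=1
elif [[ -z "$cpu_crit" || -z "$mem_crit" ]];then
   # An unknown option left one of the thresholds unset
   var=1
else
   var=0
fi

### Print the usage message
if [ $var -eq 1 ];then
echo "Usage parament:"
echo " $0 [--cpu] [--mem]"
echo ""
echo "Example:"
echo " $0 --cpu 50 --mem 50"
# Exit with 3 (Nagios UNKNOWN); a bare exit would return 0, which Nagios reads as OK
exit 3
fi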

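### Collect the processes that are not running into one variable
num=$(( ${#pragrom_list[@]}-1 ))

NotExist=""
for digit in `seq 0 $num`
do
# Count the ps lines that mention this process name
a=`ps -ef|grep -v grep |grep "${pragrom_list[$digit]}"|wc -l`
if [ $a -eq 0 ];then
   NotExist="$NotExist ${pragrom_list[$digit]}"
   # Drop it from the list so the usage loop below skips it
   unset "pragrom_list[$digit]"
fi
done
#echo "pragrom_list=${pragrom_list[@]}"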

#### Compare each process's resource usage with the thresholds
cpu_use_all=""
mem_use_all=""
compare_cpu_temp=0
compare_mem_temp=0
for n in "${pragrom_list[@]}"
do
# In top's default batch layout, column 9 is %CPU and column 10 is %MEM.
# head -n 1 keeps a single value if several lines match the name.
cpu_use=`top -b -n1|grep "$n"|head -n 1|awk '{print $9}'`
mem_use=`top -b -n1|grep "$n"|head -n 1|awk '{print $10}'`
if [[ $cpu_use == "" ]];then
   cpu_use=0
fi
if [[ $mem_use == "" ]];then
   mem_use=0
fi

# bc prints 1 when the comparison is true, 0 otherwise
compare_cpu=`echo "$cpu_use > $cpu_crit"|bc`
compare_mem=`echo "$mem_use > $mem_crit"|bc`
if [[ $compare_cpu == 1 ]];then
   compare_cpu_temp=1
fi
if [[ $compare_mem == 1 ]];then
   compare_mem_temp=1
fi

cpu_use_all="${n}_cpu_use=${cpu_use}% ${cpu_use_all}"
mem_use_all="${n}_mem_use=${mem_use}% ${mem_use_all}"
done

### A non-empty NotExist means at least one process is down: exit with 2
if [[ "$NotExist" != "" ]];then
echo -e "Current ${NotExist} is down. $cpu_use_all;$mem_use_all"
exit 2
### compare_cpu_temp is 1 when a process exceeds the CPU threshold: exit with 2
elif [[ "$compare_cpu_temp" == 1 ]];then
echo -e "$cpu_use_all;$mem_use_all"
exit 2

## compare_mem_temp is 1 when a process exceeds the memory threshold: exit with 2
elif [[ $compare_mem_temp == 1 ]];then
echo -e "$cpu_use_all;$mem_use_all"
exit 2
## Otherwise everything is normal: print the CPU and memory percentages and exit with 0
else
echo -e "$cpu_use_all;$mem_use_all"
exit 0
fi
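
The two `case` blocks at the top of the script accept the options in either order, but only in positions 1 and 3. One possible alternative is a `while`/`shift` loop that handles everything in one place. The following is a sketch rather than the code I actually run: the function name `parse_args` is hypothetical, while `cpu_crit` and `mem_crit` are the variable names the script already uses.

```shell
#!/bin/bash
# Hypothetical alternative to the two case blocks, not the original code.
# Sets cpu_crit and mem_crit.
# Returns non-zero on an unknown option, a missing value, or a missing threshold.
parse_args() {
    cpu_crit=""
    mem_crit=""
    while [ $# -gt 0 ]; do
        case "$1" in
            --cpu) [ $# -ge 2 ] || return 1; cpu_crit=$2; shift 2 ;;
            --mem) [ $# -ge 2 ] || return 1; mem_crit=$2; shift 2 ;;
            *)     return 1 ;;
        esac
    done
    # Both thresholds are required.
    [ -n "$cpu_crit" ] && [ -n "$mem_crit" ]
}

if parse_args --mem 50 --cpu 80; then
    echo "cpu_crit=$cpu_crit mem_crit=$mem_crit"
else
    echo "bad arguments"
fi
```

With this shape, adding another option later would only need one more `case` branch.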

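4. Afterword: I have been writing more and more shell scripts lately, and I inevitably have to go back and modify older ones. It often takes me a while to understand them again.
To make later maintenance easier, I now add comments to every function and every functional section of a script, so that I or someone else can maintain it afterwards.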
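
As an illustration of a small commented function, the process-existence check could be factored out roughly as below. This is a sketch under my own assumptions, not part of the script above: the names `name_listed` and `is_running` are made up, and it matches the exact command name from `ps -e -o comm=` instead of grepping the full `ps -ef` line. Note that Linux truncates command names to 15 characters, so this only suits short names like the ones used here.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original script.

# name_listed NAME
# Reads one command name per line on stdin.
# Succeeds when NAME appears as a whole line (fixed string, exact match).
name_listed() {
    grep -Fxq -- "$1"
}

# is_running NAME
# Succeeds when at least one process has exactly this command name.
is_running() {
    ps -e -o comm= | name_listed "$1"
}

# Collect the names that are not running (sample names from this post).
missing=""
for name in VueDaemon VueCenter; do
    is_running "$name" || missing="$missing $name"
done
echo "missing:${missing:- none}"
```

Because the match is exact, a process called VueDaemonX would not be counted as VueDaemon, and no `grep -v grep` step is needed.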


运维网's Archiver
