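Nagios Installation and Monitoring Deployment Example
1. Pre-installation preparation
Three machines running RHEL 6.2 x64: one acts as the server (192.168.5.203) and the other two are monitored hosts (192.168.5.204, with the HTTP service installed and running, and 192.168.5.206, with the MySQL service installed and running).
Note: the HTTP service is monitored on 192.168.5.204 and the MySQL service on 192.168.5.206.
Packages needed on the server: nagios-3.2.3.tar.gz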
nagios-plugins-1.4.14.tar.gz
httpd-2.2.23.tar.bz2
php-5.4.10.tar.gz
nrpe-2.12.tar.gz
Download: http://pan.baidu.com/s/1c0lHEH6
Packages needed on the two clients: nagios-plugins-1.4.14.tar.gz
nrpe-2.12.tar.gz
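Before starting, it may help to confirm that every tarball is actually present. The following is only a minimal sketch: the missing_packages helper name and the /root download directory are assumptions made for illustration, while the file names come from the list above.

```shell
#!/bin/sh
# Print the name of every expected tarball that is NOT found in the given directory.
# missing_packages is a hypothetical helper, not part of Nagios.
missing_packages() {
    dir=$1
    shift
    for pkg in "$@"; do
        [ -f "$dir/$pkg" ] || printf '%s\n' "$pkg"
    done
}

# Server-side package list from this guide.
SERVER_PKGS="nagios-3.2.3.tar.gz nagios-plugins-1.4.14.tar.gz httpd-2.2.23.tar.bz2 php-5.4.10.tar.gz nrpe-2.12.tar.gz"

# Report what is missing under /root (assumed download directory).
# SERVER_PKGS is left unquoted on purpose so that it splits into separate names.
missing_packages /root $SERVER_PKGS
```

No output means nothing is missing; on the clients the same helper can be called with just the two client package names.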
On the server:
1) Create the nagios user and group
# pwd
/root
# useradd -s /sbin/nologin nagios
# mkdir /usr/local/nagios
# chown -R nagios.nagios /usr/local/nagios/
2) Start the system's sendmail service
# /etc/init.d/sendmail start
sendmail only needs to be running; no further configuration is required.
2. Compile and install Nagios
# tar zxvf nagios-3.2.3.tar.gz
# cd nagios-3.2.3
# ./configure --prefix=/usr/local/nagios
# make all
# make install
# make install-init
# make install-commandmode
# make install-config
# chkconfig --add nagios
# chkconfig --level 35 nagios on
# echo "/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg" >> /etc/rc.local
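This guide appends start commands to /etc/rc.local with echo and >>, so running a step twice leaves duplicate lines. A minimal sketch of an idempotent append follows; the append_once helper is hypothetical, and the demonstration writes to a scratch file rather than the real /etc/rc.local.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
append_once() {
    line=$1
    file=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstration on a scratch file; point RC_FILE at /etc/rc.local for real use.
RC_FILE=$(mktemp)
NAGIOS_START="/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg"
append_once "$NAGIOS_START" "$RC_FILE"
append_once "$NAGIOS_START" "$RC_FILE"   # second call changes nothing
cat "$RC_FILE"
```

The same helper could be reused for the apachectl and nrpe start lines added later in this guide.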
3. Install the Nagios plugins
# tar zxvf nagios-plugins-1.4.14.tar.gz
# cd nagios-plugins-1.4.14
# ./configure --prefix=/usr/local/nagios
# make
# make install
4. Install Apache and PHP
# tar jxvf httpd-2.2.23.tar.bz2
# cd httpd-2.2.23
# ./configure --prefix=/usr/local/apache2
# make && make install
# tar zxvf php-5.4.10.tar.gz
# cd php-5.4.10
# ./configure --prefix=/usr/local/php \
> --with-gd --with-zlib --with-apxs2=/usr/local/apache2/bin/apxs
# make && make install
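Configure Apache
1) First, in /usr/local/apache2/conf/httpd.conf, change the user and group that the Apache processes run as to nagios
Change to: (around line 67)
User nagios
Group nagios
2) Then find DirectoryIndex (around line 168) and set it to:
DirectoryIndex index.html index.php
3) Add the following (around line 311):
AddType application/x-httpd-php .php
4) Access to the Nagios web interface should require authentication. Append the following to the end of httpd.conf: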
ScriptAlias /nagios/cgi-bin "/usr/local/nagios/sbin"
<Directory "/usr/local/nagios/sbin">
AuthType Basic
Options ExecCGI
AllowOverride None
Order allow,deny
Allow from all
AuthName "Nagios Access"
AuthUserFile /usr/local/nagios/etc/htpasswd
Require valid-user
</Directory>
Alias /nagios "/usr/local/nagios/share"
<Directory "/usr/local/nagios/share">
AuthType Basic
Options None
AllowOverride None
Order allow,deny
Allow from all
AuthName "nagios Access"
AuthUserFile /usr/local/nagios/etc/htpasswd
Require valid-user
</Directory>
5) Create the Apache authentication file htpasswd (any user name and password will do; this guide uses ixdba)
# /usr/local/apache2/bin/htpasswd \
> -c /usr/local/nagios/etc/htpasswd ixdba
New password:
Re-type new password:
Adding password for user ixdba
6) Start the Apache service
# /usr/local/apache2/bin/apachectl start
# echo "/usr/local/apache2/bin/apachectl start" >> /etc/rc.local
5. Install the NRPE add-on on the server (192.168.5.203) to monitor remote hosts
# tar zxvf nrpe-2.12.tar.gz
# cd nrpe-2.12
# ./configure
# make all
# make install-plugin
# echo "/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d" >> /etc/rc.local
6. Install the Nagios plugins and NRPE on the two monitored hosts
1) On the monitored host (192.168.5.204), install nagios-plugins
# useradd -s /sbin/nologin nagios
# tar zxvf nagios-plugins-1.4.14.tar.gz
# cd nagios-plugins-1.4.14
# ./configure
# make
# make install
# chown nagios.nagios /usr/local/nagios/
# chown -R nagios.nagios /usr/local/nagios/libexec/
2) On the monitored host (192.168.5.204), install nrpe
# tar zxvf nrpe-2.12.tar.gz
# cd nrpe-2.12
# ./configure
# make all
# make install-plugin
# make install-daemon
# make install-daemon-config
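# echo "/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d" >> /etc/rc.local
Note: repeat steps 1) and 2) on 192.168.5.206.
3) On the monitored host (192.168.5.204), edit /usr/local/nagios/etc/nrpe.cfg and change line 79 to:
allowed_hosts=127.0.0.1,192.168.5.203 (separate the addresses with a comma and no spaces)
Then start the nrpe daemon. Output like the following means it started successfully; the default port is 5666.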
# /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
# ps -ef | grep nrpe
nagios 21885 1 0 Sep09 ? 00:00:08 /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
# netstat -tunl | grep 5666
tcp 0 0 0.0.0.0:5666 0.0.0.0:* LISTEN
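Since the same allowed_hosts change is needed on both monitored hosts, it can be scripted. This is only a sketch: set_allowed_hosts is a hypothetical helper, and the demonstration edits a scratch file with made-up contents instead of the real /usr/local/nagios/etc/nrpe.cfg.

```shell
#!/bin/sh
# Rewrite the allowed_hosts line of an nrpe.cfg-style file in place.
# $1 = config file, $2 = comma-separated address list without spaces.
set_allowed_hosts() {
    cfg=$1
    hosts=$2
    tmp=$(mktemp) || return 1
    # Write through a temporary file, then copy back so ownership and mode are kept.
    sed "s/^allowed_hosts=.*/allowed_hosts=$hosts/" "$cfg" > "$tmp" && cat "$tmp" > "$cfg"
    rm -f "$tmp"
}

# Demonstration on a scratch copy; use /usr/local/nagios/etc/nrpe.cfg on a real client.
CFG=$(mktemp)
printf 'server_port=5666\nallowed_hosts=127.0.0.1\n' > "$CFG"
set_allowed_hosts "$CFG" "127.0.0.1,192.168.5.203"
grep '^allowed_hosts=' "$CFG"
```

The nrpe daemon still has to be restarted afterwards for the change to take effect.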
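4) On the server (192.168.5.203), test whether it can communicate with the clients by running the commands below. If a version number is printed, the server can reach the client.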
# /usr/local/nagios/libexec/check_nrpe -H 192.168.5.204
NRPE v2.12
# /usr/local/nagios/libexec/check_nrpe -H 192.168.5.206
NRPE v2.12
5) On the server (192.168.5.203), define a check_nrpe command
# vim /usr/local/nagios/etc/objects/commands.cfg
define command{
command_name check_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
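6) On the monitored host (192.168.5.204), add the new check definitions
check_tcp80 is the command name; it uses the /usr/local/nagios/libexec/check_tcp plugin with -p 80 for the port and -t 10 for a 10-second timeout (around line 204). -H 127.0.0.1 is included because check_tcp needs a host to connect to, and NRPE runs the check on the monitored host itself.
# vim /usr/local/nagios/etc/nrpe.cfg
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_hda1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/hda1
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
command[check_tcp80]=/usr/local/nagios/libexec/check_tcp -H 127.0.0.1 -p 80 -t 10
Note: after every change to nrpe.cfg, the nrpe process must be restarted for the change to take effect: kill the process, then start it again.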
# ps -ef | grep nrpe
nagios 6508 1 0 09:32 ? 00:00:00 /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
# kill 6508
# /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
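On the monitored host (192.168.5.206), define check_tcp3306 as the command name, using the /usr/local/nagios/libexec/check_tcp plugin with -p 3306 for the port and -t 10 for a 10-second timeout (around line 204); -H 127.0.0.1 again points the check at the monitored host itself
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_hda1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/hda1
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
command[check_tcp3306]=/usr/local/nagios/libexec/check_tcp -H 127.0.0.1 -p 3306 -t 10
7) On the server (192.168.5.203), test whether the new commands can be run remotely; TCP OK in the output means they work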
# /usr/local/nagios/libexec/check_nrpe -H 192.168.5.204 -c check_tcp80
TCP OK - 0.000 second response time on port 80|time=0.000421s;;;0.000000;10.000000
# /usr/local/nagios/libexec/check_nrpe -H 192.168.5.206 -c check_tcp3306
TCP OK - 0.000 second response time on port 3306|time=0.000431s;;;0.000000;10.000000
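Besides the text output, every Nagios plugin reports its result through its exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The sketch below wraps the two checks above and prints the state name; plugin_state and run_remote_check are hypothetical helper names, and the real checks are only attempted where check_nrpe exists at the path used in this guide.

```shell
#!/bin/sh
# Map a Nagios plugin exit code to its state name.
plugin_state() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

CHECK_NRPE=/usr/local/nagios/libexec/check_nrpe

# Run one remote check through check_nrpe and print "host command STATE".
run_remote_check() {
    rc=0
    "$CHECK_NRPE" -H "$1" -c "$2" >/dev/null 2>&1 || rc=$?
    echo "$1 $2 $(plugin_state "$rc")"
}

# Only attempt the real checks where check_nrpe is installed.
if [ -x "$CHECK_NRPE" ]; then
    run_remote_check 192.168.5.204 check_tcp80
    run_remote_check 192.168.5.206 check_tcp3306
fi
```

A state other than OK here usually points at the client side: the command name missing from nrpe.cfg, nrpe not restarted, or the server address missing from allowed_hosts.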
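7. On the server (192.168.5.203), add the monitored hosts and services
1) templates.cfg (default definitions; no editing needed)
Location: /usr/local/nagios/etc/objects/templates.cfg
2) resource.cfg (only one line matters, around line 26; the default is the line below)
# vim /usr/local/nagios/etc/resource.cfg
$USER1$=/usr/local/nagios/libexec
3) commands.cfg (the check_nrpe command was already defined above; no further editing needed)
4) hosts.cfg (does not exist by default and must be created by hand; it defines the names and IP addresses of the monitored hosts; do not forget the opening and closing braces)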
# pwd
/usr/local/nagios/etc/objects
# vim /usr/local/nagios/etc/objects/hosts.cfg
define host{
use linux-server ; linux-server is the default template, defined in templates.cfg
host_name web ; any host name may be used
alias ixdba-web ; any alias may be used
address 192.168.5.204 ; address of the monitored host
}
define host{
use linux-server
host_name mysql
alias ixdba-mysql
address 192.168.5.206
}
define hostgroup{ ; define a host group
hostgroup_name sa-server ; any group name may be used
alias sa server ; host group alias
members web,mysql ; the two hosts defined above
}
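When more hosts are added later, typing each define host block by hand invites missing braces. A minimal sketch that prints the blocks from a name, alias and address follows; host_block is a hypothetical helper, not a Nagios tool, and its output would still need to be redirected into hosts.cfg.

```shell
#!/bin/sh
# Emit one Nagios "define host" block using the fields from the hosts.cfg example.
# $1 = host_name, $2 = alias, $3 = address
host_block() {
    cat <<EOF
define host{
        use             linux-server
        host_name       $1
        alias           $2
        address         $3
        }
EOF
}

# Print definitions for the two monitored machines from this guide.
host_block web   ixdba-web   192.168.5.204
host_block mysql ixdba-mysql 192.168.5.206
```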
5) services.cfg (does not exist by default and must be created by hand; it defines the services monitored on each host)
# vim /usr/local/nagios/etc/objects/services.cfg
define service{
use local-service ; default local-service template, defined in templates.cfg
host_name web ; the web host (192.168.5.204), defined in hosts.cfg
service_description PING ; free-form description; choose a name close to the service
; check_ping runs on the server; the fields, left to right, are
; command!warning latency,packet loss!critical latency,packet loss
check_command check_ping!100.0,20%!500.0,60%
}
define service{
use local-service ; default local-service template, defined in templates.cfg
host_name web ; the web host (192.168.5.204), defined in hosts.cfg
service_description web80 ; free-form description; choose a name close to the service
check_command check_nrpe!check_tcp80 ; check_tcp80 is defined in nrpe.cfg on the monitored host
}
define service{
use local-service
host_name mysql
service_description PING
check_command check_ping!100.0,20%!500.0,60%
}
define service{
use local-service
host_name mysql
service_description mysql3306
check_command check_nrpe!check_tcp3306
}
define servicegroup{ ; define a service group (optional)
servicegroup_name servergroup
alias server-group
members web,PING,web,web80,mysql,PING,mysql,mysql3306
}
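The check_command values use "!" as a separator: the first field is the command name and the remaining fields are passed to the command definition as $ARG1$, $ARG2$ and so on. The sketch below only illustrates that splitting in shell; show_check_command is a hypothetical helper and is not how Nagios itself parses the value.

```shell
#!/bin/sh
# Split a check_command value on "!" and print the command name and its arguments.
show_check_command() {
    old_ifs=$IFS
    IFS='!'
    # Unquoted on purpose so the value is split on "!".
    set -- $1
    IFS=$old_ifs
    echo "command=$1"
    shift
    n=1
    for arg in "$@"; do
        echo "ARG$n=$arg"
        n=$((n + 1))
    done
}

show_check_command 'check_ping!100.0,20%!500.0,60%'
show_check_command 'check_nrpe!check_tcp80'
```

For check_nrpe!check_tcp80, the single argument check_tcp80 is what replaces $ARG1$ in the check_nrpe command line defined in commands.cfg.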
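6) contacts.cfg (defines contacts and contact groups)
# vim /usr/local/nagios/etc/objects/contacts.cfg
define contact{
contact_name nagiosadmin ; contact name; the default is fine
use generic-contact ; inherit from generic-contact, defined in templates.cfg
alias Nagios Admin ; full name of the user
email 15901392876@139.com ; e-mail address (a China Mobile mailbox is suggested so that SMS alerts can be enabled)
}
define contactgroup{
contactgroup_name admins ; contact group name; the default is fine
alias Nagios Administrators
members nagiosadmin
}
7) timeperiods.cfg (defines the monitoring time periods; already defined by default, no changes needed)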
# vim /usr/local/nagios/etc/objects/timeperiods.cfg
define timeperiod{
timeperiod_name 24x7
alias 24 Hours A Day, 7 Days A Week
sunday 00:00-24:00
monday 00:00-24:00
tuesday 00:00-24:00
wednesday 00:00-24:00
thursday 00:00-24:00
friday 00:00-24:00
saturday 00:00-24:00
}
8) cgi.cfg (controls the CGI scripts; only the user's permissions need to be added in this file)
# vim /usr/local/nagios/etc/cgi.cfg
default_user_name=ixdba
authorized_for_system_information=ixdba
authorized_for_configuration_information=ixdba
authorized_for_system_commands=ixdba
authorized_for_all_services=ixdba
authorized_for_all_hosts=ixdba
authorized_for_all_service_commands=ixdba
authorized_for_all_host_commands=ixdba
9) nagios.cfg (the main Nagios configuration file)
# vim /usr/local/nagios/etc/nagios.cfg
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
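cfg_file=/usr/local/nagios/etc/objects/hosts.cfg (add this line)
cfg_file=/usr/local/nagios/etc/objects/services.cfg (add this line)
Also make sure that use_authentication=1 is set (change 0 to 1 if necessary, around line 78). This directive belongs in cgi.cfg, not in nagios.cfg.
8. Verify the Nagios configuration
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
If errors are reported, fix them in the file and line indicated. A correct configuration ends with 0 warnings and 0 errors: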
Checking for circular paths between hosts...
Checking for circular host and service dependencies...
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...
Total Warnings: 0
Total Errors: 0
# service nagios start
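The verification step can be made a gate so that Nagios is only started when the check is clean. A minimal sketch, assuming the summary lines look like the "Total Warnings" and "Total Errors" lines shown above; config_is_clean is a hypothetical helper, and the real check only runs where the nagios binary exists at the path used in this guide.

```shell
#!/bin/sh
# Read the output of "nagios -v" on stdin and succeed only if it reports
# both "Total Warnings: 0" and "Total Errors: 0".
config_is_clean() {
    awk '
        /^Total Warnings:/ { w = $3; seen_w = 1 }
        /^Total Errors:/   { e = $3; seen_e = 1 }
        END { exit !(seen_w && seen_e && w == 0 && e == 0) }
    '
}

NAGIOS=/usr/local/nagios/bin/nagios
NAGIOS_CFG=/usr/local/nagios/etc/nagios.cfg

# Only run on a machine where Nagios is installed at the path used in this guide.
if [ -x "$NAGIOS" ]; then
    if "$NAGIOS" -v "$NAGIOS_CFG" | config_is_clean; then
        echo "configuration is clean; safe to run: service nagios start"
    else
        echo "fix the reported warnings/errors before starting Nagios" >&2
    fi
fi
```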
9. Log in to the monitoring interface at http://192.168.5.203/nagios and enter the user name ixdba and its password
http://s3.运维网.com/wyfs02/M02/49/49/wKioL1QSar2jdDcDAAOeAg9VScY330.jpg
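Click Services to see the monitored services. localhost is the local machine, which is monitored by default; the services for mysql (192.168.5.206) and web (192.168.5.204) appear as well.
10. Simulate a failure of the HTTP service on web and wait for the alert
# service httpd stop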
http://s3.运维网.com/wyfs02/M00/49/48/wKiom1QScJfDffH_AAHqBHL6iHk594.jpg
An alert e-mail and an SMS notification are sent as well.
http://s3.运维网.com/wyfs02/M02/49/4A/wKioL1QScPOgNRuDAAFFCUUTsrc369.jpg
11. Simulate recovery of web
# service httpd start
http://s3.运维网.com/wyfs02/M00/49/4A/wKioL1QSccGxGL39AAI4XRUx96U025.jpg
A recovery e-mail and an SMS notification are sent as well.
http://s3.运维网.com/wyfs02/M00/49/48/wKiom1QScdrARrwxAAGfH5zSuEw902.jpg
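Note: service monitoring, alerting, and e-mail/SMS notification now all work. However, about 30 minutes passed between the web failure (11:29) and the alert (12:01), which is far too long.
So the Nagios check settings still need some tuning.
Make the changes in templates.cfg: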
# vim /usr/local/nagios/etc/objects/templates.cfg
http://s3.运维网.com/wyfs02/M00/49/4A/wKioL1QSdPXC-1_ZAAEqcQ2JtYs724.jpg
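Line 72: check_interval, the interval between host checks; change to 1 (minute)
Line 73: retry_interval, the interval between retry checks; change to 1 (minute)
Line 74: max_check_attempts, the maximum number of host check attempts; change to 1
Line 76: notification_period, the time period during which failure notifications are sent; change to 24x7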
http://s3.运维网.com/wyfs02/M00/49/48/wKiom1QSdf_hXAjlAAIrdKnn2rk120.jpg
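Line 169: max_check_attempts, the maximum number of service check attempts; change to 2 (a count, not minutes)
Line 170: normal_check_interval, the interval between service checks; change to 1 (minute)
Line 171: retry_check_interval, the interval between retry checks; change to 1 (minute)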
http://s3.运维网.com/wyfs02/M00/49/4A/wKioL1QSdqmSsJ7YAADI1VCWXQY132.jpg
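Line 185: max_check_attempts, the maximum number of service check attempts; change to 2 (a count, not minutes)
Line 186: normal_check_interval, the interval between service checks; change to 1 (minute)
Line 187: retry_check_interval, the interval between retry checks; change to 1 (minute)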
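The effect of these three settings on alert delay can be estimated with simple arithmetic. As a rough upper bound, a failure is confirmed after one regular check interval plus (max_check_attempts - 1) retry intervals. The sketch below is an approximation under stated assumptions: intervals are in minutes, checks run exactly on schedule, and check execution time and mail or SMS delivery are ignored. max_detect_minutes is a hypothetical helper.

```shell
#!/bin/sh
# Rough upper bound, in minutes, between a failure and the confirmed (hard) state.
# $1 = check interval, $2 = retry interval, $3 = max_check_attempts
max_detect_minutes() {
    echo $(( $1 + ($3 - 1) * $2 ))
}

max_detect_minutes 1 1 2   # service values from the tuning above
max_detect_minutes 1 1 1   # host values from the tuning above
```

With the tuned values, the estimate is 2 minutes for a service and 1 minute for a host.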
12. Simulate the failure again; the alert now arrives much sooner.