Nagios: A Hard-Won Hands-On Setup

Posted by wolong on 2019-1-17 09:18:42

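I have long known how powerful Nagios monitoring is but never had the time to set it up. Today I finally tried it hands-on. Thanks to Tian Yi (田逸) for the guidance!
I. Test environment
Two Linux hosts. I had no Windows machine, so Windows monitoring is not covered. The two hosts are:
Server:          nagios-server    192.168.4.168
Monitored host:  heartbeat02      192.168.4.167
II. Steps on the Nagios server
1. Install Nagios
1) # useradd nagios -s /sbin/nologin    (adds the nagios account and automatically creates a group of the same name)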
2）# tar -zxvf nagios-3.0.3.tar.gz
# cd nagios-3.0.3
# ./configure --prefix=/usr/local/nagios --with-nagios-user=nagios --with-nagios-group=nagios
# make all
# make install
# make install-init
# make install-config
# make install-commandmode
After installing Nagios, the following directories exist under the install prefix /usr/local/nagios:
# ls
bin  etc  libexec  sbin  share  var
What each one holds:
bin-----------Nagios executables
etc-----------Nagios configuration files
sbin----------Nagios CGI files
share---------Nagios web pages
var-----------Nagios log files, the pid file, and similar
2. Install the Nagios plugins (nagios-plugins-1.4.12.tar.gz)
# tar -zxvf nagios-plugins-1.4.12.tar.gz
# cd nagios-plugins-1.4.12
# ./configure --prefix=/usr/local/nagios --with-nagios-user=nagios --with-nagios-group=nagios
# make
# make install
After installation, the directory libexec (containing many files) appears under /usr/local/nagios; this is exactly what Nagios needs.
# ls
check_apt       check_dns        check_ifoperstatus  check_mailq        check_nt        check_pop      check_ssh    ip_conn.sh
check_breeze    check_dummy      check_ifstatus      check_mrtg         check_ntp       check_procs    check_ssmtp  negate
check_by_ssh    check_file_age   check_imap          check_mrtgtraf     check_ntp_peer  check_real     check_swap   urlize
check_clamd     check_flexlm     check_ircd          check_mysql        check_ntp_time  check_rpc      check_tcp    utils.pm
check_cluster   check_ftp        check_jabber        check_mysql_query  check_nwstat    check_sensors  check_time   utils.sh
check_dhcp      check_hpjd       check_ldap          check_nagios       check_oracle    check_simap    check_udp
check_dig       check_http       check_ldaps         check_nntp         check_overcr    check_smtp     check_ups
check_disk      check_icmp       check_load          check_nntps        check_pgsql     check_snmp     check_users
check_disk_smb  check_ide_smart  check_log           check_nrpe         check_ping      check_spop     check_wave
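3. Deploying Nagios
1) Overview
Configuration is the most complicated part of Nagios, but with some patience, working through it file by file, getting it right is not hard. On a fresh install the configuration files live in /usr/local/nagios/etc.
First rename the files by dropping the -sample suffix, for example cgi.cfg-sample becomes cgi.cfg. Then, without touching localhost.cfg, you can run:
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
to verify that the program can start. The configuration files are:
# ls
cgi.cfg  nagios.cfg  objects (directory)  resource.cfg
What they do:
nagios.cfg------------main configuration file (needs editing)
cgi.cfg---------------controls what can be done from the browser, such as restarting Nagios (needs editing)
resource.cfg----------defines the plugin path (no changes needed)
The objects directory holds further files:
contacts.cfg  hostgroups.cfg  contactgroups.cfg  hosts.cfg  services.cfg  (object configuration files, e.g. contacts and hosts)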
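The renaming step described above can be done in one pass instead of file by file. This is only a minimal sketch: the function name strip_sample_suffix is my own, and it assumes the sample files all end in exactly "-sample".

```shell
# Strip the "-sample" suffix from every sample file in one directory.
# The function name is illustrative; pass your own config directory.
strip_sample_suffix() {
    dir="$1"
    for f in "$dir"/*-sample; do
        # When nothing matches, the glob stays literal, so skip it.
        [ -e "$f" ] || continue
        mv -- "$f" "${f%-sample}"
    done
}

# Example (path taken from the install prefix used in this post):
# strip_sample_suffix /usr/local/nagios/etc
```

Files that do not end in -sample are left untouched, so running it twice is harmless.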
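2) Edit the configuration files
a) Edit the main configuration file nagios.cfg. For easier maintenance, each kind of object goes in its own file; contacts, for example, are defined in contacts.cfg. nagios.cfg is long, so I only show what I changed:
# comment out or delete this line
#cfg_file=/usr/local/nagios/etc/localhost.cfg

# host definitions
cfg_file=/usr/local/nagios/etc/hosts.cfg

# host group definitions
cfg_file=/usr/local/nagios/etc/hostgroups.cfg

# contact definitions
cfg_file=/usr/local/nagios/etc/contacts.cfg

# contact group definitions
cfg_file=/usr/local/nagios/etc/contactgroups.cfg

# service definitions
cfg_file=/usr/local/nagios/etc/services.cfg

# time period definitions
cfg_file=/usr/local/nagios/etc/timeperiods.cfg

# allow restarting Nagios, disabling host/service checks, etc. from the web interface; the default value is 0
check_external_commands=1
# interval between external command checks, set to suit your environment (the trailing "s" means seconds); the default is 1 second
command_check_interval=10s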
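Since the files named by cfg_file have to be created by hand, it is easy to leave one out. Before running the Nagios verification, a small helper can list the paths that do not exist yet. This is a sketch only; the function name missing_cfg_files is made up for illustration, and it reads nothing but uncommented cfg_file= lines.

```shell
# Print every cfg_file path from a main config that does not exist on disk.
# Commented-out lines (starting with #) are ignored.
missing_cfg_files() {
    main_cfg="$1"
    sed -n 's/^[[:space:]]*cfg_file=//p' "$main_cfg" |
        while IFS= read -r path; do
            [ -f "$path" ] || printf '%s\n' "$path"
        done
}

# Example:
# missing_cfg_files /usr/local/nagios/etc/nagios.cfg
```

Empty output means every referenced file is present; it does not check the contents, which is still the job of nagios -v.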
b) Edit the CGI configuration file cgi.cfg. As with nagios.cfg, I only show the changed lines:
# separate multiple users with commas
authorized_for_system_information=sery
authorized_for_configuration_information=sery
authorized_for_system_commands=sery
authorized_for_all_services=sery
authorized_for_all_hosts=nagiosadmin,sery
authorized_for_all_service_commands=sery
authorized_for_all_host_commands=sery
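The user "sery" specified here can stop, restart, and otherwise control the Nagios service from the browser.
c) Edit commands.cfg. This file already contains the e-mail alert commands, so only SMS and MSN alerting would need to be added. I skip them here and will cover SMS and MSN alerts separately.
A word on e-mail alerts first: they are simple and only need sendmail, so just start the sendmail service.
Also, if check_nrpe is not defined in commands.cfg, define it yourself in this form:
# 'check_nrpe' command definition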
define command{
command_name check_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -p 5666 -c $ARG1$
}
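To make the macros in that command_line concrete: in a service's check_command, the part after the "!" becomes $ARG1$, and $HOSTADDRESS$ comes from the host's address directive. The sketch below only prints the command line that results; it does not run anything. The function name and the PLUGIN_DIR value standing in for $USER1$ are my assumptions, based on the install prefix used in this post.

```shell
# Illustrative only: show the command line produced by a check_command
# such as "check_nrpe!check_df". PLUGIN_DIR stands in for $USER1$.
PLUGIN_DIR=/usr/local/nagios/libexec

expand_check_nrpe() {
    check_command="$1"    # e.g. check_nrpe!check_df
    host_address="$2"     # value of the host's "address" directive
    arg1=$(printf '%s' "$check_command" | cut -d'!' -f2)
    printf '%s/check_nrpe -H %s -p 5666 -c %s\n' \
        "$PLUGIN_DIR" "$host_address" "$arg1"
}

expand_check_nrpe 'check_nrpe!check_df' 192.168.4.168
```

Running the printed command by hand is a quick way to debug a service that stays in an error state.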

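d) Create the remaining configuration files
In nagios.cfg we commented out the line cfg_file=/usr/local/nagios/etc/localhost.cfg and instead use several separate files to define the objects, which makes the configuration easier to maintain and more orderly. These files do not exist out of the box; we have to create and fill them by hand. At first it is not obvious what to put in them, and the official documentation is so scattered that I did not know where to start. Opening localhost.cfg-sample cleared things up: the job is simply to split that one file into several. Below I add the files in the order I find best for bringing a new host under monitoring (other orders work too and do not affect the result). Let's start by monitoring the Nagios server itself: host liveness, web service, disk space, load, process count, and IP connection count.
(1) Host definitions: hosts.cfg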
define host {
   host_name             nagios-server
   alias                   nagios server
   address                192.168.4.168
   contact_groups          sagroup
   check_command          check-host-alive
   max_check_attempts          4
   notification_interval       3
   notification_period       24x7
   notification_options       d,u,r
   }
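Notes:
The contact group referenced by contact_groups does not exist yet; it is created in a later step.
The host check command is normally check-host-alive.
Avoid setting max_check_attempts to 1; 3 to 4 is usually reasonable.
Set notification_interval to suit your situation; its unit is minutes.
The notification_options values mean d = down, u = unreachable, r = recovery.
(2) Host group definitions: hostgroups.cfg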
define hostgroup {
   hostgroup_name sa-servers
   alias          sa servers
   members    nagios-server
   }
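Notes:
This file is optional; add it if you want hosts grouped for easier viewing in the browser.
Host group members must already be defined in hosts.cfg; separate multiple members with commas.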
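The rule that every member must already exist in hosts.cfg can be checked mechanically before Nagios complains about it. A rough sketch follows; the function names are mine, and it assumes one directive per line as in the definitions shown in this post.

```shell
# List hostgroup members that have no matching host_name in hosts.cfg.
# Function names are illustrative, not part of Nagios.
defined_hosts() {
    awk '$1 == "host_name" { print $2 }' "$1"
}

undefined_members() {
    hosts_cfg="$1"
    groups_cfg="$2"
    # Keep only the value of each "members" line, then split on commas/spaces.
    sed -n 's/^[[:space:]]*members[[:space:]]*//p' "$groups_cfg" |
        tr ', ' '\n\n' |
        while IFS= read -r member; do
            [ -n "$member" ] || continue
            defined_hosts "$hosts_cfg" | grep -qxF -- "$member" ||
                printf '%s\n' "$member"
        done
}

# Example:
# undefined_members /usr/local/nagios/etc/hosts.cfg /usr/local/nagios/etc/hostgroups.cfg
```

No output means all members are defined.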
(3) Contact definitions: contacts.cfg
define contact {
contact_name       sery
alias             system administrator
service_notification_period 24x7
host_notification_period    24x7
service_notification_options w,u,c,r
host_notification_options    d,u,r
service_notification_commands service-notify-by-email
host_notification_commands host-notify-by-email
email                      jlsfwq@hotmail.com
}
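The two parameters service_notification_commands and host_notification_commands will be pointed at SMS and MSN commands when I cover SMS and MSN alerting later.
Notes:
Service notification options: w = warning, u = unknown, c = critical, r = recovery.
Host notification options: d = down, u = unreachable, r = recovery.
The service and host notification commands are defined in commands.cfg. When an alert fires, the notification goes to the contact by e-mail (and by SMS once that is configured), using the address given in the email directive.
To deliver alerts to several mailboxes, separate the addresses with commas.
If a user defined here needs to view the status of their servers in the browser, also create an account of the same name with Apache's htpasswd tool.
(4) Contact group definitions: contactgroups.cfg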
define contactgroup {
   contactgroup_name sagroup
   alias             system administrator group
   members          sery
   }
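Notes:
Defining a group is very useful when several people share the same duties.
Separate multiple members with commas.
Members must already be defined in contacts.cfg.
(5) Service definitions: services.cfg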
define service {
   host_name    nagios-server
   service_description check-host-alive
   check_period       24x7
   max_check_attempts 4
   normal_check_interval 3
   retry_check_interval 2
   contact_groups    sagroup
   notification_interval 3
   notification_period 24x7
   notification_options w,u,c,r
   check_command       check-host-alive
   }
define service {
   host_name          nagios-server
   service_description check_tcp 80
   check_period       24x7
   max_check_attempts 4
   normal_check_interval 3
   retry_check_interval 2
   contact_groups    sagroup
   notification_interval 10
   notification_period 24x7
   notification_options w,u,c,r
   check_command    check_tcp!80
   }
define service{
   host_name             nagios-server
   service_description check-disk
   check_command       check_nrpe!check_df
   max_check_attempts    4
   normal_check_interval 3
   retry_check_interval 2
   check_period          24x7
   notification_interval 10
   notification_period 24x7
   notification_options w,u,c,r
   contact_groups       sagroup
   }
define service{
   host_name             nagios-server
   service_description check-load
   check_command       check_nrpe!check_load
   max_check_attempts    4
   normal_check_interval 3
   retry_check_interval 2
   check_period          24x7
   notification_interval 10
   notification_period 24x7
   notification_options w,u,c,r
   contact_groups       sagroup
   }
define service{
   host_name             nagios-server
   service_description total_procs
   check_command       check_nrpe!check_total_procs
   max_check_attempts    4
   normal_check_interval 3
   retry_check_interval 2
   check_period          24x7
   notification_interval 10
   notification_period 24x7
   notification_options w,u,c,r
   contact_groups       sagroup
   }
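Notes:
host_name must be a host defined in hosts.cfg.
check_command must be defined in the command configuration file or in the NRPE configuration file.
max_check_attempts is best set to 3 or 4 so that a brief network glitch does not raise a false alarm.
normal_check_interval and retry_check_interval are in minutes.
notification_interval is how often an alert is re-sent after a fault is detected, in minutes.
notification_options takes the same values as service_notification_options in the contact definitions.
contact_groups refers to groups defined in contactgroups.cfg.
Checking host resources requires installing and configuring NRPE, which is done below.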
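As a worked example of how max_check_attempts and retry_check_interval interact: after the first failed check, Nagios rechecks at the retry interval until the attempt count is reached, and only then treats the problem as confirmed and notifies. The little calculation below is a sketch that assumes the default interval_length of 60 seconds (so intervals are minutes) and ignores the time the checks themselves take.

```shell
# Rough minutes between the first failed check and the confirmed (hard)
# state: the remaining attempts each wait one retry interval.
minutes_to_hard_state() {
    max_check_attempts="$1"
    retry_check_interval="$2"
    echo $(( (max_check_attempts - 1) * retry_check_interval ))
}

# Values used in the service definitions above: 4 attempts, retry every 2.
minutes_to_hard_state 4 2    # prints 6
```

So with the settings used here, a short outage of a minute or two should not produce an alert, which matches the advice about avoiding false alarms.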
III. Deploying NRPE
1. Install NRPE
# tar -zxvf nagios-nrpe_2.8.1.orig.tar.gz
# cd nrpe_2.8.1
# ./configure --prefix=/usr/local/nrpe
# make
# make install
Note: when installing NRPE on other monitored hosts, you need to add the system user nagios first.
2. Copy files
After installing NRPE, /usr/local/nrpe/libexec contains only one file, check_nrpe, which is missing from the Nagios plugin directory, so copy it there. Conversely, plugins that NRPE calls, such as check_disk, are not in NRPE's own directory but do exist among the Nagios plugins, so copy those over as well. The copy commands are:
# cp /usr/local/nrpe/libexec/check_nrpe /usr/local/nagios/libexec
# cp /usr/local/nagios/libexec/check_disk /usr/local/nrpe/libexec
# cp /usr/local/nagios/libexec/check_load /usr/local/nrpe/libexec
# cp /usr/local/nagios/libexec/check_ping /usr/local/nrpe/libexec
# cp /usr/local/nagios/libexec/check_procs /usr/local/nrpe/libexec

3. Configure NRPE
The install does not create a usable configuration file, so copy the sample from the unpacked source directory into the install directory and edit it.
# mkdir -p /usr/local/nrpe/etc
# cp /root/nrpe_2.8.1/sample-config/nrpe.cfg /usr/local/nrpe/etc
# vi /usr/local/nrpe/etc/nrpe.cfg
Changes:
#server_address=127.0.0.1  ------>  server_address=192.168.4.168 (this host's IP)
allowed_hosts=127.0.0.1    ------>  allowed_hosts=192.168.4.168 (this host's IP)
Comment out the sample check_disk command that ends in: check_disk -w 20 -c 10 -p /dev/hda1
and replace it with: command[check_df]=/usr/local/nrpe/libexec/check_disk -w 20 -c 10
(The name in brackets, check_df, is the one the service definitions pass to check_nrpe.)
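Every name that services.cfg passes after check_nrpe! (check_df, check_load, check_total_procs, and so on) has to exist as a command[...] entry in nrpe.cfg on the host being checked. The following sketch lists the defined names so they can be compared; the function names are made up for illustration.

```shell
# Print the names of all active command[...] definitions in an nrpe.cfg.
# Commented-out definitions are skipped because they start with "#".
nrpe_command_names() {
    sed -n 's/^command\[\([^]]*\)]=.*/\1/p' "$1"
}

# Succeed only if the given command name is defined.
nrpe_has_command() {
    nrpe_command_names "$1" | grep -qxF -- "$2"
}

# Example:
# nrpe_has_command /usr/local/nrpe/etc/nrpe.cfg check_df || echo "check_df is not defined"
```

If a name is missing, check_nrpe reports that the command is not defined, so this saves one round trip.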
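4. Start the NRPE service
(1) Start NRPE as a standalone daemon, and add the command to /etc/rc.d/rc.local so it starts at boot:
# /usr/local/nrpe/bin/nrpe -c /usr/local/nrpe/etc/nrpe.cfg -d
(2) In the system log, a normal start looks like this: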
Mar 19 12:05:49 heartbeat01 nrpe: Starting up daemon
Mar 19 12:05:49 heartbeat01 nrpe: Listening for connections on port 5666
Mar 19 12:05:49 heartbeat01 nrpe: Allowing connections from: 192.168.4.168
(3) Check that port 5666 is listening and that the nrpe process is running:
# netstat -ln | grep 5666
# ps -ef | grep nrpe
(4) Check the NRPE service
# /usr/local/nrpe/libexec/check_nrpe -H 192.168.4.168
NRPE v2.8.1
If the version number is shown, NRPE is working.
(5) Check host resources through NRPE
# /usr/local/nrpe/libexec/check_nrpe -H 192.168.4.168 -c check_df
DISK OK - free space: / 6852 MB (49% inode=82%); /dev/shm 124 MB (100% inode=99%);| /=7014MB;14588;14598;0;14608 /dev/shm=0MB;104;114;0;124
Output like this means the check works.
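Besides the text, each plugin reports its result through its exit code, which is what Nagios actually uses to set the state: 0 is OK, 1 is WARNING, 2 is CRITICAL, 3 is UNKNOWN. A small helper to translate the code when testing by hand might look like this (the function name is my own, and labelling other codes as INVALID is just a choice made for this sketch):

```shell
# Map a plugin exit code to the state name Nagios associates with it.
plugin_status_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo INVALID ;;   # outside the range plugins are meant to return
    esac
}

# Example, run right after a manual check:
# /usr/local/nrpe/libexec/check_nrpe -H 192.168.4.168 -c check_df
# plugin_status_name $?
```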
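IV. Start Nagios
1. Check the configuration. The Nagios verification is very precise: whenever Nagios fails to start, the answer can be found in its error output.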
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
2. Start the Nagios service
# /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
or
# service nagios start
V. Install and configure Apache
1. Install Apache, using LAMPP (XAMPP for Linux)
tar -zxvf xampp-linux-1.6.3a.tar -C /opt

2. Configure Apache
# vi /opt/lampp/etc/httpd.conf
Append the following at the end:
ScriptAlias /nagios/cgi-bin /usr/local/nagios/sbin
<Directory "/usr/local/nagios/sbin">
   AuthType Basic
   Options ExecCGI
   AllowOverride None
   Order allow,deny
   Allow from all
   AuthName "Nagios Access"
   AuthUserFile /usr/local/nagios/etc/htpasswd
   Require valid-user
</Directory>
Alias /nagios /usr/local/nagios/share
<Directory "/usr/local/nagios/share">
   AuthType Basic
   Options None
   AllowOverride None
   Order allow,deny
   Allow from all
   AuthName "nagios Access"
   AuthUserFile /usr/local/nagios/etc/htpasswd
   Require valid-user
</Directory>

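3. Add an Apache user for authentication
# /opt/lampp/bin/htpasswd -c /usr/local/nagios/etc/htpasswd sery
Enter the password when prompted.
4. Start Apache
# /opt/lampp/lampp startapache      (starts Apache; use stopapache to stop it)
# /opt/lampp/lampp php5             (use PHP 5)
Then, from another machine, open the URL http://192.168.4.168/nagios/
Log in with the user name and password to see the Nagios interface. The monitored hosts and checks are listed under Service Detail.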

VI. Monitoring other Linux hosts with Nagios (steps on the monitored host)
Step 1: set up the nagios-client (monitored host) environment
1. Create the nagios user
# useradd nagios -s /sbin/nologin
2. Install the plugins from nagios-plugins-1.4.12.tar.gz
# ./configure --prefix=/usr/local/nagios --with-nagios-user=nagios --with-nagios-group=nagios
# make
# make install
3. Install NRPE
# tar -zxvf nagios-nrpe_2.8.1.orig.tar.gz
# cd nrpe_2.8.1
# ./configure --prefix=/usr/local/nrpe
# make
# make install
4. Copy files
As on the server, /usr/local/nrpe/libexec contains only check_nrpe after the install, and the Nagios plugin directory lacks it, so copy it there. The plugins NRPE calls, such as check_disk, exist only among the Nagios plugins, so copy those into NRPE's directory as well:
# cp /usr/local/nrpe/libexec/check_nrpe /usr/local/nagios/libexec
# cp /usr/local/nagios/libexec/check_disk /usr/local/nrpe/libexec
# cp /usr/local/nagios/libexec/check_load /usr/local/nrpe/libexec
# cp /usr/local/nagios/libexec/check_ping /usr/local/nrpe/libexec
# cp /usr/local/nagios/libexec/check_procs /usr/local/nrpe/libexec

5. Configure NRPE
The install does not create a usable configuration file, so copy the sample from the unpacked source directory into the install directory and edit it.
# mkdir -p /usr/local/nrpe/etc
# cp /root/nrpe_2.8.1/sample-config/nrpe.cfg /usr/local/nrpe/etc
# vi /usr/local/nrpe/etc/nrpe.cfg
Changes:
server_address=127.0.0.1  ------>  server_address=192.168.2.14 (set this to the monitored host's own IP)
allowed_hosts=127.0.0.1   ------>  allowed_hosts=127.0.0.1,192.168.4.168 (add the Nagios server's IP)
Comment out the sample check_disk command that ends in: check_disk -w 20 -c 10 -p /dev/hda1
and replace it with: command[check_df]=/usr/local/nrpe/libexec/check_disk -w 20 -c 10
6. Start the NRPE service
(1) Start NRPE as a standalone daemon, and add the command to /etc/rc.d/rc.local so it starts at boot:
# /usr/local/nrpe/bin/nrpe -c /usr/local/nrpe/etc/nrpe.cfg -d
(2) In the system log, a normal start looks like this:
Mar 19 12:05:49 heartbeat01 nrpe: Starting up daemon
Mar 19 12:05:49 heartbeat01 nrpe: Listening for connections on port 5666
Mar 19 12:05:49 heartbeat01 nrpe: Allowing connections from: 127.0.0.1,192.168.4.168
(3) Check that port 5666 is listening and that the nrpe process is running:
# netstat -ln | grep 5666
# ps -ef | grep nrpe
(4) Check the NRPE service
# /usr/local/nrpe/libexec/check_nrpe -H 127.0.0.1
NRPE v2.8.1
If the version number is shown, NRPE is working locally. Then run the following on nagios-server:
# /usr/local/nrpe/libexec/check_nrpe -H 192.168.4.169
NRPE v2.8.1
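If the version number is shown here too, the NRPE service is reachable from the server.
Step 2: on nagios-server, set up monitoring of the nagios-client host
(1) Define the host in hosts.cfg, for example: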
define host {
   host_name                192.168.4.169
   alias                   heartbeat02
   address                192.168.4.169
   contact_groups          sagroup
   check_command          check-host-alive
   max_check_attempts          4
   notification_interval       3
   notification_period       24x7
   notification_options       d,u,r
   }
(2) Define the monitored services in services.cfg, for example:
define service {
   host_name          192.168.4.169
   service_description check-host-alive
   check_period       24x7
   max_check_attempts 4
   normal_check_interval 3
   retry_check_interval 2
   contact_groups    sagroup
   notification_interval 3
   notification_period 24x7
   notification_options w,u,c,r
   check_command       check-host-alive
   }
define service {
   host_name          192.168.4.169
   service_description check_tcp 80
   check_period       24x7
   max_check_attempts 4
   normal_check_interval 3
   retry_check_interval 2
   contact_groups    sagroup
   notification_interval 10
   notification_period 24x7
   notification_options w,u,c,r
   check_command    check_tcp!80
   }
define service {
   host_name          192.168.4.169
   service_description check_tcp 8080
   check_period       24x7
   max_check_attempts 4
   normal_check_interval 3
   retry_check_interval 2
   contact_groups    sagroup
   notification_interval 10
   notification_period 24x7
   notification_options w,u,c,r
   check_command    check_tcp!8080
   }
define service{
   host_name             192.168.4.169
   service_description check-disk
   check_command       check_nrpe!check_df
   max_check_attempts    4
   normal_check_interval 3
   retry_check_interval 2
   check_period          24x7
   notification_interval 10
   notification_period 24x7
   notification_options w,u,c,r
   contact_groups       sagroup
   }
define service{
   host_name             192.168.4.169
   service_description check-load
   check_command       check_nrpe!check_load
   max_check_attempts    4
   normal_check_interval 3
   retry_check_interval 2
   check_period          24x7
   notification_interval 10
   notification_period 24x7
   notification_options w,u,c,r
   contact_groups       sagroup
   }
define service{
   host_name             192.168.4.169
   service_description total_procs
   check_command       check_nrpe!check_total_procs
   max_check_attempts    4
   normal_check_interval 3
   retry_check_interval 2
   check_period          24x7
   notification_interval 10
   notification_period 24x7
   notification_options w,u,c,r
   contact_groups       sagroup
}
define service{
   host_name             192.168.4.169
   service_description Current Users
   check_command       check_nrpe!check_users
   max_check_attempts    4
   normal_check_interval 3
   retry_check_interval 2
   check_period          24x7
   notification_interval 10
   notification_period 24x7
   notification_options w,u,c,r
   contact_groups       sagroup
}
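(3) Start Nagios and Apache
# service nagios start
# /usr/local/apache2/bin/apachectl restart
(4) Open http://192.168.4.168/nagios/ to see the newly added host.
(5) To monitor other ports, follow the port 80 service definition; other hosts are added the same way as above.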
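Because the port checks differ only in host and port, the blocks can be generated rather than copied by hand. The function below is a sketch that reuses the values from the port 80 definition above; the function name and the example port 3306 are illustrative, not something from my setup.

```shell
# Print a check_tcp service definition for one host and port, using the
# same intervals and contact group as the port 80 example in this post.
print_tcp_service() {
    host="$1"
    port="$2"
    cat <<EOF
define service {
   host_name              $host
   service_description    check_tcp $port
   check_period           24x7
   max_check_attempts     4
   normal_check_interval  3
   retry_check_interval   2
   contact_groups         sagroup
   notification_interval  10
   notification_period    24x7
   notification_options   w,u,c,r
   check_command          check_tcp!$port
   }
EOF
}

# Example: append a definition, then re-run the nagios -v check.
# print_tcp_service 192.168.4.169 3306 >> /usr/local/nagios/etc/services.cfg
```

The host name passed in must already be defined in hosts.cfg, as noted earlier.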
