Nagios Remote Monitoring Software: Installation and Configuration Explained (3)

Posted by chunjihong on 2017-4-20 09:52:54

2. Adding new configuration files
Start by creating a simple configuration file, timeperiods.cfg, with the following contents:

define timeperiod{
timeperiod_name 24x7
alias       24 Hours A Day, 7 Days A Week
sunday       00:00-24:00
monday       00:00-24:00
tuesday       00:00-24:00
wednesday    00:00-24:00
thursday    00:00-24:00
friday       00:00-24:00
saturday    00:00-24:00
}

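This definition is self-explanatory, so I will not comment on it further. I do recommend monitoring around the clock (24x7).
The second hand-written configuration file is contacts.cfg, in the following format: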

define contact {
    contact_name                    sa    ; no spaces in the name
    alias                           system administrator
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,u,r
    # the notification commands below are read from miscommands.cfg
    service_notification_commands   service-notify-by-sms,service-notify-by-email
    host_notification_commands      host-notify-by-email,host-notify-by-sms
    email                           sery@163.com
    pager                           13333333333    ; mobile number that receives SMS alerts
    # do not leave out the closing brace below
}
define contact {
    contact_name                    sery
    alias                           system administrator
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,u,r
    service_notification_commands   service-notify-by-sms,service-notify-by-email
    host_notification_commands      host-notify-by-email,host-notify-by-sms
    email                           sery@sohu.com
    pager                           13312345678
}

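The file above defines two contacts; to add more, append further definitions in the same format. The letters used in service_notification_options and host_notification_options mean the following. For services: w = warning, u = unknown, c = critical, r = recovery. For hosts: d = down, u = unreachable, r = recovery. Note that the host and service options differ slightly.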
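The option letters are easy to mix up between hosts and services, so here is a minimal Python sketch that expands an option string into readable state names. The function name describe_options and both lookup tables are hypothetical helpers written for this post; the tables cover only the letters discussed here, and Nagios accepts other letters that this sketch would reject.

```python
# Only the letters discussed in this article; this is not the full Nagios list.
SERVICE_OPTIONS = {"w": "warning", "u": "unknown", "c": "critical", "r": "recovery"}
HOST_OPTIONS = {"d": "down", "u": "unreachable", "r": "recovery"}


def describe_options(kind, options):
    """Expand a string such as 'd,u,r' into a list of state names.

    kind is 'service' or 'host' (a hypothetical convention for this sketch).
    """
    table = SERVICE_OPTIONS if kind == "service" else HOST_OPTIONS
    names = []
    for letter in options.split(","):
        letter = letter.strip()
        if letter not in table:
            raise ValueError("option %r is not covered for %s" % (letter, kind))
        names.append(table[letter])
    return names


if __name__ == "__main__":
    print(describe_options("service", "w,u,c,r"))
    print(describe_options("host", "d,u,r"))
```

Note that "u" expands differently depending on the object type, which is the difference between host and service alerts mentioned above.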
The third hand-written configuration file is contactgroups.cfg. It builds on the previous file, contacts.cfg, and is somewhat simpler. Its format is as follows:

define contactgroup {
    contactgroup_name    sagroup    ; no spaces in the name
    alias                system administrator group
    members              sa,sery    ; two members in this example
}

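Members are separated by commas. To define more contact groups, append them to the file in the same format.
Now the key player finally takes the stage: the configuration file hosts.cfg. Below is the basic form of two hosts I defined: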

#define monitorhost
#################################################################
# Wangjing IDC servers                                        #
#################################################################
define host {
    host_name                nagios-server
    alias                    nagios server
    address                  61.x..x.49
    # separate multiple contact groups with commas; the names come from contactgroups.cfg
    contact_groups           sagroup
    check_command            check-host-alive
    max_check_attempts       5
    notification_interval    10    ; adjustable; find a suitable value by testing
    notification_period      24x7
    notification_options     d,u,r
}
define host {
    host_name                24-25
    alias                    server 24-25
    address                  202.X.24.25
    contact_groups           sagroup
    check_command            check-host-alive    ; send an alert when the host goes down
    max_check_attempts       5
    notification_interval    10
    notification_period      24x7
    notification_options     d,u,r
}

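Add further hosts one by one in the same format. A tip: if the hosts sit in a contiguous IP range, it is best to write a script that generates hosts.cfg. To make later maintenance easier, use readable comments in the file wherever possible (such as # Wangjing IDC servers # in this example).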
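As an illustration of the script idea, here is a minimal Python sketch that prints host definitions for a contiguous IP range. The function name render_hosts, the documentation address 192.0.2.25, and the comment text are assumptions made for this example; the directive values are copied from the sample hosts above, and the host name reuses the "third octet-fourth octet" pattern of host 24-25.

```python
import ipaddress

# Directive values are copied from the sample hosts.cfg shown above.
HOST_TEMPLATE = """define host {{
    host_name                {name}
    alias                    server {name}
    address                  {address}
    contact_groups           sagroup
    check_command            check-host-alive
    max_check_attempts       5
    notification_interval    10
    notification_period      24x7
    notification_options     d,u,r
}}
"""


def render_hosts(first_ip, count, comment):
    """Return host definitions for `count` consecutive addresses."""
    start = ipaddress.IPv4Address(first_ip)
    blocks = ["# " + comment]
    for offset in range(count):
        address = str(start + offset)
        octets = address.split(".")
        # Same naming scheme as the sample host "24-25".
        name = octets[2] + "-" + octets[3]
        blocks.append(HOST_TEMPLATE.format(name=name, address=address))
    return "\n".join(blocks)


if __name__ == "__main__":
    # 192.0.2.0/24 is a documentation range, used here as a placeholder.
    print(render_hosts("192.0.2.25", 3, "Example IDC servers"))
```

Redirect the output to a file and run the configuration check before putting it into service.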
Another heavyweight configuration file is services.cfg; without it, nothing gets monitored. Here is a sample:

#service definition
##############################################################
#Wangjing IDC servers service for host-live             #
##############################################################
define service {
    host_name                nagios-server    ; from hosts.cfg
    service_description      check-host-alive
    check_period             24x7
    max_check_attempts       4
    normal_check_interval    3
    retry_check_interval     2
    contact_groups           sagroup    ; from contactgroups.cfg
    notification_interval    10
    notification_period      24x7
    notification_options     w,u,c,r
    check_command            check-host-alive    ; check whether the host is alive
}
define service {
    host_name                74-210
    service_description      check_tcp 80
    check_period             24x7
    max_check_attempts       4
    normal_check_interval    3
    retry_check_interval     2
    contact_groups           sagroup
    notification_interval    10
    notification_period      24x7
    notification_options     w,u,c,r
    check_command            check_tcp!80    ; check whether the service on TCP port 80 is up
}

Note that check_tcp and the port to be monitored must be separated by "!". If there are many services, consider generating this file with a script as well.
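A service generator can follow the same pattern as the sample definitions. The Python sketch below is only an illustration: the function name render_tcp_services and the host-to-ports mapping are assumptions, while the directive values are taken from the sample services.cfg. It writes the port after the "!" separator in check_command.

```python
# Directive values are copied from the sample services.cfg shown above.
SERVICE_TEMPLATE = """define service {{
    host_name                {host}
    service_description      check_tcp {port}
    check_period             24x7
    max_check_attempts       4
    normal_check_interval    3
    retry_check_interval     2
    contact_groups           sagroup
    notification_interval    10
    notification_period      24x7
    notification_options     w,u,c,r
    check_command            check_tcp!{port}
}}
"""


def render_tcp_services(ports_by_host):
    """Return one check_tcp service definition per (host, port) pair."""
    blocks = []
    for host, ports in ports_by_host.items():
        for port in ports:
            blocks.append(SERVICE_TEMPLATE.format(host=host, port=port))
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical input: host 24-25 from hosts.cfg with two TCP ports.
    print(render_tcp_services({"24-25": [80, 25]}))
```

Every host name passed in must already exist in hosts.cfg, otherwise the configuration check will reject the service.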
The host group configuration file, hostgroups.cfg, is optional and builds on hosts.cfg. Its format is as follows:

define hostgroup {
    hostgroup_name    sa-servers
    alias             sa servers
    members           nagios-server,24-25,24-26    ; separate multiple hosts with commas
}

Additional host groups are appended one by one in the same format. A screenshot of a host group follows.
http://new.iyunv.com/files/uploadimg/20070604/1716283.gif
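Because every name listed under members has to match a host_name in hosts.cfg, a quick cross-check can catch a missing host before nagios does. The following Python sketch is a simplified illustration, not Nagios's own parser: the functions parse_objects and missing_members are hypothetical, and the regular expression only handles plain define blocks like the ones in this article.

```python
import re

# Matches "define <type> { ... }" blocks; good enough for simple files only.
DEFINE_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.S)


def parse_objects(text):
    """Return a list of (object_type, {directive: value}) pairs."""
    objects = []
    for kind, body in DEFINE_RE.findall(text):
        attrs = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop inline ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            attrs[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        objects.append((kind, attrs))
    return objects


def missing_members(hosts_text, hostgroups_text):
    """Map each host group name to the members that hosts.cfg does not define."""
    known = {a["host_name"] for k, a in parse_objects(hosts_text)
             if k == "host" and "host_name" in a}
    missing = {}
    for kind, attrs in parse_objects(hostgroups_text):
        if kind != "hostgroup":
            continue
        members = [m.strip() for m in attrs.get("members", "").split(",") if m.strip()]
        absent = [m for m in members if m not in known]
        if absent:
            missing[attrs.get("hostgroup_name", "?")] = absent
    return missing
```

Fed the two sample hosts and the sample host group from this article, the sketch reports 24-26 as missing, since only nagios-server and 24-25 are defined in the hosts.cfg excerpt.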
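After all that effort, the configuration files are finally written and saved, and by now I could hardly wait. Run /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg to check all the configuration files for correctness. If you are very lucky, the end of the output will read: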

Total Warnings: 0
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check

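If you see this, you are done. I was not so lucky and had to fix quite a few places before the check passed. Fortunately, the error report from this check is very useful (unlike the help documents of some systems, which look good but are of little use). Here is the output produced by an error I introduced deliberately: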

# bin/nagios -v etc/nagios.cfg
Nagios 2.5
Copyright (c) 1999-2006 Ethan Galstad (http://www.nagios.org)
Last Modified: 07-13-2006
License: GPL
Reading configuration data...
Error: Could not find any host matching 'nagios-server'
Error: Could not expand member hosts specified in hostgroup
(config file '/usr/local/nagios/etc/hostgroups.cfg', starting on line 2)
………………………

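It tells me where in the configuration files the error occurred (I had deliberately added a comment character in a configuration file to test this). Once verification passes, run /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg to start nagios as a daemon, then use ps aux | grep nagios to confirm that the process is running. At this point the nagios service is essentially configured. When writing hosts.cfg, services.cfg and the other files, a few simple habits reduce the chance of errors: for example, define only a few hosts and services first, and append more once the check passes.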
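If the check is run from a script, its output can be summarized automatically. The Python sketch below is an illustration under stated assumptions: the function name preflight_summary is hypothetical, and it relies only on the "Total Warnings:", "Total Errors:" and "Error:" lines visible in the sample output in this article.

```python
import re


def preflight_summary(output):
    """Summarize the text printed by the configuration check (nagios -v)."""
    warnings = re.search(r"Total Warnings:\s*(\d+)", output)
    errors = re.search(r"Total Errors:\s*(\d+)", output)
    error_lines = [line.strip() for line in output.splitlines()
                   if line.startswith("Error:")]
    return {
        "warnings": int(warnings.group(1)) if warnings else None,
        "errors": int(errors.group(1)) if errors else None,
        "error_lines": error_lines,
        # Treat the run as passing only when a zero error total was printed.
        "ok": errors is not None and int(errors.group(1)) == 0,
    }


if __name__ == "__main__":
    sample = (
        "Reading configuration data...\n"
        "Error: Could not find any host matching 'nagios-server'\n"
        "Error: Could not expand member hosts specified in hostgroup\n"
    )
    print(preflight_summary(sample))
```

The text itself could be captured with Python's subprocess.run and the same nagios -v command shown above; a wrapper script would then start the daemon only when "ok" is true.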
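Acceptance testing
In a browser, enter the IP address and directory of the server running nagios, for example http://61.135.X..X/nagios, then supply the user name and password required for authentication. You can then click the relevant links on the page to view the various status views. Stop a service on one of the monitored hosts, or unplug a server's network cable, wait a few minutes, and click the "Service Detail" link to see whether a conspicuous red alert appears on the page.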
http://new.iyunv.com/files/uploadimg/20070604/1716284.gif
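Shortly afterwards you should receive the alert SMS and alert email. Then restart all the services you stopped for the test, or plug the network cable back in. After a moment the red alert rows disappear from the web page, and an SMS or email announces the recovery. If that is what you see, the job really is done.
Nagios is very powerful. My requirements in this project were different, so I simplified nagios as much as possible and did not use features such as agents or additional plugins. On a network of no more than 1000 servers it works well. For more servers, I recommend using a MySQL database to manage the monitored objects. While deploying nagios I made trade-offs on many options; for more detail, please refer to the official documentation.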
