Linux Monitoring: Nagios
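1 What is monitoring? Watching systems and acting on what you see.
2 What is monitored? Servers of every kind.
3 Which metrics? Network traffic (eth0, eth1), service state (running, stopped), hardware resources (CPU, memory, storage),
  and process state (total, running, sleeping, zombie)
# uptime
 09:15:02 up 15 min,  4 users,  load average: 0.01, 0.10, 0.08
(the larger the "up" value, the longer the server has stayed online, which is what you want)
# top (shows CPU usage)
4 How to monitor? Run commands by hand, write scripts, or use monitoring software.
Automated monitoring:
  scheduled task + monitoring script (chkconfig crond on)
  a dedicated monitoring server (software: Nagios, Cacti, Zabbix)
5 How are alerts delivered? Email, SMS, WeChat, instant messaging.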
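The "scheduled task + monitoring script" approach from item 4 can be sketched as below. This is only an illustration: the function names, the 80% limit, and the script path in the crontab comment are assumptions, not part of the original notes.

```shell
#!/bin/sh
# Hypothetical helper: print the used-space percentage of a mount point.
disk_used_percent() {
    # "df -P" prints one header line, then one line per filesystem;
    # column 5 is the capacity figure, e.g. "8%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: compare a usage figure with a limit and print one line.
report_usage() {
    mount_point=$1 used=$2 limit=$3
    if [ "$used" -gt "$limit" ]; then
        echo "ALERT: $mount_point is ${used}% full (limit ${limit}%)"
        return 1
    fi
    echo "OK: $mount_point is ${used}% full"
}

# Typical call inside the script:
#   report_usage /boot "$(disk_used_percent /boot)" 80
# Example crontab entry (every 5 minutes), assuming the script is saved
# as /usr/local/bin/disk_watch.sh:
#   */5 * * * * /usr/local/bin/disk_watch.sh
```

cron mails any output of the job to the owner of the crontab, so printing only when the limit is exceeded would turn this into a simple mail alert.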
++++++++++++++++++++++++++++++++
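Steps to configure a monitoring server:
1. Deploy the runtime environment: a web server (nginx or Apache) plus PHP    # yum -y install httpd php
2. Install the software that provides the service
 2.1 Prepare for installation
 2.2 Install the monitoring service and the monitoring plugins
 2.3 Edit the service configuration files
 2.4 Start the monitoring service
3. Configure monitoring
 3.1 Monitor remote servers
 3.2 Monitor the local host (a special case: a small company with a single web server runs the website and the monitoring service on the same machine, so the monitor watches the local website)
     Do not monitor the local swap partition
     Monitor usage of the boot partition
     Monitor the state of the FTP service
     Send alerts to the mailbox nagios@localhost
4. Configure alert notifications
5. View the monitoring data
   Set the user name and password for accessing the monitoring page (the user name cannot be chosen freely): nagiosadmin
How Nagios monitors: the running service calls plugins; each call can carry thresholds; the measured value is compared with those thresholds, and the result of the comparison determines the state that is displayed.
Threshold types: a warning value and a critical value (both chosen by the administrator)
States: OK (normal), WARNING, CRITICAL (serious error), UNKNOWN (configuration error), PENDING (check not yet completed)
A measured value below the warning threshold shows OK; above the warning threshold but below the critical threshold shows WARNING; above the critical threshold shows CRITICAL. (For metrics such as free disk space the comparison is reversed: the state is raised when the value falls below the threshold.)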
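The comparison described above can be sketched in a few lines of shell. The function name classify is hypothetical, and the sketch covers only "higher is worse" metrics such as logged-in users or process counts. The return values follow the plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL).

```shell
#!/bin/sh
# classify VALUE WARN CRIT
# Hypothetical helper mimicking a "higher is worse" plugin:
# prints the state name and returns the matching plugin exit code.
classify() {
    value=$1 warn=$2 crit=$3
    if [ "$value" -gt "$crit" ]; then
        echo "CRITICAL"
        return 2
    elif [ "$value" -gt "$warn" ]; then
        echo "WARNING"
        return 1
    fi
    echo "OK"
    return 0
}

# Example: 3 logged-in users against warning=1, critical=2
classify 3 1 2
```

The strict "greater than" matches the plugin transcripts later in these notes, where 100 processes against a warning threshold of 100 is still reported as OK.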
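Which local resources does Nagios monitor by default?
CPU load
Total number of logged-in users
State of the web service
Whether the host is up
Root partition usage
State of the sshd service
# ./check_users -h (every plugin provides help)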
_____________________________________________________
———————————————————————————————————————————————————-
Setting up the Nagios server (LAP environment): 192.168.4.99 is the monitoring server, the physical host acts as the client
————————————————————————————————————————————————————
_____________________________________________________
# yum -y install httpd php
# service httpd restart; chkconfig httpd on
Stopping httpd:                                            [FAILED]
Starting httpd: httpd: Could not reliably determine the server's fully qualified domain name, using 0.0.0.99 for ServerName
                                                           [  OK  ]
# echo 123 >/var/www/html/index.html
# yum -y install elinks
# elinks --dump http://localhost
123
# vim /var/www/html/test.php
# firefox http://192.168.4.99/test.php (or enter http://192.168.4.99/test.php in the browser's address bar)
# unzip nagios.zip
# cd nagios
# ls
nagios-3.2.1.tar.gz           nrpe-2.12.tar.gz
nagios-plugins-1.4.14.tar.gz  ntop-3.3.7.tar.gz
# yum -y install gcc gcc-c++
# useradd nagios (the user and group that the monitoring service runs as by default)
# groupadd nagcmd
# usermod -G nagcmd nagios
# tar -zxvf nagios-3.2.1.tar.gz
# cd nagios-3.2.1
# ./configure --help
# ./configure --with-nagios-user=nagios --with-nagios-group=nagcmd --with-command-user=nagios --with-command-group=nagcmd
# make all
# make install
# ls /usr/local/nagios
bin  libexec  sbin  share  var
# make install-init
/usr/bin/install -c -m 755 -d -o root -g root /etc/rc.d/init.d
/usr/bin/install -c -m 755 -o root -g root daemon-init /etc/rc.d/init.d/nagios
*** Init script installed ***
# ls /etc/rc.d/init.d/nagios (the init script is installed here)
/etc/rc.d/init.d/nagios
# make install-commandmode
/usr/bin/install -c -m 775 -o nagios -g nagcmd -d /usr/local/nagios/var/rw
chmod g+s /usr/local/nagios/var/rw
*** External command directory configured ***
# make install-config
# ls /usr/local/nagios/etc
cgi.cfg  nagios.cfg  objects  resource.cfg
# make install-webconf
# ll /etc/rc.d/init.d/nagios (the init script is here)
-rwxr-xr-x. 1 root root 5178 Mar  9 02:00 /etc/rc.d/init.d/nagios
# ll /etc/init.d/nagios (the same script is also reachable under this path)
-rwxr-xr-x. 1 root root 5178 Mar  9 02:00 /etc/init.d/nagios
# /etc/init.d/nagios status (check the service state)
No lock file found in /usr/local/nagios/var/nagios.lock
# /etc/init.d/nagios start (start the service)
Starting nagios: done.
# /etc/init.d/nagios status (check again)
nagios (pid 6807) is running.
# http://192.168.4.99/nagios (open in the browser)
# service httpd restart
# http://192.168.4.99/nagios (open in the browser and choose Services)
# vim /etc/httpd/conf.d/nagios.conf
AuthUserFile /usr/local/nagios/etc/htpasswd.users (the file that holds the authenticated users)
# ls /usr/local/nagios/etc/htpasswd.users (the file does not exist yet and must be created)
ls: cannot access /usr/local/nagios/etc/htpasswd.users: No such file or directory
# which htpasswd
/usr/bin/htpasswd
# rpm -qf /usr/bin/htpasswd
httpd-tools-2.2.15-45.el6.x86_64
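# htpasswd -h (show help)
# htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin (create the web login user and write its password to the file)
New password:
Re-type new password:
Adding password for user nagiosadmin
# ls /usr/local/nagios/etc/htpasswd.users (the file now exists)
/usr/local/nagios/etc/htpasswd.users
# cat /usr/local/nagios/etc/htpasswd.users (shows the user name and the password hash)
nagiosadmin:SdiDPECEPUFkM
# http://192.168.4.99/nagios (open in the browser)
Enter the user name and password
Install the plugins (monitoring is viewed in the graphical interface; the command-line steps come first):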
# cd /usr/local/nagios/
# tar -zxvf nagios-plugins-1.4.14.tar.gz
# cd nagios-plugins-1.4.14
# ./configure && make && make install
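# ls /usr/local/nagios/libexec/ (check that the plugins were installed)
Refresh the browser, click Services again, and wait a moment
Using the monitoring plugins
/usr/local/nagios/libexec/<plugin name> -h
# ./check_users -w 1 -c 2
USERS CRITICAL - 3 users currently logged in |users=3;1;2;0 (CRITICAL, because 3 users exceeds the critical threshold of 2)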
# ./check_http -h
# ./check_http -I 192.168.4.254
HTTP WARNING: HTTP/1.1 403 Forbidden - 5159 bytes in 0.047 second response time |time=0.047352s;;;0.000000 size=5159B;;;0
# ./check_http -I 192.168.4.254 -p 80
HTTP WARNING: HTTP/1.1 403 Forbidden - 5159 bytes in 0.002 second response time |time=0.001934s;;;0.000000 size=5159B;;;0
# ./check_http -I 192.168.4.254 -p 80
Connection refused
HTTP CRITICAL - Unable to open TCP socket
# ./check_http -I localhost -p 80
HTTP OK: HTTP/1.1 200 OK - 271 bytes in 0.001 second response time |time=0.001469s;;;0.000000 size=271B;;;0
# ./check_ping -H 192.168.4.253 -w 10,50% -c 10,60%
PING OK - Packet loss = 0%, RTA = 0.15 ms|rta=0.152000ms;10.000000;10.000000;0.000000 pl=0%;50;60;0
# ./check_disk -h
# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
                       47G  1.6G   43G   4% /
tmpfs                 499M     0  499M   0% /dev/shm
/dev/vda1             477M   36M  416M   8% /boot
# ./check_disk -w 50% -c 40% -p /
DISK OK - free space: / 43791 MB (96% inode=98%);| /=1560MB;23893;28671;0;47786
# dd if=/dev/zero of=/boot/test.txt bs=1M count=400
400+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 7.00428 s, 59.9 MB/s
# ./check_disk -w 50% -c 40% -p /boot
DISK CRITICAL - free space: /boot 15 MB (3% inode=99%);| /boot=435MB;238;285;0;476
# ./check_disk -w 50% -c 40% -p /dev/shm
DISK OK - free space: /dev/shm 498 MB (100% inode=99%);| /dev/shm=0MB;249;298;0;498
# ./check_ssh -p 22 localhost
SSH OK - OpenSSH_5.3 (protocol 2.0)
# ./check_ssh 192.168.4.254
SSH OK - OpenSSH_5.3 (protocol 2.0)
# ./check_swap
# ./check_procs -h
# ./check_procs -w 100 -c 110
PROCS OK: 100 processes
# ./check_procs -w 90 -c 110
PROCS WARNING: 100 processes
# ./check_procs -w 90 -c 95
PROCS CRITICAL: 100 processes
# ./check_procs -w 60 -c 65 -s R
PROCS OK: 0 processes with STATE = R
# ./check_procs -w 60 -c 65 -s Z
PROCS OK: 0 processes with STATE = Z
# ./check_procs -w 60 -c 65 -s ZR
PROCS OK: 0 processes with STATE = ZR
Some services have no dedicated plugin; check them by port number instead (all of these use TCP):
# ./check_tcp -H 192.168.4.254 -p 80
TCP OK - 0.000 second response time on port 80|time=0.000351s;;;0.000000;10.000000
# ./check_tcp -H 192.168.4.254 -p 25
Connection refused
# ./check_tcp -H 192.168.4.254 -p 22
TCP OK - 0.000 second response time on port 22|time=0.000243s;;;0.000000;10.000000
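Most of the plugin output lines above end with performance data after the "|" character, in the form label=value;warn;crit;min (for example users=3;1;2;0). A rough sketch of pulling the first metric apart in shell follows; the function name parse_perfdata and its output format are assumptions made for illustration.

```shell
#!/bin/sh
# parse_perfdata LINE
# Hypothetical helper: prints label, value, warning and critical thresholds
# of the first metric found after "|" in one line of plugin output.
parse_perfdata() {
    line=$1
    case $line in
        *'|'*) ;;
        *) echo "no performance data" >&2; return 1 ;;
    esac
    perf=${line#*'|'}            # everything after the first "|"
    set -f                       # no globbing while splitting words
    set -- $perf                 # split on whitespace; $1 is the first metric
    first=$1
    label=${first%%=*}           # text before "="
    fields=${first#*=}           # value;warn;crit;min...
    old_ifs=$IFS
    IFS=';'
    set -- $fields               # split on ";" (empty fields are kept)
    IFS=$old_ifs
    set +f
    echo "label=$label value=$1 warn=$2 crit=$3"
}

parse_perfdata 'USERS CRITICAL - 3 users currently logged in |users=3;1;2;0'
```

Empty threshold fields, as in the check_http output (time=0.047352s;;;0.000000), simply come out empty.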
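____Edit the service configuration files to add your own monitored services (start with services on the local host)_____________________
# ls /usr/local/nagios
bin  etc  libexec  sbin  share  var
libexec: plugins
bin: executables
etc: configuration files
sbin: CGI programs (the pages shown when you click through the monitoring site)
var: variable data such as logs and caches
share: static files for the web interface (HTML pages, images)
The Nagios configuration files:
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg (check the configuration for errors)
# vim /usr/local/nagios/etc/nagios.cfg
log_file=/usr/local/nagios/var/nagios.log
cfg_file=/usr/local/nagios/etc/objects/commands.cfg (the monitoring commands, i.e. the plugins the service runs)
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg (the mailbox that receives alerts)
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg (time period templates)
cfg_file=/usr/local/nagios/etc/objects/templates.cfg (host and service templates)
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg (monitors the local host; that is decided by the definitions inside the file, not by the file name, which is arbitrary)
/usr/local/nagios/etc/resource.cfg holds macro definitions such as $USER1$ (the plugin directory)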
_______________________
1)
# vim /usr/local/nagios/etc/objects/commands.cfg (define the plugin commands used by the new services)
define command {
command_name monitor_localhost_boot
command_line /usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /boot
}
(Define only one of the block above and the block below: the one above hard-codes the thresholds, the one below takes them as arguments.)
define command {
command_name monitor_localhost_boot
command_line /usr/local/nagios/libexec/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
}
define command {
command_name monitor_localhost_ftp
command_line $USER1$/check_ftp -H localhost -p 21
}
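2) Set the mailbox that receives alerts
# /etc/init.d/postfix status
master (pid  1766) is running...
# vim /usr/local/nagios/etc/objects/contacts.cfg (set the mailbox that receives alerts; the mail service must be running first, and if you use an external address such as 163.com, confirm that it can actually receive the mail)
3) The time period template file
# vim /usr/local/nagios/etc/objects/timeperiods.cfg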
define timeperiod{
        timeperiod_name 24x7
        alias           24 Hours A Day, 7 Days A Week
        sunday          00:00-24:00
        monday          00:00-24:00
        tuesday         00:00-24:00
        wednesday       00:00-24:00
        thursday        00:00-24:00
        friday          00:00-24:00
        saturday        00:00-24:00
        }


# 'workhours' timeperiod definition
define timeperiod{
        timeperiod_name workhours
        alias           Normal Work Hours
        monday          09:00-17:00
        tuesday         09:00-17:00
        wednesday       09:00-17:00
        thursday        09:00-17:00
        friday          09:00-17:00
        }
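4) The host and service template file
# vim /usr/local/nagios/etc/objects/templates.cfg
5) Append the new local services to the local-host configuration file (the file monitors the local host because of the definitions inside it, not because of its name; the name is arbitrary). Add them at the bottom.
# vim /usr/local/nagios/etc/objects/localhost.cfg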
define host{
use linux-server ; Name of host template to use
; This host definition will inherit all variables that are defined
; in (or inherited by) the linux-server host template definition.
host_name localhost
alias localhost
address 127.0.0.1
}
define service{
use local-service ; Name of service template to use
host_name localhost
service_description PING
check_command check_ping!100.0,20%!500.0,60%
}
.........................
.........................
##################################myset###################################
define service{
use local-service ; Name of service template to use
host_name localhost
service_description boot
check_command monitor_localhost_boot!20%!10%!/boot ; fills $ARG1$, $ARG2$, $ARG3$: warning below 20% free, critical below 10% free
notifications_enabled 0
}
define service{
use local-service ; Name of service template to use
host_name localhost
service_description ftp
check_command monitor_localhost_ftp
notifications_enabled 0
}
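The "!"-separated values in check_command are handed to the command definition as $ARG1$, $ARG2$, and so on. The substitution can be imitated in shell as a rough illustration. The two function names are hypothetical, and Nagios itself performs this expansion internally rather than through any script like this one.

```shell
#!/bin/sh
# replace_once STRING NEEDLE REPLACEMENT
# Hypothetical helper: replace the first literal occurrence of NEEDLE.
replace_once() {
    case $1 in
        *"$2"*) printf '%s' "${1%%"$2"*}$3${1#*"$2"}" ;;
        *)      printf '%s' "$1" ;;
    esac
}

# expand_check_command COMMAND_LINE CHECK_COMMAND
# Hypothetical helper: field 0 of CHECK_COMMAND is the command name,
# fields 1..n replace $ARG1$..$ARGn$ in COMMAND_LINE.
expand_check_command() {
    template=$1
    rest=$2
    n=0
    while :; do
        case $rest in
            *'!'*) field=${rest%%!*}; rest=${rest#*!}; more=1 ;;
            *)     field=$rest; more=0 ;;
        esac
        if [ "$n" -gt 0 ]; then
            template=$(replace_once "$template" "\$ARG${n}\$" "$field")
        fi
        n=$((n + 1))
        [ "$more" -eq 1 ] || break
    done
    printf '%s\n' "$template"
}

expand_check_command \
    '$USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$' \
    'monitor_localhost_boot!20%!10%!/boot'
```

The sketch leaves $USER1$ untouched; in a real installation that macro comes from resource.cfg and points at the plugin directory.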
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg (check the configuration files for syntax errors)
6) Set the user allowed to access the CGI pages
# cd /usr/local/nagios/sbin/
# ls
avail.cgi   extinfo.cgi        outages.cgi  statuswml.cgi  tac.cgi
cmd.cgi     history.cgi        showlog.cgi  statuswrl.cgi
config.cgi  notifications.cgi  status.cgi   summary.cgi
# vim /usr/local/nagios/etc/cgi.cfg (set the user allowed to access the CGI pages)
authorized_for_system_information=nagiosadmin (the login user name)
# http://192.168.4.99/nagios (in the browser, click Services again and refresh; the new services appear after a few minutes)
————————————————————————————————————————————————————————————————
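—————————Edit the configuration files to add a monitored remote host———————————————————————————————————————
Monitoring a remote host (192.168.4.98):
public data (service state): website, FTP, database, sshd
private data (disk, processes, users)
Monitoring public data on the remote host:
# /usr/local/nagios/libexec/check_http -H 192.168.4.98 -p 80
Connection refused
HTTP CRITICAL - Unable to open TCP socket
# vim /usr/local/nagios/etc/nagios.cfg (1. list the file that describes the monitored server in the main Nagios configuration file; the service loads it at startup)
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg (the local host)
cfg_file=/usr/local/nagios/etc/objects/otherser.cfg (the remote host)
# vim /usr/local/nagios/etc/objects/otherser.cfg (2. write the definitions that say which services to monitor on the remote host)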
define host{
use linux-server
host_name server98
address 192.168.4.98
}
define service{
use local-service
host_name server98
service_description httpd
check_command monitor_server98_httpd
}
define service{
use local-service
host_name server98
service_description sshd
check_command monitor_server98_sshd
}
# vim /usr/local/nagios/etc/objects/commands.cfg (3. define the plugin commands used to monitor the remote services)
..................................
..................................
######################monitor##################################################
define command {
command_name monitor_server98_httpd
command_line $USER1$/check_http -H 192.168.4.98 -p 80
}
define command {
command_name monitor_server98_sshd
command_line $USER1$/check_ssh -p 22 192.168.4.98
}
define command {
command_name monitor_server98_ftp
command_line $USER1$/check_ftp -H 192.168.4.98 -p 21
}
Test: the web page now shows the remote host as well as the local one.
# firefox http://192.168.4.99/nagios
# hostname localhost (log out and back in; a numeric host name interferes with receiving mail)
# /etc/init.d/nagios restart
# mail -u nagios
"/var/mail/nagios": 2 messages 2 unread
>U  1 nagios@localhost.loc  Thu Mar  9 08:44  32/924   "** PROBLEM Service Alert: server98/httpd is CRITICAL **"
 U  2 nagios@localhost.loc  Thu Mar  9 08:47  32/894   "** PROBLEM Service Alert: server98/sshd is CRITICAL **"
————————————————————————————————————————————————————————————————————————————————
Monitoring private data on the remote host (disk, processes, users)
Process counts: total, running, sleeping, zombie
1. Install the monitoring plugins on the monitored host (192.168.4.98)
# yum -y install gcc gcc-c++
# tar -zxvf nagios-plugins-1.4.14.tar.gz
# cd nagios-plugins-1.4.14
# ./configure
# make
# make install
# ls /usr/local/nagios/libexec/
# /usr/local/nagios/libexec/check_users -w 3 -c 5
USERS OK - 2 users currently logged in |users=2;3;5;0
# /usr/local/nagios/libexec/check_procs -w 50 -c 60 -s Z
PROCS OK: 0 processes with STATE = Z
# /usr/local/nagios/libexec/check_procs -w 50 -c 60 -s R
PROCS OK: 0 processes with STATE = R
# /usr/local/nagios/libexec/check_procs -w 50 -c 60 (all processes)
PROCS CRITICAL: 103 processes
# /usr/local/nagios/libexec/check_disk -w 50% -c 30% -p /
DISK OK - free space: / 42137 MB (92% inode=97%);| /=3215MB;23893;33450;0;47786
2. Run the NRPE service on the monitored host
# useradd nagios
# rpm -q openssl openssl-devel
# tar -zxvf nrpe-2.12.tar.gz
# cd nrpe-2.12
# ./configure
# make all
# make install-plugin
# make install-daemon
# make install-daemon-config
# make install-xinetd
# vim /etc/xinetd.d/nrpe
only_from = 127.0.0.1 172.40.50.99 (hosts allowed to connect, separated by spaces)
:wq
# vim /etc/services
nrpe            5666/tcp                # NRPE
:wq
# yum -y install xinetd
# service xinetd start
# chkconfig xinetd on
# netstat -utnalp | grep :5666
3. Edit the main NRPE configuration file to define how local private data is collected (192.168.4.98)
# vim /usr/local/nagios/etc/nrpe.cfg
# command[command_name]=plugin command line to run on this host
command[check_nrpe_users]=/usr/local/nagios/libexec/check_users -w 2 -c 5
command[check_nrpe_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_nrpe_boot]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vda1
command[check_nrpe_root]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /
command[check_nrpe_zombie_proc]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_nrpe_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
:wq
# /etc/init.d/xinetd restart
Verify the NRPE commands (192.168.4.98)
# /usr/local/nagios/libexec/check_nrpe -H localhost -c check_nrpe_root
4. Configure the monitoring server (192.168.4.99)
# /usr/local/nagios/libexec/check_nrpe -H 192.168.4.98 -p 5666 -c check_nrpe_users (fails with "no such file or directory": check_nrpe is not installed on this machine yet)
1) Install the dependencies
# yum -y install openssl openssl-devel
2) Build and install the check_nrpe plugin
# tar -zxvf nrpe-2.12.tar.gz
# cd nrpe-2.12
# ./configure
# make all
# make install-plugin
# ls /usr/local/nagios/libexec/check_nrpe
3) Test from the command line
# /usr/local/nagios/libexec/check_nrpe -H 192.168.4.98 -p 5666 -c check_nrpe_users
USERS OK - 2 users currently logged in |users=2;2;5;0
# /usr/local/nagios/libexec/check_nrpe -H 192.168.4.98 -p 5666 -c check_nrpe_load
OK - load average: 0.00, 0.00, 0.00|load1=0.000;15.000;30.000;0; load5=0.000;10.000;25.000;0; load15=0.000;5.000;20.000;0;
4) Wrap the plugin that connects to the NRPE service in command definitions that Nagios can use
# vim /usr/local/nagios/etc/objects/commands.cfg
...............................
define command {
command_name monitor_server98_boot
command_line $USER1$/check_nrpe -H 192.168.4.98 -p 5666 -c check_nrpe_boot
}
define command {
command_name monitor_server98_load
command_line $USER1$/check_nrpe -H 192.168.4.98 -p 5666 -c check_nrpe_load
}
define command {
command_name monitor_server98_users
command_line $USER1$/check_nrpe -H 192.168.4.98 -p 5666 -c check_nrpe_users
}
define command {
command_name monitor_server98_root
command_line $USER1$/check_nrpe -H 192.168.4.98 -p 5666 -c check_nrpe_root
}
define command {
command_name monitor_server98_zombie
command_line $USER1$/check_nrpe -H 192.168.4.98 -p 5666 -c check_nrpe_zombie_proc
}
define command {
command_name monitor_server98_total_procs
command_line $USER1$/check_nrpe -H 192.168.4.98 -p 5666 -c check_nrpe_total_procs
}
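The command definitions above differ only in two names, so they could also be generated instead of typed by hand. The sketch below is one possible way to do that. The function name nrpe_command_def is hypothetical, and it only fits entries whose command_name suffix equals the NRPE command suffix, so the zombie entry, which maps to check_nrpe_zombie_proc, would still be written manually.

```shell
#!/bin/sh
# nrpe_command_def HOST_LABEL ADDRESS NAME
# Hypothetical helper: prints one "define command" block in the style used
# above, mapping monitor_<HOST_LABEL>_<NAME> to the NRPE command
# check_nrpe_<NAME> on ADDRESS (port 5666, as in these notes).
nrpe_command_def() {
    host_label=$1 address=$2 name=$3
    printf 'define command {\n'
    printf 'command_name monitor_%s_%s\n' "$host_label" "$name"
    printf 'command_line $USER1$/check_nrpe -H %s -p 5666 -c check_nrpe_%s\n' "$address" "$name"
    printf '}\n'
}

# Print the blocks whose names follow the pattern; redirect the output
# to a file and review it before adding it to commands.cfg.
for name in boot load users root total_procs; do
    nrpe_command_def server98 192.168.4.98 "$name"
done
```

Generating the blocks from one template avoids copy-and-paste slips such as pointing the users command at the load check.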
5) Use the new monitoring commands in the remote host's configuration file
# vim /usr/local/nagios/etc/objects/otherser.cfg
#################################private###########################
define service{
use local-service
host_name server98
service_description users
check_command monitor_server98_users
}
define service{
use local-service
host_name server98
service_description root
check_command monitor_server98_root
}
define service{
use local-service
host_name server98
service_description total_procs
check_command monitor_server98_total_procs
}
define service{
use local-service
host_name server98
service_description boot
check_command monitor_server98_boot
}
define service{
use local-service
host_name server98
service_description load
check_command monitor_server98_load
}
define service{
use local-service
host_name server98
service_description zombie
check_command monitor_server98_zombie
}
6) Check the configuration files for syntax errors
# alias plj='/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg'
# plj
7) Restart the monitoring service
# /etc/init.d/nagios restart
8) View the monitoring data in the web interface