Exadata Tips and Tricks (Part 2)

Posted by zhuyumu on 2017-6-25 06:19:09

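1. Exadata Hardware
1.1 General
　　Default passwords for the Exadata cell nodes, DB nodes, InfiniBand switches, and other components:
　　Component          | Login          | Default password
　　Storage Cells      | root, nm2user  | welcome1
　　InfiniBand Switch  | root, nm2user  | welcome1, changeme
　　DB nodes           | root           | welcome1
　　CellCLI            | celladmin      | welcome
　　ILOM               | root           | welcome1
　　KVM Switch         | Admin or none  | <none>
　　GigE switch        | <none>         | <none>
　　After initial installation, the asmsnmp account password is usually also welcome1.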
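1.2 Routine hardware inspection
　　During routine machine-room checks, look at the rear of the Exadata rack for amber warning LEDs. If any are lit, record the
　　location and promptly log in to OEM, ILOM, or an integrated third-party monitoring tool to find the cause, locate the component, and service it.
　　For exachk, the Exadata health-check script, see document 1070954.1.
　　Check whether the hardware and firmware versions on the Exadata Database Machine match:
　　/opt/oracle.SupportTools/CheckHWnFWProfile
　　The following result indicates that the versions match:
　　 The hardware and firmware profile matches one of the supported profile
　　Check whether the software version matches the platform:
　　/opt/oracle.SupportTools/CheckSWProfile.sh -c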
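The hardware/firmware check above can be scripted by testing for the success message. The following is a minimal sketch, not an official Oracle tool: the function name profile_matches is hypothetical, and it assumes the success text is exactly the fragment quoted in this post.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit status 0) when the text on stdin
# contains the success message quoted in this post.
profile_matches() {
    grep -q 'matches one of the supported profile'
}

# Possible usage on a database or cell node (not executed here):
#   /opt/oracle.SupportTools/CheckHWnFWProfile | profile_matches \
#       && echo "HW/FW profile OK" || echo "HW/FW profile needs review"
```

Because the function only reads stdin, it can be tried against saved output before being run on a live node.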
1.3 Enabling email alerts on a cell
　　ALTER CELL smtpServer='mailserver.maildomain.com', -
　　smtpFromAddr='firstname.lastname@maildomain.com', -
　　smtpToAddr='firstname.lastname@maildomain.com', -
　　smtpFrom='Exadata cell', -
　　smtpPort='<port for mail server>', -
　　smtpUseSSL='TRUE', -
　　notificationPolicy='critical,warning,clear', -
　　notificationMethod='mail';
　　alter cell validate mail;
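1.4 Monitoring disk failures
　　When a routine machine-room check finds an amber hardware warning LED, or a monitoring tool (command line, ILOM, or a third-party tool) reports a
　　fault, and the location has been confirmed, the part can be replaced.
1.5 Replacing a Storage Cell hard disk
　　Log in to the cell from the command line and identify the failed disk, for example:
　　CellCLI> LIST PHYSICALDISK WHERE diskType=HardDisk AND status=critical DETAIL
1.6 Checking Database Server disk status
　　# cd /opt/MegaRAID/MegaCli/
　　# ./MegaCli64 -Pdlist -aAll | grep "Slot\|Firmware"
　　If a damaged disk is found on Exadata:
　　Use /opt/oracle.SupportTools/sundiag.sh to collect detailed information and send it to Oracle Support.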
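To turn the MegaCli disk listing into a pass/fail check, the slot and firmware-state lines can be paired up. This is a rough sketch under assumptions: the function name unhealthy_slots is hypothetical, and the "Slot Number: N" and "Firmware state: ..." line formats are assumed from typical MegaCli64 -PdList output, which may differ between firmware versions.

```shell
#!/bin/bash
# Hypothetical helper: reads MegaCli64 -PdList output on stdin and prints
# every slot whose firmware state does not start with "Online".
# Assumed line formats: "Slot Number: 3" and "Firmware state: Online, Spun Up".
unhealthy_slots() {
    awk -F': *' '
        /^Slot Number/    { slot = $2 }
        /^Firmware state/ { if ($2 !~ /^Online/) print "slot " slot ": " $2 }
    '
}

# Possible usage on a database server (not executed here):
#   /opt/MegaRAID/MegaCli/MegaCli64 -PdList -aAll | unhealthy_slots
```

An empty result means every disk reported an Online state; anything printed is a candidate for the sundiag.sh collection step.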
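1.7 Checking Database Server RAID status
　　# ./MegaCli64 -LdInfo -lAll -aAll
1.8 Starting a Storage Cell
　　Log in remotely to the Storage Cell's ILOM and run Power On. The rest of the startup is automatic; wait until the Storage Cell is ready.
　　CellCLI> LIST GRIDDISK
　　If the grid disks are not active, run:
　　CellCLI> ALTER GRIDDISK ALL ACTIVE
　　Once the grid disks are active, ASM resynchronizes automatically and brings them online. Check the status:
　　CellCLI> LIST GRIDDISK ATTRIBUTES name, asmmodestatus
　　Confirm whether the automatic ASM rebalance has started or finished. Log in to the +ASM instance as the grid user and run:
　　select * from v$asm_operation;
　　Check EM, syslog, CellCLI, and ILOM for alert-cleared messages.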
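Waiting for the grid disks to come back online can be checked by filtering the LIST GRIDDISK ATTRIBUTES output. A minimal sketch, assuming the output is one whitespace-separated "name status" pair per line; the function name griddisks_not_online is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reads "name asmmodestatus" pairs on stdin and prints
# every grid disk whose ASM mode status is anything other than ONLINE.
griddisks_not_online() {
    awk 'NF >= 2 && $2 != "ONLINE" { print $1 " " $2 }'
}

# Possible usage on a storage cell (not executed here):
#   cellcli -e "LIST GRIDDISK ATTRIBUTES name, asmmodestatus" | griddisks_not_online
```

When the function prints nothing, all listed grid disks report ONLINE and the v$asm_operation query can be used to confirm the rebalance.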
1.9 Checking for memory ECC errors
　　ipmitool sel list | grep ECC | cut -f1 -d : | sort -u
1.10 Checking the cell server cache policy
　　cell08# MegaCli64 -LDInfo -Lall -aALL | grep 'Current Cache Policy'
　　Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
　　cell09# MegaCli64 -LDInfo -Lall -aALL | grep 'Current Cache Policy'
　　Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
　　Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
　　Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
　　Cache policy is in WB
　　Proactive battery replacement is recommended.
　　Example:
　　a. /opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -LALL -aALL ####( Lists the cache policy)
　　b. /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp -WB -LALL -aALL ####( Tries to change the policy from xx to WB)
　　The policy change to WB does not take effect immediately:
　　Set Write Policy to WriteBack on Adapter 0, VD 0 (target id: 0) success
　　Battery capacity is below the threshold value
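The comparison between the default and current cache policy shown above can be automated. This is only a sketch: the function name cache_policy_drift is hypothetical, and it assumes each "Default Cache Policy" line in the -LDInfo output is followed by the matching "Current Cache Policy" line, as in the sample in this post.

```shell
#!/bin/bash
# Hypothetical helper: reads MegaCli64 -LDInfo output on stdin and prints a
# line for each virtual drive whose current write policy differs from its
# default (for example WriteBack by default but WriteThrough currently).
cache_policy_drift() {
    awk -F': *' '
        /Default Cache Policy/ { split($2, d, ","); def = d[1] }
        /Current Cache Policy/ {
            split($2, c, ",")
            if (c[1] != def) print "default=" def " current=" c[1]
        }
    '
}

# Possible usage on a cell (not executed here):
#   /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL | cache_policy_drift
```

Any printed line is a hint to look at the BBU state, since the policy string itself says "No Write Cache if Bad BBU".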
　　Check the status of the cell's BBU (backup battery):
　　cell08# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -a0
　　BBU status for Adapter: 0
　　BatteryType: iBBU
　　Voltage: 4061 mV
　　Current: 0 mA
　　Temperature: 36 C
　　BBU Firmware Status:
　　Charging Status : None
　　Voltage : OK
　　Temperature : OK
　　Learn Cycle Requested : No
　　Learn Cycle Active : No
　　Learn Cycle Status : OK
　　Learn Cycle Timeout : No
　　I2c Errors Detected : No
　　Battery Pack Missing : No
　　Battery Replacement required : No
　　Remaining Capacity Low : Yes
　　Periodic Learn Required : No
　　Battery state:
　　GasGuageStatus:
　　Fully Discharged : No
　　Fully Charged : Yes
　　Discharging : Yes
　　Initialized : Yes
　　Remaining Time Alarm : No
　　Remaining Capacity Alarm: No
　　Discharge Terminated : No
　　Over Temperature : No
　　Charging Terminated : No
　　Over Charged : No
　　Relative State of Charge: 99 %
　　Charger System State: 49168
　　Charger System Ctrl: 0
　　Charging current: 0 mA
　　Absolute state of charge: 21 %
　　Max Error: 2 %
　　Exit Code: 0x00
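The BBU status output is long, and only a few flags usually matter for a quick check. The sketch below picks out three of the flags shown in the sample above and prints those set to Yes. The function name bbu_warnings is hypothetical, and the choice of which flags to treat as warnings is an assumption made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: reads MegaCli64 -AdpBbuCmd -GetBbuStatus output on
# stdin and prints the name of each selected flag whose value is "Yes".
bbu_warnings() {
    awk -F' *: *' '
        /Battery Replacement required|Remaining Capacity Low|Battery Pack Missing/ {
            if ($2 ~ /^Yes/) { sub(/^[ \t]+/, "", $1); print $1 }
        }
    '
}

# Possible usage on a cell (not executed here):
#   /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -a0 | bbu_warnings
```

For the sample output in this post, the only flag reported would be Remaining Capacity Low.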
　　Check BBU information on all cells in one pass:
　　dcli -g ~/cell_group -l root -t '{
　　uname -srm ; head -1 /etc/*release ; uptime | cut -d, -f1 ; imagehistory ;
　　ipmitool sunoem cli "show /SP system_description system_identifier" | grep = ;
　　ipmitool sunoem cli "show /SP/policy FLASH_ACCELERATOR_CARD_INSTALLED" ;
　　/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -a0 | egrep -i "BBU|Battery|Charge:|Fully|Low|Learn" ;
　　}' | tee /tmp/ExaInfo.log
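1.11 Shutting down Exadata
　　1. Confirm that no applications are connected, then log in to the first database server node as root
　　2. Stop the databases (see the RAC start/stop section of the RAC/ASM maintenance chapter)
　　3. Stop the cluster
　　# GRID_HOME/grid/bin/crsctl stop cluster -all
　　4. Shut down every database node except the local one
　　# dcli -l root -c dm01db02,dm01db03,dm01db04 shutdown -h -y now
　　5. Shut down the storage servers
　　The cell_group file can be edited by hand; it must be readable by root when the command runs
　　Also review the Storage Cell shutdown section of the Storage Cell maintenance chapter before running the command below
　　# dcli -l root -g cell_group shutdown -h -y now
　　6. Shut down the local node
　　# shutdown -h -y now
　　7. At this point the servers can be powered off remotely through ILOM
　　8. Power down the whole rack (switch off the PDUs)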
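1.12 Starting Exadata
　　1. Power on the rack (the switches power on automatically)
　　Turn on the PDU switches; all server indicator LEDs turn green and blink slowly
　　To power on a database server or storage server manually, press and hold its power button for 5 seconds.
　　Alternatively, click the cell's Power On control in ILOM and wait for the server LED to turn steady green, then click the DB Server's
　　Power On control and wait for its LED to turn steady green.
　　2. Check for amber warning LEDs.
　　3. Start the databases, applications, and so on.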
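2. InfiniBand
2.1 Starting and stopping the IB switch
　　1. Powering the InfiniBand switch on or off
　　The InfiniBand switch has redundant power supplies, each plugged into one of Exadata's two redundant PDUs, and it powers on or off together with
　　the rack PDUs. To shut down the InfiniBand switch on its own, disconnect both of its redundant power cords.
　　2. Check OEM and other tools for related alerts.
　　ILOM cannot raise this alert
　　It can be seen with list alerthistory in cellcli on cell1
　　3. Check the network topology status from db01
　　# cd /opt/oracle.SupportTools/ibdiagtools
　　# ./verify-topology -t halfrack
　　4. Plug the InfiniBand power cords back in and confirm that the InfiniBand switch starts normally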
2.2 Checking IB link status
　　# /opt/oracle.SupportTools/ibdiagtools/infinicheck -z
　　# /opt/oracle.SupportTools/ibdiagtools/infinicheck
2.3 Checking IB network topology
　　Log in to any Database Server and use the Exadata tool:
　　# cd /opt/oracle.SupportTools/ibdiagtools
　　# ./verify-topology -t halfrack
2.4 Diagnosing IB links
　　# ibdiagnet -c 1000 -r
2.5 Viewing IB network cabling
　　Log in to the InfiniBand switch ILOM as root and run listlinkup:
　　# listlinkup
　　Connector 0A Present <-> I4 Port 31 is up
2.6 Viewing IB health status
　　# showunhealthy
　　OK - No unhealthy sensors.
2.7 IB health check
　　env_test
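2.8 IB fault handling
　　1. Confirm that the IB switch configuration has been backed up
　　2. Confirm that every cable is labeled, then unplug the cables from the IB switch
　　3. Unplug both power cords to power the switch off
　　4. Remove the IB switch
　　5. Install the new IB switch
　　6. Restore the IB switch settings
　　7. Disable the Subnet Manager
　　disablesm
　　8. Connect the cables
　　9. Verify that the cables are connected correctly
　　/opt/oracle.SupportTools/ibdiagtools/verify-topology
　　10. Run the following command from any host to confirm that no link has errors
　　ibdiagnet -c 1000 -r
　　11. Enable the Subnet Manager
　　enablesm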
2.9 IB hardware monitoring
　　showunhealthy & checkpower
　　Switch port errors
　　ibqueryerrors.pl -s RcvSwRelayErrors,RcvRemotePhysErrors,XmtDiscards,XmtConstraintErrors,RcvConstraintErrors,ExcBufOverrunErrors,VL15Dropped
2.10 Link status
　　/usr/sbin/iblinkinfo.pl -Rl
2.11 Subnet manager
　　/usr/sbin/sminfo
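3. Cisco switch
3.1 Routine maintenance
　　Use the Cisco IOS command line: open a terminal and log in to the management port IP: telnet xxx.xxx.xxx.xxx
　　Enter the user name (root) and password (welcome1), then enter enable mode:
　　View the switch configuration with the show command:
　　dm01sw-ip#show running-config
　　Building configuration…
　　The output includes the switch host name, IP address, gateway address, IOS version, time zone, DNS configuration, NTP configuration, per-port network configuration, VLAN layout (the whole switch is a single VLAN), and so on.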
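3.2 Runtime monitoring
　　Monitor the switch according to the Cisco switch monitoring standards already in place.
　　Because the Cisco switch mainly serves the management network, a complete loss of access affects only management-network functions and does not affect normal operation of the business network.
　　After a fault, follow the existing Cisco switch fault-handling process, and verify that the switch host name, IP address, gateway address, IOS version, time zone, DNS configuration, NTP configuration, per-port network configuration, and VLAN layout (the whole switch is a single VLAN) are all configured correctly.
3.3 KVM
　　It can be monitored through the OEM GC plug-in.
3.4 PDU
　　Fault handling
　　A failure of a single power feed does not interrupt Exadata's continuous operation, but it must be reported and the unit replaced promptly (including its management IP and other settings), so that a failure of the
　　remaining backup PDU does not cause an abnormal Exadata shutdown.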
