
[Experience share] heartbeat + pacemaker: automatic failover for PostgreSQL streaming replication (Part 2)

5. Testing
5.1 Standby node failure
  Kill the postgres processes on node2 to simulate a database crash on the standby node:
[root@node2 ~]# killall -9 postgres
  Check the cluster state at this point:
[root@node1 ~]# crm_mon -Afr1
============
Last updated: Mon Jan 27 08:36:49 2014
Stack: Heartbeat
Current DC: node1 (30b7dc95-25c5-40d7-b1e4-7eaf2d5cdf07) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============

Online: [ node1 node2 ]

Full list of resources:

vip-slave (ocf::heartbeat:IPaddr2): Started node1
Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started node1
     vip-rep (ocf::heartbeat:IPaddr2): Started node1
Master/Slave Set: msPostgresql
     Masters: [ node1 ]
     Stopped: [ pgsql:1 ]
Clone Set: clnPingCheck
     Started: [ node1 node2 ]

Node Attributes:
* Node node1:
    + default_ping_set                 : 100
    + master-pgsql:0                   : 1000
    + pgsql-data-status                : LATEST
    + pgsql-master-baseline            : 0000000010000000
    + pgsql-status                     : PRI
* Node node2:
    + default_ping_set                 : 100
    + master-pgsql:1                   : -INFINITY
    + pgsql-data-status                : DISCONNECT
    + pgsql-status                     : STOP

Migration summary:
* Node node1:
* Node node2:
   pgsql:1: migration-threshold=1 fail-count=1

Failed actions:
    pgsql:1_monitor_7000 (node=node2, call=11, rc=7, status=complete): not running
  {The vip-slave resource has successfully switched over to node1.}
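  To cross-check from the database side, you can query pg_stat_replication on the primary; with the standby's postgres processes gone, its walsender row should have disappeared (a quick sanity check, assuming psql runs locally as the postgres user):
[postgres@node1 ~]$ psql -c "select application_name, state, sync_state from pg_stat_replication;"
  The query should return zero rows until the standby reconnects.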
  Restart heartbeat on node2; the database will be started along with it:
[root@node2 ~]# service heartbeat restart
  After a while, check the state again:
[root@node1 ~]# crm_mon -Afr1
============
Last updated: Mon Jan 27 08:39:16 2014
Stack: Heartbeat
Current DC: node1 (30b7dc95-25c5-40d7-b1e4-7eaf2d5cdf07) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============

Online: [ node1 node2 ]

Full list of resources:

vip-slave (ocf::heartbeat:IPaddr2): Started node2
Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started node1
     vip-rep (ocf::heartbeat:IPaddr2): Started node1
Master/Slave Set: msPostgresql
     Masters: [ node1 ]
     Slaves: [ node2 ]
Clone Set: clnPingCheck
     Started: [ node1 node2 ]

Node Attributes:
* Node node1:
    + default_ping_set                 : 100
    + master-pgsql:0                   : 1000
    + pgsql-data-status                : LATEST
    + pgsql-master-baseline            : 0000000010000000
    + pgsql-status                     : PRI
* Node node2:
    + default_ping_set                 : 100
    + master-pgsql:1                   : 100
    + pgsql-data-status                : STREAMING|SYNC
    + pgsql-status                     : HS:sync

Migration summary:
* Node node1:
* Node node2:
  {vip-slave has moved back to node2, and streaming replication has been re-established.}
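  You can also confirm the roles from within PostgreSQL itself; pg_is_in_recovery() should report true on the re-attached standby (a hedged check, assuming local psql access as postgres on node2):
[postgres@node2 ~]$ psql -c "select pg_is_in_recovery();"
  A result of t means node2 is replaying WAL as a hot standby.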
5.2 Primary node failover
  Kill the postgres processes on node1 to simulate a database crash on the primary node:
[root@node1 ~]# killall -9 postgres
  After a moment, check the cluster state:
[root@node2 ~]# crm_mon -Afr -1
============
Last updated: Mon Jan 27 08:43:03 2014
Stack: Heartbeat
Current DC: node1 (30b7dc95-25c5-40d7-b1e4-7eaf2d5cdf07) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============

Online: [ node1 node2 ]

Full list of resources:

vip-slave (ocf::heartbeat:IPaddr2): Started node2
Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started node2
     vip-rep (ocf::heartbeat:IPaddr2): Started node2
Master/Slave Set: msPostgresql
     Masters: [ node2 ]
     Stopped: [ pgsql:0 ]
Clone Set: clnPingCheck
     Started: [ node1 node2 ]

Node Attributes:
* Node node1:
    + default_ping_set                 : 100
    + master-pgsql:0                   : -INFINITY
    + pgsql-data-status                : DISCONNECT
    + pgsql-status                     : STOP
* Node node2:
    + default_ping_set                 : 100
    + master-pgsql:1                   : 1000
    + pgsql-data-status                : LATEST
    + pgsql-master-baseline            : 00000000120000B0
    + pgsql-status                     : PRI

Migration summary:
* Node node1:
   pgsql:0: migration-threshold=1 fail-count=1
* Node node2:

Failed actions:
    pgsql:0_monitor_2000 (node=node1, call=25, rc=7, status=complete): not running
  {vip-master and vip-rep have both switched over to node2, node2 has been promoted to master, and the PostgreSQL instance on node2 is now in PRI state.}
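  The promotion can be double-checked from PostgreSQL: after failover, pg_is_in_recovery() on node2 should report false (assuming local psql access as the postgres user):
[postgres@node2 ~]$ psql -c "select pg_is_in_recovery();"
  A result of f confirms node2 has left recovery and is serving as the primary.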
5.3 Primary node recovery
  After repairing the old primary, bring it back as the new standby.
  Run a base backup on node1 to resynchronize it:
[postgres@node1 data]$ pwd
/opt/pgsql/data
[postgres@node1 data]$ rm -rf *
[postgres@node1 data]$ pg_basebackup -h 192.168.2.3 -U postgres -D /opt/pgsql/data/ -P
19172/19172 kB (100%), 1/1 tablespace
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived
[postgres@node1 data]$ ls
backup_label      base    pg_clog      pg_ident.conf  pg_notify  pg_stat_tmp  pg_tblspc    PG_VERSION  postgresql.conf
backup_label.old  global  pg_hba.conf  pg_multixact   pg_serial  pg_subtrans  pg_twophase  pg_xlog     recovery.done
  Before starting heartbeat you must delete the resource lock file, otherwise the resource will not start along with heartbeat:
[root@node1 ~]# rm -rf /var/lib/pgsql/tmp/PGSQL.lock
  {This lock file is created while a node is the primary, but it is not removed automatically when heartbeat aborts or the database/system crashes. When recovering any node that has ever served as primary, the lock file must therefore be cleaned up by hand.}
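  The whole recovery sequence can be collected into a small helper script. The following is a minimal sketch using the paths and replication IP from this article (a hypothetical helper, adjust before use):
#!/bin/bash
# rebuild_standby.sh - rebuild a failed ex-primary as a streaming standby (sketch)
set -e
service heartbeat stop                                    # stop cluster management first
su - postgres -c 'rm -rf /opt/pgsql/data/*'               # wipe the stale data directory
su - postgres -c 'pg_basebackup -h 192.168.2.3 -U postgres -D /opt/pgsql/data/ -P'
rm -f /var/lib/pgsql/tmp/PGSQL.lock                       # node served as primary: clear the lock
service heartbeat start                                   # rejoin; the RA starts pg as a standby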
  Restart heartbeat on node1:
[root@node1 ~]# service heartbeat restart
  After a while, check the cluster state:
[root@node2 ~]# crm_mon -Afr1
============
Last updated: Mon Jan 27 08:50:43 2014
Stack: Heartbeat
Current DC: node2 (f2dcd1df-7429-42f5-82e9-b73921f97cab) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============

Online: [ node1 node2 ]

Full list of resources:

vip-slave (ocf::heartbeat:IPaddr2): Started node1
Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started node2
     vip-rep (ocf::heartbeat:IPaddr2): Started node2
Master/Slave Set: msPostgresql
     Masters: [ node2 ]
     Slaves: [ node1 ]
Clone Set: clnPingCheck
     Started: [ node1 node2 ]

Node Attributes:
* Node node1:
    + default_ping_set                 : 100
    + master-pgsql:0                   : 100
    + pgsql-data-status                : STREAMING|SYNC
    + pgsql-status                     : HS:sync
* Node node2:
    + default_ping_set                 : 100
    + master-pgsql:1                   : 1000
    + pgsql-data-status                : LATEST
    + pgsql-master-baseline            : 00000000120000B0
    + pgsql-status                     : PRI

Migration summary:
* Node node1:
* Node node2:
  {vip-slave has successfully switched over to node1, and node1 has become the streaming replication standby.}
6. Management
6.1 Starting and stopping heartbeat
[root@node1 ~]# service heartbeat start  
[root@node1 ~]# service heartbeat stop
6.2 Checking HA status
[root@node1 ~]# crm status
6.3 Checking resource status and node attributes
[root@node1 ~]# crm_mon -Afr -1
6.4 Viewing the configuration
[root@node1 ~]# crm configure show
6.5 Monitoring HA in real time
[root@node1 ~]# crm_mon -Afr
6.6 The crm_resource command
Starting/stopping a resource:
[root@node1 ~]# crm_resource -r vip-master -v started  
[root@node1 ~]# crm_resource -r vip-master -v stopped
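  Note: -v started/-v stopped works by manipulating the resource's target-role. On some crm_resource versions you may need to set the meta attribute explicitly; an equivalent form (not verified on every release) is:
[root@node1 ~]# crm_resource --meta -r vip-master -p target-role -v Stopped
[root@node1 ~]# crm_resource --meta -r vip-master -p target-role -v Started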
Listing resources:
[root@node1 ~]# crm_resource -L
vip-slave (ocf::heartbeat:IPaddr2): Started
Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started
     vip-rep (ocf::heartbeat:IPaddr2): Started
Master/Slave Set: msPostgresql [pgsql]
     Masters: [ node1 ]
     Slaves: [ node2 ]
Clone Set: clnPingCheck [pingCheck]
     Started: [ node1 node2 ]
Locating a resource:
[root@node1 ~]# crm_resource -W -r pgsql  
resource pgsql is running on: node2
Migrating a resource:
[root@node1 ~]# crm_resource -M -r vip-slave -N node2
Deleting a resource:
[root@node1 ~]# crm_resource -D -r vip-slave -t primitive
6.7 The crm command
Listing the resource agents of a given class:
[root@node1 ~]# crm ra list ocf pacemaker
ClusterMon     Dummy          HealthCPU      HealthSMART    Stateful       SysInfo        SystemHealth   controld       ping           pingd
remote
Deleting a node:
[root@node1 ~]# crm node delete node2
Putting a node into standby:
[root@node1 ~]# crm node standby node2
Bringing a node back online:
[root@node1 ~]# crm node online node2
Configuring pacemaker:
[root@node1 ~]# crm configure
crm(live)configure#
……
……
crm(live)configure# commit
crm(live)configure# quit
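  As a concrete illustration, a session that toggles a standard cluster property and commits it might look like this (maintenance-mode is a stock Pacemaker property; shown only as an example edit):
[root@node1 ~]# crm configure
crm(live)configure# property maintenance-mode=true
crm(live)configure# commit
crm(live)configure# quit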
6.8 Resetting the failcount
[root@node1 ~]# crm resource
crm(live)resource# failcount pgsql set node1 0
crm(live)resource# failcount pgsql show node1
scope=status  name=fail-count-pgsql value=0

[root@node1 ~]# crm resource cleanup pgsql
Cleaning up pgsql:0 on node1
Waiting for 1 replies from the CRMd. OK

[root@node1 ~]# crm_failcount -G -U node1 -r pgsql
scope=status  name=fail-count-pgsql value=INFINITY
[root@node1 ~]# crm_failcount -D -U node1 -r pgsql
7. Troubleshooting notes
7.1 Q1
  Symptom:
  The heartbeat log reports the following errors:
  Jan 24 07:47:36 node1 heartbeat: [2515]: WARN: nodename node1 uuid changed to node2
  Jan 24 07:47:38 node1 heartbeat: [2515]: WARN: nodename node2 uuid changed to node1
  Jan 24 07:47:38 node1 heartbeat: [2515]: WARN: nodename node1 uuid changed to node2
  Jan 24 07:47:40 node1 heartbeat: [2515]: WARN: nodename node2 uuid changed to node1
  Jan 24 07:47:40 node1 heartbeat: [2515]: WARN: nodename node1 uuid changed to node2
  Jan 24 07:47:42 node1 heartbeat: [2515]: WARN: nodename node2 uuid changed to node1
  Jan 24 07:47:42 node1 heartbeat: [2515]: WARN: nodename node1 uuid changed to node2
  Solution:
  node2 was cloned from a virtual machine, so both nodes ended up with the same hb_uuid; delete it so a new one is generated:
[root@node2 ~]# rm -rf /var/lib/heartbeat/hb_uuid  
[root@node2 ~]# service heartbeat restart
  A new hb_uuid is generated after the restart.
7.2 Q2
  Symptom:
  Loading the configuration fails with:
[root@node1 ~]# crm configure load update pgsql.crm
ERROR: pgsql: parameter rep_mode does not exist
ERROR: pgsql: parameter node_list does not exist
ERROR: pgsql: parameter master_ip does not exist
ERROR: pgsql: parameter restore_command does not exist
ERROR: pgsql: parameter primary_conninfo_opt does not exist
WARNING: pgsql: specified timeout 60s for stop is smaller than the advised 120
WARNING: pgsql: action monitor_Master not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: specified timeout 60s for start is smaller than the advised 120
WARNING: pgsql: action notify not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: action demote not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: action promote not advertised in meta-data, it may not be supported by the RA
WARNING: pingCheck: specified timeout 60s for start is smaller than the advised 90
WARNING: pingCheck: specified timeout 60s for stop is smaller than the advised 100
Do you still want to commit?
  Solution:
  The bundled pgsql resource agent is too old and does not know the parameters set in pgsql.crm; download the current agent and replace the old one:
  https://raw.github.com/ClusterLabs/resource-agents
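  For instance, the agent can be fetched straight from the repository and the old one overwritten (the exact raw URL layout is an assumption; see the reference in section 8):
[root@node1 ~]# wget https://raw.github.com/ClusterLabs/resource-agents/master/heartbeat/pgsql -O /usr/lib/ocf/resource.d/heartbeat/pgsql
[root@node1 ~]# chmod 755 /usr/lib/ocf/resource.d/heartbeat/pgsql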
7.3 Q3
  Symptom:
  Loading the configuration fails with:
[root@node1 ~]# crm configure load update pgsql.crm
lrmadmin[15368]: 2014/01/24_09:18:44 ERROR: lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a reply message of rmetadata with function get_ret_from_msg.
ERROR: ocf:heartbeat:pgsql: could not parse meta-data:
ERROR: ocf:heartbeat:pgsql: could not parse meta-data:
ERROR: ocf:heartbeat:pgsql: no such resource agent
WARNING: pingCheck: specified timeout 60s for start is smaller than the advised 90
WARNING: pingCheck: specified timeout 60s for stop is smaller than the advised 100
Do you still want to commit?
  Solution:
  The pgsql agent's file permissions are wrong; fix them with:
  # chmod 755 /usr/lib/ocf/resource.d/heartbeat/pgsql
7.4 Q4
  Symptom:
  Starting heartbeat fails with:
[root@node1 ~]# service heartbeat start  
/usr/lib/ocf/lib//heartbeat/ocf-shellfuncs: line 56: @OCF_ROOT_DIR@/lib/heartbeat/ocf-binaries: No such file or directory
  Solution:
  On CentOS 5.5 the @OCF_ROOT_DIR@ placeholder is never replaced with the real path. Work around it by editing ocf-shellfuncs so the block reads:
  if [ -z "$OCF_ROOT" ]; then
  #    : ${OCF_ROOT=@OCF_ROOT_DIR@}
      : ${OCF_ROOT=/usr/lib/ocf}
  fi
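  The same edit can be made non-interactively; a one-liner like this substitutes the placeholder in place and keeps a backup (a sketch, path taken from the error message above):
[root@node1 ~]# sed -i.bak 's|@OCF_ROOT_DIR@|/usr/lib/ocf|g' /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs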
7.5 Q5
  Symptom:
  Starting heartbeat fails with:
# service heartbeat start  
/usr/lib/ocf/lib//heartbeat/ocf-shellfuncs: line 60: /usr/lib/ocf/lib/heartbeat/ocf-rarun: No such file or directory
  Solution:
  The ocf-rarun script is missing; download it and place it at the path above:
  https://raw.github.com/ClusterLabs/resource-agents
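  For example (the raw URL path is an assumption; verify it against the repository listed in section 8):
[root@node1 ~]# wget https://raw.github.com/ClusterLabs/resource-agents/master/heartbeat/ocf-rarun -O /usr/lib/ocf/lib/heartbeat/ocf-rarun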
7.6 Q6
  Symptom:
  heartbeat fails to start because its client binaries cannot be found:
[root@db1 ~]# service heartbeat start
Starting High-Availability services:  Heartbeat failure [rc=6]. Failed.

heartbeat[2074]: 2014/01/23_09:06:59 info: Pacemaker support: yes
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Client child command [/usr/lib64/heartbeat/cib] is not executable
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Directive failfast  hacluster /usr/lib64/heartbeat/cib failed
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Client child command [/usr/lib64/heartbeat/stonithd] is not executable
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Directive respawn root /usr/lib64/heartbeat/stonithd failed
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Client child command [/usr/lib64/heartbeat/attrd] is not executable
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Directive respawn  hacluster /usr/lib64/heartbeat/attrd failed
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Client child command [/usr/lib64/heartbeat/crmd] is not executable
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Directive failfast  hacluster /usr/lib64/heartbeat/crmd failed
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Heartbeat not started: configuration error.
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Configuration error, heartbeat not started.
  Solution:
  Symlink the pacemaker binaries into the paths heartbeat expects:
ln -s /usr/libexec/pacemaker/cib /usr/lib64/heartbeat/cib
ln -s /usr/libexec/pacemaker/stonithd /usr/lib64/heartbeat/stonithd
ln -s /usr/libexec/pacemaker/attrd /usr/lib64/heartbeat/attrd
ln -s /usr/libexec/pacemaker/crmd /usr/lib64/heartbeat/crmd
7.7 Q7
  Symptom:
  Starting heartbeat fails with:
  Jan 23 09:10:15 db1 heartbeat: [2129]: info: Heartbeat generation: 1390439416
  Jan 23 09:10:15 db1 heartbeat: [2129]: info: No uuid found for current node - generating a new uuid.
  Jan 23 09:10:15 db1 heartbeat: [2129]: info: Creating FIFO /var/lib/heartbeat/fifo.
  Jan 23 09:10:15 db1 heartbeat: [2129]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
  Jan 23 09:10:15 db1 heartbeat: [2129]: info: glib: ucast: bound send socket to device: eth1
  Jan 23 09:10:15 db1 heartbeat: [2129]: ERROR: glib: ucast: error setting option SO_REUSEPORT(w): Protocol not available
  Jan 23 09:10:15 db1 heartbeat: [2129]: ERROR: make_io_childpair: cannot open ucast eth1
  Jan 23 09:10:16 db1 heartbeat: [2132]: CRIT: Emergency Shutdown: Master Control process died.
  Jan 23 09:10:16 db1 heartbeat: [2132]: CRIT: Killing pid 2129 with SIGTERM
  Jan 23 09:10:16 db1 heartbeat: [2132]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
  Solution (either of the following):
  1. Upgrade the kernel; the current kernel does not support ucast (SO_REUSEPORT is unavailable);
  2. Switch to another heartbeat medium, such as mcast or bcast.
7.8 Q8
  Symptom:
  With the bcast heartbeat medium the following errors appear:
  Jan 24 01:30:20 db2 heartbeat: [29856]: ERROR: glib: Error binding socket (Address already in use). Retrying.
  Jan 24 01:30:21 db2 heartbeat: [29856]: ERROR: glib: Error binding socket (Address already in use). Retrying.
  Jan 24 01:30:22 db2 heartbeat: [29856]: ERROR: glib: Error binding socket (Address already in use). Retrying.
  Jan 24 01:30:23 db2 heartbeat: [29856]: ERROR: glib: Error binding socket (Address already in use). Retrying.
  Jan 24 01:30:24 db2 heartbeat: [29856]: ERROR: glib: Unable to bind socket (Address already in use). Giving up.
  Jan 24 01:30:24 db2 heartbeat: [29856]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth1 - Status: 1
  Jan 24 01:30:24 db2 heartbeat: [29856]: ERROR: make_io_childpair: cannot open bcast eth1
  Jan 24 01:30:25 db2 heartbeat: [29859]: CRIT: Emergency Shutdown: Master Control process died.
  Jan 24 01:30:25 db2 heartbeat: [29859]: CRIT: Killing pid 29856 with SIGTERM
  Jan 24 01:30:25 db2 heartbeat: [29859]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
  Solution:
  Port 694 is already in use. Check who holds it:
[root@db1 ~]# netstat -nlp | grep 694
udp        0      0 0.0.0.0:694                 0.0.0.0:*                               1367/rpcbind
udp        0      0 :::694                      :::*                                    1367/rpcbind
  Switch heartbeat to a free UDP port, e.g. specify udpport 692 in ha.cf, as sketched below.
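  The relevant part of the heartbeat configuration would then read something like this (a sketch; eth1 matches the log above, and 692 is just an example of a free port):
# /etc/ha.d/ha.cf
udpport 692
bcast eth1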
8. References
  pgsql resource agent script:
  https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
  Script usage guide:
  https://github.com/t-matsuo/resource-agents/wiki/Resource-Agent-for-PostgreSQL-9.1-streaming-replication
  crm_resource command:
  http://www.novell.com/zh-cn/documentation/sle_ha/book_sleha/data/man_crmresource.html
  crm_failcount command:
  http://www.novell.com/zh-cn/documentation/sle_ha/book_sleha/data/man_crmfailcount.html


