xiaoxue85 发表于 2018-9-26 11:17:38

Oracle技术之11.2 RAC时间同步异常(二)

  一个客户的11.2的RAC环境,出现了一个节点DOWN掉的问题。
  这篇继续描述定位问题、解决问题的过程。
  在检查了ASM和数据库的告警日志后,发现错误是由于CLUSTER问题导致,而分析了CLUSTER上的告警信息,进一步发现问题似乎和时间同步服务有关。
  检查otcssd日志信息发现:
  2010-05-10 15:21:25.785: [ CTSS]ctssslave_swm: The magnitude of the offset [-31532764620334 usec] is larger than sec which is the CTSS limit.
  2010-05-10 15:21:25.785: [ CTSS]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764620334] usec will be changed to max adjtime limit [+/- 131071].
  2010-05-10 15:21:25.785: [ CTSS]ctssslave_swm15: The CTSS master is ahead this node. The local time offset is being adjusted. Sync method
  2010-05-10 15:21:25.786: [ CTSS]ctssslave_swm17: LT , MT , Delta
  2010-05-10 15:21:25.786: [ CTSS]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to
  2010-05-10 15:21:25.786: [ CTSS]ctssslave_swm: Received from master (mode nodenum hostname )
  2010-05-10 15:21:25.786: [ CTSS]ctsselect_msm: Sync interval returned in
  2010-05-10 15:21:25.786: [ CTSS]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
  2010-05-10 15:21:26.886: [ CTSS]ctsselect_msm: CTSS mode is
  2010-05-10 15:21:26.886: [ CTSS]ctssslave_swm1_2: Ready to initiate new time sync process.
  2010-05-10 15:21:26.886: [ CTSS]ctssslave_swm2_1: Waiting for time sync message from master. sync_state.
  2010-05-10 15:21:26.887: [ CTSS]ctsscomm_recv_cb2: Receive incoming message event. Msgtype .
  2010-05-10 15:21:26.887: [ CTSS]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state.
  2010-05-10 15:21:26.887: [ CTSS]ctssslave_swm2_3: Received time sync message from master.
  2010-05-10 15:21:26.887: [ CTSS]ctssslave_swm: The magnitude of the offset [-31532764520368 usec] is larger than sec which is the CTSS limit.
  2010-05-10 15:21:26.887: [ CTSS]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764520368] usec will be changed to max adjtime limit [+/- 131071].
  2010-05-10 15:21:26.887: [ CTSS]ctssslave_swm15: The CTSS master is ahead this node. The local time offset is being adjusted. Sync method
  2010-05-10 15:21:26.888: [ CTSS]ctssslave_swm17: LT , MT , Delta
  2010-05-10 15:21:26.888: [ CTSS]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to
  2010-05-10 15:21:26.889: [ CTSS]ctssslave_swm: Received from master (mode nodenum hostname )
  2010-05-10 15:21:26.889: [ CTSS]ctsselect_msm: Sync interval returned in
  2010-05-10 15:21:26.889: [ CTSS]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
  2010-05-10 15:21:27.989: [ CTSS]ctsselect_msm: CTSS mode is
  2010-05-10 15:21:27.989: [ CTSS]ctssslave_swm1_2: Ready to initiate new time sync process.
  2010-05-10 15:21:27.989: [ CTSS]ctssslave_swm2_1: Waiting for time sync message from master. sync_state.
  2010-05-10 15:21:27.990: [ CTSS]ctsscomm_recv_cb2: Receive incoming message event. Msgtype .
  2010-05-10 15:21:27.990: [ CTSS]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state.
  2010-05-10 15:21:27.990: [ CTSS]ctssslave_swm2_3: Received time sync message from master.
  2010-05-10 15:21:27.990: [ CTSS]ctssslave_swm: The magnitude of the offset [-31532764419342 usec] is larger than sec which is the CTSS limit.
  2010-05-10 15:21:27.990: [ CTSS]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764419342] usec will be changed to max adjtime limit [+/- 131071].
  2010-05-10 15:21:27.990: [ CTSS]ctssslave_swm15: The CTSS master is ahead this node. The local time offset is being adjusted. Sync method
  2010-05-10 15:21:27.990: [ CTSS]ctssslave_swm17: LT , MT , Delta
  2010-05-10 15:21:27.991: [ CTSS]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to
  2010-05-10 15:21:27.991: [ CTSS]ctssslave_swm: Received from master (mode nodenum hostname )
  2010-05-10 15:21:27.991: [ CTSS]ctsselect_msm: Sync interval returned in
  2010-05-10 15:21:27.991: [ CTSS]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
  2010-05-10 15:21:29.062: [ CTSS]ctss_checkcb: clsdm requested check alive. checkcb_data{mode, offset[-1467993347 ms]}, length=.
  2010-05-10 15:21:29.091: [ CTSS]ctsselect_msm: CTSS mode is
  2010-05-10 15:21:29.091: [ CTSS]ctssslave_swm1_2: Ready to initiate new time sync process.
  2010-05-10 15:21:29.091: [ CTSS]ctssslave_swm2_1: Waiting for time sync message from master. sync_state.
  2010-05-10 15:21:29.092: [ CTSS]ctsscomm_recv_cb2: Receive incoming message event. Msgtype .
  2010-05-10 15:21:29.092: [ CTSS]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state.
  2010-05-10 15:21:29.092: [ CTSS]ctssslave_swm2_3: Received time sync message from master.
  2010-05-10 15:21:29.092: [ CTSS]ctssslave_swm: The magnitude of the offset [-31532764319434 usec] is larger than sec which is the CTSS limit.
  2010-05-10 15:21:29.092: [ CTSS]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764319434] usec will be changed to max adjtime limit [+/- 131071].
  2010-05-10 15:21:29.092: [ CTSS]ctssslave_swm15: The CTSS master is ahead this node. The local time offset is being adjusted. Sync method
  2010-05-10 15:21:29.093: [ CTSS]ctssslave_swm17: LT , MT , Delta
  2010-05-10 15:21:29.093: [ CTSS]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to
  2010-05-10 15:21:29.093: [ CTSS]ctssslave_swm: Received from master (mode nodenum hostname )
  2010-05-10 15:21:29.093: [ CTSS]ctsselect_msm: Sync interval returned in
  2010-05-10 15:21:29.093: [ CTSS]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
  2010-05-10 15:21:30.193: [ CTSS]ctsselect_msm: CTSS mode is
  2010-05-10 15:21:30.193: [ CTSS]ctssslave_swm1_2: Ready to initiate new time sync process.
  2010-05-10 15:21:30.193: [ CTSS]ctssslave_swm2_1: Waiting for time sync message from master. sync_state.
  2010-05-10 15:21:30.194: [ CTSS]ctsscomm_recv_cb2: Receive incoming message event. Msgtype .
  2010-05-10 15:21:30.194: [ CTSS]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state.
  2010-05-10 15:21:30.194: [ CTSS]ctssslave_swm2_3: Received time sync message from master.
  2010-05-10 15:21:30.194: [ CTSS]ctssslave_swm: The magnitude of the offset [-31532764219409 usec] is larger than sec which is the CTSS limit.
  2010-05-10 15:21:30.194: [ CTSS]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764219409] usec will be changed to max adjtime limit [+/- 131071].
  2010-05-10 15:21:30.194: [ CTSS]ctssslave_swm15: The CTSS master is ahead this node. The local time offset is being adjusted. Sync method
  2010-05-10 15:21:30.195: [ CTSS]ctssslave_swm17: LT , MT , Delta
  2010-05-10 15:21:30.195: [ CTSS]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to
  2010-05-10 15:21:30.195: [ CTSS]ctssslave_swm: Received from master (mode nodenum hostname )
  2010-05-10 15:21:30.195: [ CTSS]ctsselect_msm: Sync interval returned in
  2010-05-10 15:21:30.195: [ CTSS]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
  2010-05-10 15:21:31.295: [ CTSS]ctsselect_msm: CTSS mode is
  2010-05-10 15:21:31.295: [ CTSS]ctssslave_swm1_2: Ready to initiate new time sync process.
  2010-05-10 15:21:31.295: [ CTSS]ctssslave_swm2_1: Waiting for time sync message from master. sync_state.
  2010-05-10 15:21:31.296: [ CTSS]ctsscomm_recv_cb2: Receive incoming message event. Msgtype .
  2010-05-10 15:21:31.296: [ CTSS]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state.
  2010-05-10 15:21:31.296: [ CTSS]ctssslave_swm2_3: Received time sync message from master.
  2010-05-10 15:21:31.296: [ CTSS]ctssslave_swm: The magnitude of the offset [-31532764119367 usec] is larger than sec which is the CTSS limit.
  2010-05-10 15:21:31.296: [ CTSS]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764119367] usec will be changed to max adjtime limit [+/- 131071].
  2010-05-10 15:21:31.297: [ CTSS]ctssslave_swm15: The CTSS master is ahead this node. The local time offset is being adjusted. Sync method
  2010-05-10 15:21:31.297: [ CTSS]ctssslave_swm17: LT , MT , Delta
  2010-05-10 15:21:31.297: [ CTSS]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to
  2010-05-10 15:21:31.297: [ CTSS]ctssslave_swm: Received from master (mode nodenum hostname )
  2010-05-10 15:21:31.297: [ CTSS]ctsselect_msm: Sync interval returned in
  2010-05-10 15:21:31.297: [ CTSS]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
  2010-05-10 15:21:32.397: [ CTSS]ctsselect_msm: CTSS mode is
  2010-05-10 15:21:32.397: [ CTSS]ctssslave_swm1_2: Ready to initiate new time sync process.
  2010-05-10 15:21:32.397: [ CTSS]ctssslave_swm2_1: Waiting for time sync message from master. sync_state.
  2010-05-10 15:21:59.396: [ CTSS]ctss_checkcb: clsdm requested check alive. checkcb_data{mode, offset[-1467993047 ms]}, length=.
  2010-05-10 15:22:02.684: [ CTSS]ctsscomm_recv_cb2: Receive incoming message event. Msgtype .
  2010-05-10 15:22:02.684: [ CTSS]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state.
  2010-05-10 15:22:02.684: [ CTSS]ctssslave_swm2_3: Received time sync message from master.
  2010-05-10 15:22:02.684: [ CTSS]ctssslave_swm: sendtime{sec, usec}, receivetime{sec, usec}.
  2010-05-10 15:22:02.684: [ CTSS]ctssslave_swm: The RTT of sync msg is too large for time sync to be accurate. Recommends retry. Returns .
  2010-05-10 15:22:02.684: [ CTSS]ctssslave_swm: Received from master (mode nodenum hostname )
  2010-05-10 15:22:02.684: [ CTSS]ctsselect_msm: Failed in sync_with_master
  2010-05-10 15:22:02.684: [ CTSS]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
  2010-05-10 15:22:02.685: [ CTSS]ctsselect_msm: Sync interval returned in
  2010-05-10 15:22:04.592: gipchaUpperProcessDisconnect: processing DISCONNECT for hendp 111411950 { gipchaEndpoint : port 'CTSSGROUP_2/811b-0c78-d761-0a0d', peer 'jzdbnc:9ead-4b66-b4c7-458c', srcCid 00000000-0000025a, dstCid 00000000-0000028a, numSend 0, maxSend 100, usrFlags 0x4000, flags 0x204 }
  2010-05-10 15:22:04.848: [ CSSCLNT]cls***ecvMsg: got a disconnect from the server while waiting for message type 1
  2010-05-10 15:22:04.848: [ CSSCLNT]clssgsGroupGetStatus: communications failed (0/3/-1)
  2010-05-10 15:22:04.848: [ CSSCLNT]clssgsGroupGetStatus: returning 8
  2010-05-10 15:22:04.848: [ CRSCCL]Daemon exiting due to error in clssgsgrpstat retval = 8

  2010-05-10 15:22:05.636: [ CTSS]Oracle Database CTSS>  2010-05-10 15:22:05.637: [ CTSS]ctss_scls_init: SCLs Context is 11098c370
  [ clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jzdbiufoDBG_CTSSD))
  2010-05-10 15:22:05.664: [ clsdmt]PID for the Process , connkey 11
  2010-05-10 15:22:05.664: [ clsdmt]Creating PID file for home /oracle/app/11.2.0/grid host jzdbiufo bin ctss to /oracle/app/11.2.0/grid/ctss/init/
  2010-05-10 15:22:05.664: [ clsdmt]Writing PID to the file
  2010-05-10 15:22:06.505: [ CTSS]ctss_checkcb: clsdm requested check alive. checkcb_data{mode, offset}, length=.
  2010-05-10 15:22:06.512: [ CTSS]ctss_css_init: clsssinit() failed with .
  2010-05-10 15:22:06.512: [ CTSS](:ctss_css_init1:): CTSS daemon exiting. CSS unresponsive
  [ CTSS]SLOS : SLOS: cat=1311719766, pn=skgxnqtsz, dep=0, loc=SKGXN not av
  2010-05-10 15:22:06.512: [ CTSS]ctss_main: CSS init failed
  2010-05-10 15:22:06.512: [ CTSS]ctss_main: CTSS init failed
  2010-05-10 15:22:06.512: [ CTSS]ctss_main: CTSS daemon aborting .
  2010-05-10 15:22:06.512: [ CTSS]CTSS daemon aborting
  2010-05-10 15:22:07.510: [ CTSS]ctss_checkcb: clsdm requested check alive. checkcb_data{mode, offset}, length=.
  2010-05-10 15:22:08.511: [ CTSS]ctss_checkcb: clsdm requested check alive. checkcb_data{mode, offset}, length=.
  2010-05-10 15:22:09.513: [ CTSS]ctss_checkcb: clsdm requested check alive. checkcb_data{mode, offset}, length=.
  在错误信息中,出现了多次告警信息:
  ctssslave_swm: The magnitude of the offset [-31532764119367 usec] is larger than sec which is the CTSS limit.
  显然Oracle的时间同步后台一直在调整两个服务器的时间,不过令人意外的是,本来只相差几秒的时间,为什么调整后变成了1个小时左右,而且这里的数值如此巨大。
  简单分析一下不难发现,Oracle的时间调整限制是1天:
  SQL> set numw 15
  SQL> select 24 * 60 * 60 * 1000000 from dual;
  24*60*60*1000000
  ----------------
  86400000000
  那么那个大的出奇的时间范围是什么呢:
  SQL> select 31532764119367 / 1000000 / 60 / 60 / 24 from dual;
  31532764119367/1000000/60/60/24
  -------------------------------
  364.96254767786
  居然是365天,也就是1年的时间,这是巧合吗,检查两个节点上的时间发现:
  $ hostname
  jzdbiufo
  $ date
  Mon May 10 16:31:33 GMT+08:00 2010
  另一个节点:
  # hostname
  jzdbnc
  # date
  Tue May 10 15:38:04 GMT+08:00 2011
  二者相差不是1个小时,而是差1个小时1年的时间差。
  显然是操作人员在手工调整时间的时候将一个节点上的日期错误的深入为2010年,导致两个节点的时间差达到1年之久,这么大的时间差,远远超过Oracle时间同步服务允许的最大限制,因此Oracle的时间同步只能在后台选择一个可接受的时间间隔,大概是0.13秒,来尝试不断的修改系统时间,以期达到不同两个节点时间的目的。
  在CLUSTER出于不断的同步时间的过程中,客户又在两个节点上进行了大数据量的导入工作,由此导致了短时间内两个节点间的通信出现了超时,导致一个节点被踢出了CLUSTER,而由于时间同步的问题,导致了节点重启后试图再次加入到CLUSTER中报错。
  找到问题,解决就很容易了,关闭CLUSTER后,将错误的节点上的时间调整到当前时间,再次启动CLUSTER和DB,问题解决。
  oracle视频教程请关注:http://u.youku.com/user_video/id_UMzAzMjkxMjE2.html

页: [1]
查看完整版本: Oracle技术之11.2 RAC时间同步异常(二)