远行的心 posted on 2018-9-25 06:54:16

Environment: Oracle 11.2.0.1 + AIX 6, memory problem

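  Environment: Oracle 11.2.0.1 + RAC + AIX 6.1, with two databases running on the cluster
  1         Problem description
  At about 15:00 on November 29, 2010, host p570a could not be reached by telnet and applications could not establish new connections, which seriously affected the business. We arrived at the customer site at 16:00 and began emergency handling.
  The emergency handling of this database failure and the subsequent analysis are summarized below: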
  2         Emergency handling
  After logging in to p570a through the HMC console, every command we entered failed with an out-of-memory error:
  root@p570a:/> errpt|more
  ksh: 0403-031 The fork function failed. There is not enough memory available.
  ksh: 0403-031 The fork function failed. There is not enough memory available.
  root@p570a:/> ps -ef | grep LOCAL=NO|wc -l
  ksh: 0403-031 The fork function failed. There is not enough memory available.
  root@p570a:/> ls
  ksh: 0403-031 The fork function failed. There is not enough memory available.
  
  With the customer's approval, p570a was rebooted from the HMC console.
  3         P570a failure analysis
  3.1   Operating system errpt
  p570a@root#errpt|more
  IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
  A6DF45AA   1129164210 I O RMCdaemon      The daemon is started.
  EC0BCCD4   1129164110 T H ent1           ETHERNET DOWN
  67145A39   1129163910 U S SYSDUMP        SYSTEM DUMP
  F48137AC   1129163810 U O minidump       COMPRESSED MINIMAL DUMP
  1104AA28   1129163810 T S SYSPROC        SYSTEM RESET INTERRUPT RECEIVED
  9DBCFDEE   1129164110 T O errdemon       ERROR LOGGING TURNED ON
  B6267342   1126235510 P H hdisk3         DISK OPERATION ERROR
  B6267342   1125235510 P H hdisk3         DISK OPERATION ERROR
  C5C09FFA   1125062110 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1125051010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
  
  p570a@root#errpt -aj C5C09FFA |more
  ---------------------------------------------------------------------------
  LABEL:         PGSP_KILL
  IDENTIFIER:    C5C09FFA
  
  Date/Time:      Thu Nov 25 06:21:13 BEIST 2010
  Sequence Number: 99122
  Machine Id:     00C6E9C54C00
  Node Id:        570a
  Class:          S
  Type:         PERM
  WPAR:         Global
  Resource Name:SYSVMM
  Description
  SOFTWARE PROGRAM ABNORMALLY TERMINATED
  Probable Causes
  SYSTEM RUNNING OUT OF PAGING SPACE
  Failure Causes
  INSUFFICIENT PAGING SPACE DEFINED FOR THE SYSTEM
  PROGRAM USING EXCESSIVE AMOUNT OF PAGING SPACE
  
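  From November 24 onward the system was already reporting that paging space had run out, so physical memory had clearly been exhausted long before the hang.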
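To see how the paging-space kills built up over time, the errpt summary can be tallied per day. The following is a minimal sketch, assuming the standard errpt summary layout with IDENTIFIER and TIMESTAMP (MMDDhhmmYY) as separate columns; the function name `errpt_count_by_day` is hypothetical.

```shell
#!/bin/sh
# errpt_count_by_day ID
# Hypothetical helper: reads errpt summary lines on stdin and prints
# "MMDD count" for every entry whose IDENTIFIER column equals ID.
# Assumes the layout "IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION"
# with TIMESTAMP in MMDDhhmmYY form; the year is ignored when grouping.
errpt_count_by_day() {
    awk -v id="$1" '
        $1 == id { n[substr($2, 1, 4)]++ }   # group by month and day
        END { for (d in n) print d, n[d] }
    ' | sort
}

# Demo on three PGSP_KILL (C5C09FFA) entries taken from the errpt listing.
errpt_count_by_day C5C09FFA <<'EOF'
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
C5C09FFA   1125062110 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA   1125051010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
EOF
# Expected output:
# 1124 1
# 1125 2
```

On the host itself, `errpt -j C5C09FFA | errpt_count_by_day C5C09FFA` would give the same per-day view over the whole error log. A steadily rising count is an early warning that paging space is being exhausted.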
  3.2   Database alert log
  From November 24 onward, alert_gzjb1.log contains large numbers of errors like the following:
  Wed Nov 24 22:36:15 2010
  ORA-27302: failure occurred at: skgpspawn3
  ORA-27301: OS failure message: Not enough space
  ORA-27300: OS system dependent operation:fork failed with status: 12
  Errors in file /oracle/app/oracle/diag/rdbms/gdjb/gdjb1/trace/gdjb1_psp0_352314.trc:
  Process startup failed, error stack:
  Thu Nov 25 02:56:24 2010
  Process q000 died, see its trace file
  Thu Nov 25 02:56:13 2010
  ORA-27302: failure occurred at: skgpspawn3
  ORA-27301: OS failure message: Not enough space
  ORA-27300: OS system dependent operation:fork failed with status: 12
  Errors in file /oracle/app/oracle/diag/rdbms/gdjb/gdjb1/trace/gdjb1_psp0_352314.trc:
  Process startup failed, error stack:
  Instance terminated by USER, pid = 144242
  USER (ospid: 144242): terminating the instance due to error 443
  Process LMHB died, see its trace file
  ORA-27302: failure occurred at: skgpspawn3
  ORA-27301: OS failure message: Not enough space
  ORA-27300: OS system dependent operation:fork failed with status: 12
  Errors in file /oracle/app/oracle/diag/rdbms/gdjb/gdjb1/trace/gdjb1_ora_144242.trc:
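  The database instance on node p570a went down because physical memory and paging space were exhausted, so its memory requests (including forking new processes) could not be satisfied.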
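The key evidence in the alert log is ORA-27300 with "fork failed with status: 12" (errno 12 on AIX is ENOMEM, "Not enough space", the same error the listener reports). A minimal sketch for counting these errors and finding when they first appeared, assuming alert-log timestamp lines of the form "Wed Nov 24 22:36:15 2010"; the function name `fork_failure_summary` is hypothetical.

```shell
#!/bin/sh
# fork_failure_summary
# Hypothetical helper: reads an Oracle alert log on stdin, counts
# "fork failed with status: 12" errors and reports the timestamp line
# that precedes the first one.
fork_failure_summary() {
    awk '
        # Remember the most recent timestamp line, e.g. "Wed Nov 24 22:36:15 2010".
        /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) / && NF == 5 { ts = $0 }
        # ORA-27300 lines end with the errno value; keep only errno 12.
        /fork failed with status: / && $NF == 12 {
            n++
            if (first == "") first = ts
        }
        END { printf "fork_failures=%d first_seen=%s\n", n + 0, first }
    '
}

# Demo on lines taken from the alert log excerpt.
fork_failure_summary <<'EOF'
Wed Nov 24 22:36:15 2010
ORA-27302: failure occurred at: skgpspawn3
ORA-27301: OS failure message: Not enough space
ORA-27300: OS system dependent operation:fork failed with status: 12
Thu Nov 25 02:56:13 2010
ORA-27300: OS system dependent operation:fork failed with status: 12
EOF
# Expected output: fork_failures=2 first_seen=Wed Nov 24 22:36:15 2010
```

Run against the real alert log file (for example `fork_failure_summary < alert_gzjb1.log`), the first_seen value shows how long the instance had been starved of memory before it was terminated.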
  3.3   Listener.log
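  TNS-12500: TNS:listener failed to start a dedicated server process
  TNS-12540: TNS:internal limit restriction exceeded
  TNS-12560: TNS:protocol adapter error
  TNS-00510: Internal limit restriction exceeded
  IBM/AIX RISC System/6000 Error: 12: Not enough space
  The listener log likewise shows that incoming client connections could not be served.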
  3.4   Checking physical memory and Oracle memory parameters
  Physical memory
  p570a
  AIX
  System Model: IBM,9117-MMA
  Machine Serial Number: 066E9C5
  Processor Type: PowerPC_POWER6
  Processor Implementation Mode: POWER 6
  Processor Version: PV_6_Compat
  Number Of Processors: 8
  Processor Clock Speed: 3504 MHz
  CPU Type: 64-bit
  Kernel Type: 64-bit
  LPAR Info: 1 06-6E9C5
  Memory Size:
  Good Memory Size:
  Platform Firmware level: EM350_038
  Firmware Version: IBM,EM350_038
  Console Login: enable
  Auto Restart: true
  Full Core: false
  Total physical memory is therefore about 15 GB.
  Database A
  SQL> show sga
  Total System Global Area 2137886720 bytes
  Fixed Size
  Variable Size
  Database Buffers         922746880 bytes
  Redo Buffers               4968448 bytes
  SQL> show parameter sga
  NAME                              TYPE       VALUE
  ------------------------------------ ----------- ------------------------------
  lock_sga                            boolean    FALSE
  pre_page_sga                        boolean    FALSE
  sga_max_size                        big integer 2G
  sga_target                        big integer 2G
  SQL> show parameter pga
  NAME                              TYPE       VALUE
  ------------------------------------ ----------- ------------------------------
  pga_aggregate_target                big integer 1G
  SQL> show parameter instance_name
  NAME                              TYPE       VALUE
  ------------------------------------ ----------- ------------------------------
  instance_name                     string   gd1
  Database A is therefore configured to use about 3 GB of memory (2 GB SGA + 1 GB PGA target).
  Database B
  SQL> show sga
  Total System Global Area 8551575552 bytes
  Fixed Size
  Variable Size
  Database Buffers      6761218048 bytes
  Redo Buffers               9748480 bytes
  SQL> show parameter sga
  NAME                              TYPE    VALUE
  lock_sga                            Boolean FALSE
  pre_page_sga                        Boolean FALSE
  sga_max_size                        big integer 8G
  sga_target                        big integer 8G
  SQL> show parameter instance_name
  NAME                              TYPE       VALUE
  ------------------------------------ ----------- ------------------------------
  instance_name                     string   gd2
  SQL> show parameter pga
  NAME                              TYPE            VALUE
  pga_aggregate_target                big integer      2G
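  Database B is therefore configured to use about 10 GB of memory (8 GB SGA + 2 GB PGA target), a large share of the host's total memory.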
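The "about 3 GB" and "about 10 GB" figures are simply SGA plus pga_aggregate_target. A minimal sketch of that arithmetic, using the "Total System Global Area" byte counts shown above; the function name `instance_footprint_mb` is hypothetical, and the result is only the planned footprint, because in 11g pga_aggregate_target is a target rather than a hard cap.

```shell
#!/bin/sh
# instance_footprint_mb SGA_BYTES PGA_TARGET_MB
# Hypothetical helper: converts the SGA size reported by "show sga" to MB
# (rounded down) and adds pga_aggregate_target in MB. Actual PGA use can
# exceed the target, so this is a lower bound on what the instance uses.
instance_footprint_mb() {
    echo $(( $1 / 1048576 + $2 ))
}

instance_footprint_mb 2137886720 1024   # database A: 2038 MB SGA + 1024 MB -> 3062
instance_footprint_mb 8551575552 2048   # database B: 8155 MB SGA + 2048 MB -> 10203
```

Together the two instances are planned at 13265 MB, roughly 13 GB of the host's 15 GB.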
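  4         Summary and recommendations
  4.1   Root cause analysis
  Total physical memory is about 15 GB, and the two databases are configured to use 13 GB of it, leaving only about 2 GB for the operating system. Because pga_aggregate_target is a target rather than a hard limit, PGA usage grows as the number of connections increases or sessions are not released, so physical memory and then paging space were easily exhausted, which brought down the database and hung the host.
  4.2   Actions taken and recommendations
  1) The Oracle memory parameters of the gzcdc database were set too high and needed to be reduced. After discussion with the application vendor and the customer, the gzcdc SGA was reduced to 5 GB and the PGA target to 1 GB, which leaves about 6 GB (15 GB - 3 GB - 6 GB) for the operating system.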
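The before/after memory budget for the host can be checked with a few lines of shell arithmetic. A minimal sketch, assuming gzcdc is database B (the instance with the 8 GB SGA) and working in whole GB; the function name `os_headroom_gb` is hypothetical. Real headroom will be smaller whenever PGA use exceeds its target.

```shell
#!/bin/sh
# os_headroom_gb TOTAL_GB SGA_GB PGA_GB [SGA_GB PGA_GB ...]
# Hypothetical helper: subtracts sga_target and pga_aggregate_target of
# every instance from the host's physical memory and prints what is left
# for AIX itself. Whole-GB arithmetic only.
os_headroom_gb() {
    left=$1
    shift
    while [ "$#" -ge 2 ]; do
        left=$(( left - $1 - $2 ))
        shift 2
    done
    echo "$left"
}

os_headroom_gb 15 2 1 8 2   # before: A (2G + 1G) and B (8G + 2G) -> 2
os_headroom_gb 15 2 1 5 1   # after:  A unchanged, B cut to 5G + 1G -> 6
```

Keeping such a budget next to the init parameters makes it easy to see how much room is left before adding connections or a third instance.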
