Last edited by dbdream on 2015-2-3 09:58
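Customer environment: RHEL 5.7, Oracle 11.2.0.4.0 for RAC

Symptoms: The application could not connect to the database. Node 1's public and private NICs did not answer ping, while node 2's public and private NICs did. Neither the VIPs nor the SCAN IP answered ping.

Root cause: The NIC controller in node 1's server failed, so none of that server's NICs worked and the heartbeat between the two nodes was lost. Unfortunately node 1 held the voting disk, so node 2 was evicted and its clusterware was shut down. The clusterware on node 1 kept running, but node 1's network was dead, so the application could not connect to the database.

Node 1's public and private IPs are not on the same subnet, yet both were unreachable. Only this one server had a network problem, so a switch fault could more or less be ruled out. Both NICs were unusable, and since several NICs are driven by a single NIC controller, the controller was the most likely culprit.

Analysis: Below are the analysis steps and part of the diagnostic output. This was a remote support case, and node 1 could only be operated by a colleague in the machine room, so some of node 1's logs are missing.

Connecting remotely to node 2 showed that the cluster was already down.

[root@pressdb4 ~]$ ps -ef | grep smon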
root 10434 1 0 2014 ? 03:35:06 /u01/app/grid/product/11.2.0/grid_home1/bin/osysmond.bin
root 13647 13476 0 19:27 pts/1 00:00:00 grep smon
[root@pressdb4 bin]# ps -ef | grep has
root 9819 1 0 2014 ? 02:00:33 /u01/app/grid/product/11.2.0/grid_home1/bin/ohasd.bin reboot
root 9947 1 0 2014 ? 00:00:00 /bin/sh /etc/init.d/init.ohasd run
root 12170 32728 0 19:25 pts/1 00:00:00 grep has
[root@pressdb4 bin]# ps -ef | grep ora
root 9491 9475 0 2014 ? 00:05:09 hald-addon-storage: polling /dev/hda
root 10297 1 0 2014 ? 04:40:46 /u01/app/grid/product/11.2.0/grid_home1/jdk/jre/bin/java -Xms64m -Xmx256m -classpath
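
Reading three separate `ps -ef | grep` outputs by eye is error-prone, so here is a minimal sketch that reports which core Clusterware daemons are absent. The helper name `missing_daemons` is hypothetical, and the daemon list (ohasd.bin, ocssd.bin, crsd.bin, evmd.bin) is an assumption for an 11.2 Grid Infrastructure stack.

```shell
#!/bin/sh
# Read `ps -ef` output on stdin and print each core Clusterware daemon
# that does not appear in it.
# Assumption: the daemon list below fits an 11.2 Grid Infrastructure
# stack; adjust it for other releases.
missing_daemons() {
    ps_out=$(cat)
    for daemon in ohasd.bin ocssd.bin crsd.bin evmd.bin; do
        # Match "/<daemon>" so a stray "grep ocssd" line is not counted.
        if ! printf '%s\n' "$ps_out" | grep -qF "/$daemon"; then
            echo "$daemon"
        fi
    done
}

# On a live node: ps -ef | missing_daemons
```

Fed the output shown above, where only ohasd.bin (plus osysmond.bin) is running, it would list ocssd.bin, crsd.bin and evmd.bin as missing, which matches the "cluster is down" conclusion.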
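At this point the heartbeat network between the two nodes had not recovered and the voting disk was still held by node 1, so the clusterware on node 2 could not start.

[root@pressdb4 bin]# ./crsctl start crs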
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
The cluster alert log on node 2 contains the following:

2015-01-28 06:16:47.621:
[cssd(8857)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/product/11.2.0/grid_home1/log/pressdb4/cssd/ocssd.log
2015-01-28 06:16:47.621:
[cssd(8857)]CRS-1603:CSSD on node pressdb4 shutdown by user.
2015-01-28 06:16:52.840:
[ohasd(9819)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'pressdb4'.
2015-01-28 06:16:52.880:
[ohasd(9819)]CRS-2878:Failed to restart resource 'ora.cssd'
2015-01-28 06:16:54.312:
[cssd(10733)]CRS-1713:CSSD daemon is started in clustered mode
2015-01-28 06:17:00.114:
[cssd(10733)]CRS-1707:Lease acquisition for node pressdb4 number 2 completed
2015-01-28 06:17:01.397:
[cssd(10733)]CRS-1605:CSSD voting file is online: /dev/asm-disk1; details in /u01/app/grid/product/11.2.0/grid_home1/log/pressdb4/cssd/ocssd.log.
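
The alert log puts the timestamp and the message on alternating lines, which makes it awkward to grep. As a rough sketch (the function name `crs_events` is hypothetical, and the two-line layout is assumed to be exactly as in the excerpt above), the following condenses each entry to one line with its timestamp, daemon and CRS code:

```shell
#!/bin/sh
# Condense alert-log entries on stdin to "<date> <time> <daemon> <CRS code>".
# Assumption: every entry is a timestamp line followed by one message line,
# as in the excerpt above.
crs_events() {
    awk '
        # A line holding only "YYYY-MM-DD HH:MM:SS.mmm:" is remembered.
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9:.]+: *$/ {
            ts = $1 " " $2
            sub(/:$/, "", ts)          # drop the trailing colon
            next
        }
        # A message line: extract the daemon name and the CRS-nnnn code.
        match($0, /CRS-[0-9]+/) {
            code = substr($0, RSTART, RLENGTH)
            daemon = $0
            sub(/^\[/, "", daemon)     # strip the leading "["
            sub(/\(.*/, "", daemon)    # strip "(pid)]..." and the rest
            print ts, daemon, code
        }
    '
}

# Example: crs_events < alertpressdb4.log   (file name is illustrative)
```

On the excerpt above this makes the sequence easy to see: CRS-1656 and CRS-1603 from cssd, CRS-2765 and CRS-2878 from ohasd, then a new cssd process reporting CRS-1713, CRS-1707 and CRS-1605.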
The CSS logs on node 1 and node 2 show that the heartbeat network had a problem.

2015-01-28 02:54:18.763: [ CSSD][1096313152]clssnmPollingThread: node pressdb3/4 (1) at 50% heartbeat fatal, removal in 14.510 seconds
2015-01-28 02:54:25.776: [ CSSD][1096313152]clssnmPollingThread: node pressdb3/4 (1) at 75% heartbeat fatal, removal in 7.500 seconds
2015-01-28 02:54:30.789: [ CSSD][1096313152]clssnmPollingThread: node pressdb3/4 (1) at 90% heartbeat fatal, removal in 2.490 seconds, seedhbimpd 1
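
Each of these warnings carries its own countdown, so adding the remaining seconds to the timestamp gives the moment the node is due to be removed. The sketch below does that arithmetic; `heartbeat_countdown` is a hypothetical name, it assumes the message layout shown above, and it does not handle a countdown that crosses midnight.

```shell
#!/bin/sh
# For each clssnmPollingThread warning on stdin, print the time of the
# warning, the percentage reached, the seconds left, and the projected
# removal time (timestamp + seconds left).
# Assumptions: message layout as in the excerpt above; no midnight rollover.
heartbeat_countdown() {
    awk '
        /clssnmPollingThread: node .* heartbeat fatal, removal in/ {
            pct = ""; left = ""
            for (i = 1; i < NF; i++) {
                if ($i == "at") { pct = $(i + 1); sub(/%/, "", pct) }
                if ($i == "in") { left = $(i + 1) }
            }
            ts = $2; sub(/:$/, "", ts)          # "02:54:18.763"
            split(ts, t, ":")
            secs = t[1] * 3600 + t[2] * 60 + t[3] + left
            h = int(secs / 3600)
            m = int((secs - h * 3600) / 60)
            s = secs - h * 3600 - m * 60
            printf "%s %s%% %.3fs left, removal due %02d:%02d:%06.3f\n", ts, pct, left, h, m, s
        }
    '
}

# Example: grep clssnmPollingThread ocssd.log | heartbeat_countdown
```

For the three lines above the projected times all fall within a few milliseconds of 02:54:33.276, which is the moment the same log records the node being marked for removal.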
Because the private network was down, the heartbeat between the two nodes failed and each node tried to evict the other.

Node 2 evicts node 1:

2015-01-28 02:54:33.276: [ CSSD][1096313152]clssnmMarkNodeForRemoval: node 1, pressdb3 marked for removal
2015-01-28 02:54:33.276: [ CSSD][1096313152]clssnmDiscHelper: pressdb3, node(1) connection failed, endp (0x268fd57), probe((nil)), ninf->endp 0x268fd57

Node 1 evicts node 2:

2015-01-28 02:55:34.751: [ CSSD][1106127168]clssnmMarkNodeForRemoval: node 2, pressdb4 marked for removal
2015-01-28 02:55:34.751: [ CSSD][1106127168]clssnmDiscHelper: pressdb4, node(2) connection failed, endp (0x6e8), probe((nil)), ninf->endp 0x6e8
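
When the ocssd.log files from both nodes are on hand, it helps to line up who marked whom for removal and when. A small sketch, assuming the clssnmMarkNodeForRemoval layout shown above; the function name `removal_events` and the idea of passing the originating host as a label are illustrative, since the log line itself does not say which node wrote it.

```shell
#!/bin/sh
# Print one summary line per clssnmMarkNodeForRemoval entry on stdin.
# $1 is a free-form label for the node the log came from (the log line
# itself does not carry it).
removal_events() {
    awk -v src="$1" '
        /clssnmMarkNodeForRemoval: node/ {
            ts = $2; sub(/:$/, "", ts)
            for (i = 1; i < NF; i++) {
                if ($i == "node") {
                    num = $(i + 1); sub(/,$/, "", num)   # "1," -> "1"
                    print src ": " ts " marked node " num " (" $(i + 2) ") for removal"
                    break
                }
            }
        }
    '
}

# Example (paths are illustrative):
#   removal_events pressdb4 < /path/to/pressdb4/ocssd.log
#   removal_events pressdb3 < /path/to/pressdb3/ocssd.log
```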
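In a two-node RAC, once the heartbeat network is down Oracle has no good way to tell which node is the faulty one. In that situation it is usually the second node (the one with the higher node number) that gets evicted, so node 2 was the unlucky one.

Confirmation: A colleague verified in the machine room that the services on node 1 had started normally.

[root@pressdb3 bin]# ./crs_stat -t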
Name Type Target State Host
------------------------------------------------------------
ora.DATA.dg ora....up.type ONLINE ONLINE pressdb3
ora....ER.lsnr ora....er.type ONLINE ONLINE pressdb3
ora....N1.lsnr ora....er.type ONLINE ONLINE pressdb3
ora.OCR.dg ora....up.type ONLINE ONLINE pressdb3
ora.asm ora.asm.type ONLINE ONLINE pressdb3
ora.cvu ora.cvu.type ONLINE ONLINE pressdb3
ora.gsd ora.gsd.type OFFLINE OFFLINE
ora....network ora....rk.type ONLINE ONLINE pressdb3
ora.oc4j ora.oc4j.type ONLINE OFFLINE
ora.ons ora.ons.type ONLINE ONLINE pressdb3
ora.pressdb.db ora....se.type ONLINE OFFLINE
ora....db3.vip ora....t1.type ONLINE ONLINE pressdb3
ora....SM2.asm application ONLINE ONLINE pressdb3
ora....B4.lsnr application ONLINE ONLINE pressdb3
ora....db4.gsd application OFFLINE OFFLINE
ora....db4.ons application ONLINE ONLINE pressdb3
ora....db4.vip ora....t1.type ONLINE ONLINE pressdb3
ora....ry.acfs ora....fs.type ONLINE ONLINE pressdb3
ora.scan1.vip ora....ip.type ONLINE ONLINE pressdb3
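
In a listing this long the interesting rows are the ones whose Target is ONLINE but whose State is not. The one-liner below is a sketch for the `crs_stat -t` column layout shown above (Name, Type, Target, State, Host); the function name `not_online` is hypothetical. Note that `crs_stat` is deprecated in 11.2 in favour of `crsctl stat res -t`, whose output is laid out differently, so this filter would not apply to it as written.

```shell
#!/bin/sh
# From `crs_stat -t` output on stdin, print resources that should be
# ONLINE (Target) but are not (State).
# Assumption: whitespace-separated columns Name Type Target State [Host].
not_online() {
    # The header and the dashed separator never have "ONLINE" in column 3,
    # so they are skipped without special handling.
    awk '$3 == "ONLINE" && $4 != "ONLINE" { print $1, $4 }'
}

# On a live node: ./crs_stat -t | not_online
```

Applied to the output above, it would single out ora.oc4j and ora.pressdb.db, the two rows with Target ONLINE and State OFFLINE.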
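So, with the heartbeat network down, node 2 could not start normally, and the application could not reach the database on node 1 either.

Fix: Shut down the cluster services on node 1 to release the resources it holds; node 2 can then start normally.

Shut down the cluster on node 1:

[root@pressdb3 bin]# ./crsctl stop crs

The cluster on node 2 now starts normally:

[root@pressdb4 bin]# ./crs_stat -t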
Name Type Target State Host
------------------------------------------------------------
ora.DATA.dg ora....up.type ONLINE ONLINE pressdb4
ora....ER.lsnr ora....er.type ONLINE ONLINE pressdb4
ora....N1.lsnr ora....er.type ONLINE ONLINE pressdb4
ora.OCR.dg ora....up.type ONLINE ONLINE pressdb4
ora.asm ora.asm.type ONLINE ONLINE pressdb4
ora.cvu ora.cvu.type ONLINE ONLINE pressdb4
ora.gsd ora.gsd.type OFFLINE OFFLINE
ora....network ora....rk.type ONLINE ONLINE pressdb4
ora.oc4j ora.oc4j.type ONLINE OFFLINE
ora.ons ora.ons.type ONLINE ONLINE pressdb4
ora.pressdb.db ora....se.type ONLINE OFFLINE
ora....db3.vip ora....t1.type ONLINE ONLINE pressdb4
ora....SM2.asm application ONLINE ONLINE pressdb4
ora....B4.lsnr application ONLINE ONLINE pressdb4
ora....db4.gsd application OFFLINE OFFLINE
ora....db4.ons application ONLINE ONLINE pressdb4
ora....db4.vip ora....t1.type ONLINE ONLINE pressdb4
ora....ry.acfs ora....fs.type ONLINE ONLINE pressdb4
ora.scan1.vip ora....ip.type ONLINE ONLINE pressdb4
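Because the network on node 2's server was healthy, the application could now access the database normally. As for node 1, once the server has been repaired, simply start its cluster services as usual.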