2025-11-11 12:21:57.020 [GIPCD(127382)]CRS-7517: The Oracle Grid Interprocess Communication (GIPC) failed to identify the Fast Node Death Detection (FNDD).
2025-11-11 12:23:07.241 [OCSSD(130761)]CRS-1621: The IPMI configuration data for this node stored in the Oracle registry is incomplete; details at (:CSSNK00002:) in /u01/app/grid/diag/crs/orcl02/crs/trace/ocssd.trc
2025-11-11 12:23:07.241 [OCSSD(130761)]CRS-1617: The information required to do node kill for node orcl02 is incomplete; details at (:CSSNM00004:) in /u01/app/grid/diag/crs/orcl02/crs/trace/ocssd.trc
2025-11-11 12:23:37.324 [OCSSD(130761)]CRS-7500: The Oracle Grid Infrastructure process 'ocssd' failed to establish Oracle Grid Interprocess Communication (GIPC) high availability connection with remote node 'orcl01'.
2025-11-11 12:28:39.020 [OCSSD(130761)]CRS-7500: The Oracle Grid Infrastructure process 'ocssd' failed to establish Oracle Grid Interprocess Communication (GIPC) high availability connection with remote node 'orcl01'.
2025-11-11 12:32:58.470 [OCSSD(130761)]CRS-1609: This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00086:) in /u01/app/grid/diag/crs/orcl02/crs/trace/ocssd.trc.
2025-11-11 12:32:58.469 [CSSDAGENT(130651)]CRS-5818: Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:5:4} in /u01/app/grid/diag/crs/orcl02/crs/trace/ohasd_cssdagent_root.trc.
2025-11-11 12:32:58.506 [OHASD(126708)]CRS-2757: Command 'Start' timed out waiting for response from the resource 'ora.cssd'. Details at (:CRSPE00221:) {0:5:4} in /u01/app/grid/diag/crs/orcl02/crs/trace/ohasd.trc.
2025-11-11 12:32:59.470 [OCSSD(130761)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/orcl02/crs/trace/ocssd.trc
2025-11-11 12:32:59.470 [OCSSD(130761)]CRS-1603: CSSD on node orcl02 has been shut down.
2025-11-11 12:33:00.151 [OCSSD(130761)]CRS-1609: This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00086:) in /u01/app/grid/diag/crs/orcl02/crs/trace/ocssd.trc.
2025-11-11T12:33:04.480316+08:00 Errors in file /u01/app/grid/diag/crs/orcl02/crs/trace/ocssd.trc (incident=17): CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/orcl02/crs/incident/incdir_17/ocssd_i17.trc
2025-11-11 12:33:04.471 [OCSSD(130761)]CRS-8503: Oracle Clusterware process OCSSD with operating system process ID 130761 experienced fatal signal or exception code 6.
2025-11-11 13:07:10.563 : CSSD:909854464: [ INFO] clssnmvDHBValidateNCopy: node 1, orcl01, has a disk HB, but no network HB, DHB has rcfg 658329638, wrtcnt, 36406, LATS 5334194, lastSeqNo 36403, uniqueness 1762834827, timestamp 1762837626/5323124
2025-11-11 13:07:10.564 : CSSD:897136384: [ INFO] clssscSelect: gipcwait returned with status gipcretTimeout (16)
The trace shows that orcl02 sees a disk heartbeat from orcl01 but no network heartbeat, which points to a communication problem on the private interconnect (heartbeat network).
Problem Analysis
Initial Investigation
First, check basic network connectivity:
[root@orcl02:/tmp/mcasttest]# ping orcl01-priv
PING orcl01-priv (1.1.1.1) 56(84) bytes of data.
64 bytes from orcl01-priv (1.1.1.1): icmp_seq=1 ttl=64 time=0.053 ms
64 bytes from orcl01-priv (1.1.1.1): icmp_seq=2 ttl=64 time=0.044 ms
64 bytes from orcl01-priv (1.1.1.1): icmp_seq=3 ttl=64 time=0.116 ms
The heartbeat IP answers ping normally, and the firewall is disabled.
MOS Documentation Reference
A search on My Oracle Support turned up a similar case: OCI DBCS : Failed to start CRS on first RAC node - (GIPC) failed to identify the Fast Node Death Detection (FNDD). (Doc ID 2969313.1)
That case identifies the root cause as an inconsistent MTU configuration between cluster nodes:
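The consistency aspect of that check can be sketched in miniature: collect each node's interconnect MTU and compare. The values below are placeholders; on a real cluster they would be read on each node, e.g. from /sys/class/net/bond1/mtu over ssh.

```shell
# Minimal sketch of an MTU consistency comparison across two nodes.
# Placeholder values; in practice read them per node, e.g.:
#   ssh orcl01 cat /sys/class/net/bond1/mtu
mtu_orcl01=9000
mtu_orcl02=9000
if [ "$mtu_orcl01" -eq "$mtu_orcl02" ]; then
  echo "MTU consistent: ${mtu_orcl01}"
else
  echo "MTU mismatch: ${mtu_orcl01} vs ${mtu_orcl02}"
fi
```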
MTU Configuration Check
Check the MTU configuration on both nodes:
## Node 1
[root@orcl01:/home/grid]# ifconfig bond1|grep mtu
bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000

## Node 2
[root@orcl02:/home/grid]$ ifconfig bond1|grep mtu
bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000
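A matching MTU on both NICs does not prove that a full-size frame actually crosses the switch path. A quick manual check is a ping with the Don't-Fragment bit set at full MTU payload (a sketch; the ICMP payload is the MTU minus 28 bytes of IP and ICMP headers, and the ping command is echoed rather than executed here — drop the echo to run it):

```shell
# Verify jumbo frames end-to-end on the interconnect.
# ICMP payload = MTU - 20 (IP header) - 8 (ICMP header) = 8972 for MTU 9000.
MTU=9000
PAYLOAD=$((MTU - 28))
echo "testing with payload ${PAYLOAD}"   # prints: testing with payload 8972
# -M do sets the DF bit so nothing along the path can silently fragment:
echo ping -c 3 -M do -s "${PAYLOAD}" orcl01-priv
```

If this full-size ping fails while the default 56-byte ping succeeds, the switch path is dropping jumbo frames even though both NICs advertise MTU 9000.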
Although the MTU matches on both nodes, a CVU (cluvfy) node connectivity check fails on the maximum-size (MTU) packet test:

Source                          Destination                     Connected?
------------------------------  ------------------------------  ----------------
orcl01[bond1:1.1.1.1]           orcl02[bond1:1.1.1.2]           yes

Check that maximum (MTU) size packet goes through subnet ...FAILED (PRVG-12885, PRVG-12884, PRVG-2043)
subnet mask consistency for subnet "192.168.6.0" ...PASSED
subnet mask consistency for subnet "1.1.1.0" ...PASSED
Node Connectivity ...FAILED (PRVG-12885, PRVG-12884, PRVG-2043)
Multicast or broadcast check ...
  Checking subnet "1.1.1.0" for multicast communication with multicast group "224.0.0.251"
Multicast or broadcast check ...PASSED
Verification of node connectivity was unsuccessful on all the specified nodes.
Failures were encountered during execution of CVU verification request "node connectivity".
Node Connectivity ...FAILED
Check that maximum (MTU) size packet goes through subnet ...FAILED
PRVG-12885 : ICMP packet of MTU size "9000" does not go through subnet "1.1.1.0".
PRVG-12884 : Maximum (MTU) size packet check failed on subnets "1.1.1.0"

orcl01: PRVG-2043 : Command "/bin/ping 1.1.1.2 -c 1 -w 3 -M do -s 8972 " failed on node "orcl01" and produced the following output:
PING 1.1.1.2 (1.1.1.2) 8972(9000) bytes of data.
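The failures above mean the interconnect NICs advertise MTU 9000 but the switch path does not pass jumbo frames. The durable fix is to enable jumbo frames on every switch port in the 1.1.1.0 subnet; a hypothetical interim workaround is to lower bond1's MTU on both nodes to what the path actually passes. One way to derive that value is from the largest DF-bit ping payload that succeeds (a sketch; the 1472 below is illustrative, corresponding to a standard 1500-byte path):

```shell
# Hypothetical workaround sketch: derive a safe interconnect MTU from the
# largest DF-bit ping payload that succeeds (payload + 28 header bytes).
largest_ok_payload=1472          # illustrative: a standard 1500-byte path
safe_mtu=$((largest_ok_payload + 28))
echo "configure on BOTH nodes (as root): ip link set dev bond1 mtu ${safe_mtu}"
```

Whichever route is taken, the MTU must remain identical on all nodes of the private interconnect before restarting CRS.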