大家好,这里是公众号 DBA学习之路 ,分享一些学习国产数据库路上的知识和经验。
目录
前言 今天检查一套 Oracle RAC 12.2.0.1 数据库,检查集群状态时,发现集群命令一直夯着没反应:
经过一顿分析,终于解决问题,比较简单,这里分享一下处理过程。
问题分析 首先怀疑是集群资源挂了,查看集群资源,发现 ora.crsd 挂了:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ## 节点一 [grid@lucifer1 ~]$ crsctl stat res -t -init-------------------------------------------------------------------------------- Name Target State Server State details -------------------------------------------------------------------------------- Cluster Resources -------------------------------------------------------------------------------- ora.asm 1 ONLINE ONLINE mesdb0 Started,STABLE ora.cluster_interconnect.haip 1 ONLINE ONLINE mesdb0 STABLE ora.crf 1 ONLINE ONLINE mesdb0 STABLE ora.crsd 1 ONLINE OFFLINE STABLE ora.cssd 1 ONLINE ONLINE mesdb0 STABLE ora.cssdmonitor 1 ONLINE ONLINE mesdb0 STABLE ora.ctssd 1 ONLINE ONLINE mesdb0 OBSERVER,STABLE ora.diskmon 1 OFFLINE OFFLINE STABLE ora.evmd 1 ONLINE ONLINE mesdb0 STABLE ora.gipcd 1 ONLINE ONLINE mesdb0 STABLE ora.gpnpd 1 ONLINE ONLINE mesdb0 STABLE ora.mdnsd 1 ONLINE ONLINE mesdb0 STABLE ora.storage 1 ONLINE ONLINE mesdb0 STABLE -------------------------------------------------------------------------------- ## 节点二 [grid@lucifer2 ~]$ crsctl stat res -t CRS-4535: Cannot communicate with Cluster Ready Services CRS-4000: Command Status failed, or completed with errors.
检查 crs alert.log 日志:
1 2 3 4 5 6 7 8 9 10 2025 -02 -24 06 :11 :42.105 [ORAROOTAGENT(29459 )]CRS-8500 : Oracle Clusterware ORAROOTAGENT 进程以操作系统进程 ID 29459 开头2025 -02 -24 06 :12 :42.142 [ORAROOTAGENT(29459 )]CRS-5818 : 已中止命令 'check' (对于资源 'ora.crsd' )。详细资料见 (:CRSAGF00113:) {0 :15 :2 } (位于 /oracle/ app/grid/ diag/crs/m esdb0/crs/ trace/ohasd_orarootagent_root.trc)。2025 -02 -24 06 :13 :20.260 [CRSD(30357 )]CRS-8500 : Oracle Clusterware CRSD 进程以操作系统进程 ID 30357 开头2025 -02 -24 06 :13 :22.541 [CRSD(30357 )]CRS-1019 : 主机 mesdb0 上的 OCR 服务已退出。详细资料见 /oracle/ app/grid/ diag/crs/m esdb0/crs/ trace/crsd.trc2025 -02 -24 T06:13 :22.563713 +08 :00 Errors in file /oracle/ app/grid/ diag/crs/m esdb0/crs/ trace/crsd.trc (incident=41 ): CRS-1019 [] [] [] [] [] [] [] [] [] [] [] [] Incident details in: /oracle/ app/grid/ diag/crs/m esdb0/crs/i ncident/incdir_41/ crsd_i41.trc2025 -02 -24 06 :13 :22.584 [CRSD(30357 )]CRS-8505 : Oracle Clusterware CRSD 进程 (具有操作系统进程 ID 30357 ) 遇到内部错误 CRS-01019
检查 crsd 日志:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 2025 -02 -24 06 :13 :22 .514 : OCRMSG:3187623680 : prom_listen: Port str [a0f4-81a3-c06c-03aa] 2025 -02 -24 06 :13 :22 .514 : OCRSRV:3187623680 : proath_listen: listening to remote requests at portstr [a0f4-81a3-c06c-03aa] 2025 -02 -24 06 :13 :22 .518 : OCRMSG:3168728832 : prom_listen: Port str [ab1d-0688-2d30-7387] 2025 -02 -24 06 :13 :22 .518 : OCRSRV:3168728832 : th_invalidate_cache: listening to cache_invalidation requests at portstr [ab1d-0688-2d30-7387] 2025 -02 -24 06 :13 :22 .522 : OCRMSG:3166627584 : prom_listen: Port str [c71c-c1a3-dc88-994f] 2025 -02 -24 06 :13 :22 .522 : OCRSRV:3166627584 : proath_listen: listening to remote rim requests at portstr [c71c-c1a3-dc88-994f] 2025 -02 -24 06 :13 :22 .533 : OCRMAS:3164526336 : th_calc_av: Configured Active Patch Level [0] 2025 -02 -24 06 :13 :22 .533 : OCRMAS:3164526336 : th_calc_av:5 '': Return persisted APL [0] OCRMAS :3164526336 : th_calc_av:5 ': Return persisted AV [203424000] [12.2.0.1.0] 2025 -02 -24 06 :13 :22 .535 : OCRMAS:3164526336 : th_master_prereg: Persistent upgrade state retrieved from OCR is [0]. 2025-02-24 06:13:22.537 : OCRMAS:3164526336: th_master_prereg: Persistent upgrade toversion buffer retrieved from OCR is [12.2.0.1.0]. Setting toversion to [203424000]. 2025-02-24 06:13:22.541 : CSSCLNT:3164526336: clssgsGroupJoin: member in use group(1/ocrlocal) 2025-02-24 06:13:22.541 : default:3164526336: procr_reg_localgrp: Error [14] from clssgsreglocalgrp(). Return [23]. 2025-02-24 06:13:22.541 : default:3164526336: SLOS : [clsuSlosFormatDiag called with non-error slos.] 2025 -02 -24 06 :13 :22 .541 : OCRMAS:3164526336 : th_master_register: Failed to register in OCRLOCAL group. Retval:[23 ]2025 -02 -24 06 :13 :22 .541 : OCRAPI:3164526336 : procr_ctx_set_invalid: ctx is in state [6]. 2025-02-24 06:13:22.541 : OCRAPI:3164526336: procr_ctx_set_invalid: ctx set to invalid Trace file /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc Oracle Database 12c Clusterware Release 12.2.0.1.0 - Production Copyright 1996, 2016 Oracle. All rights reserved. DDE: Flood control is not active 2025-02-24T06:13:22.564565+08:00 Incident 41 created, dump file: /oracle/app/grid/diag/crs/mesdb0/crs/incident/incdir_41/crsd_i41.trc CRS-1019 [] [] [] [] [] [] [] [] [] [] [] [] 2025 -02 -24 06 :13 :22 .706 : OCRAPI:3164526336 : procr_ctx_set_invalid: Aborting...Trace file /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trcOracle Database 12 c Clusterware Release 12.2.0.1 .0 - Production Copyright 1996 , 2016 Oracle. All rights reserved. default :2552033344 : 1 : clskec:has:CLSU:910 4 args[CLSD00302][mod=clsdadr.c][loc=(:CLSD00302:)][msg=clsdAdrInit: Trace file size and number of segments fetched from environemnt variable: ORA_DAEMON_TRACE_FILE_OPTIONS filesize=26214400 ,numsegments=10 ] CLSB :2552033344 : Argument count (argc) for this daemon is 2 CLSB :2552033344 : Argument 0 is: /oracle/app/12 .2 .0 /grid/bin/crsd.bin CLSB :2552033344 : Argument 1 is: reboot2025 -02 -24 06 :13 :22 .829 : CSSCLNT:2552033344 : clsssinit: initialized context: (0 x4edf930) flags 0 x2072025 -02 -24 06 :13 :22 .829 : CRSMAIN:2552033344 : First attempt: init CSS context succeeded.2025 -02 -24 06 :13 :22 .829 : CRSMAIN:2552033344 : Start mode: normal2025 -02 -24 06 :13 :22 .831 : CLSDMT:2343307008 : PID for the Process [30402], connkey CRSD 2025-02-24 06:13:23.745 : CRSMAIN:2552033344: CRS Daemon Starting 2025-02-24 06:13:23.745 : CRSMAIN:2343307008: Process environment is not initialized yet! 2025-02-24 06:13:23.746 : CRSD:2552033344: Logging level for Module: clsdadr 0 2025-02-24 06:13:23.746 : CRSD:2552033344: Logging level for Module: clsdnreg 0 2025-02-24 06:13:23.746 : CRSD:2552033344: Logging level for Module: clsdynam 0
查看 trace dump 日志:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 ----- Invocation Context Dump -----Address : 0x7f1a9c024340 Phase : 3 flags: 0x10E0000 Incident ID : 41 Error Descriptor : CRS -1019 [] [] [] [] [] [] [] [] [] [] [] []Error class: 0 Problem Key # of args: 0 Number of actions: 10 ----- Incident Context Dump -----Address : 0x7f1abc9d99d0 Incident ID : 41 Problem Key : CRS 1019 Error : CRS -1019 [] [] [] [] [] [] [] [] [] [] [] [] [00 ]: dbgePostErrorDirectVaList_int [diag_dde] [01 ]: dbgePostErrorDirect [diag_dde] [02 ]: clsdAdrPostError [] [03 ]: clsdadrpr_CreateIncidentCheck [] [04 ]: clsdadrprAlert [] [05 ]: clsd_alertprintft [] [06 ]: proath_master_exit_helper []<-- Signaling [07 ]: proath_master_register [] [08 ]: proath_master [] [09 ]: start_thread []MD [00 ]: 'Client ProcId' ='crsd.bin@mesdb0.30357_139752810403584' (0x0 )Impact 0 :Impact 1 :Impact 2 :Impact 3 :Derived Impact : ----- END Incident Context Dump -----
看着很像是 BUG,在 MOS 搜索后发现有一个文章很匹配:
crsd.bin Fail With Error CRS-1019 When ohasd Restarted (Doc ID 2291799.1)
Bug 24396050 - crsd.bin failed several times with error CRS-1019 (Doc ID 24396050.8)
MOS 截图如下:
MOS 内容与问题日志完全一致,确认是 BUG,需要进行补丁修复:
解决问题 下载 BUG 所需补丁 Patch 24396050: LNX64-12.2-CRS: CRSD.BIN FAILED SEVERAL TIMES WITH ERROR CRS-1019
更新 OPatch 查看补丁 README:You must use the OPatch utility version 12.2.0.1.5 or later to apply this patch.
检查当前 OPatch 补丁是否符合要求:
1 2 3 4 5 [grid@mesdb0 ~]$ cd $ORACLE_HOME /OPatch/ [grid@mesdb0 OPatch]$ ./opatch version OPatch Version: 12.2 .0.1 .6 OPatch succeeded.
符合要求,不需要更新 OPatch。
解压补丁 1 2 3 unzip -q /soft/p24396050_122010_Linux-x86-64.zip -d /soft/chown -R oracle:oinstall /soft/24396050
安装补丁 1 2 3 4 5 6 7 8 export GI_HOME=/oracle/app/12.2.0/grid$GI_HOME /OPatch/opatchauto apply /soft/24396050 -analyze$GI_HOME /OPatch/opatchauto apply /soft/24396050 -oh $GI_HOME
安装补丁后重启系统验证集群已经恢复正常。