Hello everyone, this is the WeChat official account DBA学习之路, where I share knowledge and experience gathered while learning databases.
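Preface

Today, while checking an Oracle RAC 12.2.0.1 database, I found that the cluster commands hung with no response when I queried the cluster status.

After some analysis the problem was resolved. It turned out to be fairly simple, so here is the troubleshooting process.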
Problem analysis

My first suspicion was that a cluster resource was down. Listing the cluster resources showed that ora.crsd was OFFLINE:

    [grid@lucifer1 ~]$ crsctl stat res -t -init
    --------------------------------------------------------------------------------
    Name           Target  State        Server                   State details
    --------------------------------------------------------------------------------
    Cluster Resources
    --------------------------------------------------------------------------------
    ora.asm
          1        ONLINE  ONLINE       mesdb0                   Started,STABLE
    ora.cluster_interconnect.haip
          1        ONLINE  ONLINE       mesdb0                   STABLE
    ora.crf
          1        ONLINE  ONLINE       mesdb0                   STABLE
    ora.crsd
          1        ONLINE  OFFLINE                               STABLE
    ora.cssd
          1        ONLINE  ONLINE       mesdb0                   STABLE
    ora.cssdmonitor
          1        ONLINE  ONLINE       mesdb0                   STABLE
    ora.ctssd
          1        ONLINE  ONLINE       mesdb0                   OBSERVER,STABLE
    ora.diskmon
          1        OFFLINE OFFLINE                               STABLE
    ora.evmd
          1        ONLINE  ONLINE       mesdb0                   STABLE
    ora.gipcd
          1        ONLINE  ONLINE       mesdb0                   STABLE
    ora.gpnpd
          1        ONLINE  ONLINE       mesdb0                   STABLE
    ora.mdnsd
          1        ONLINE  ONLINE       mesdb0                   STABLE
    ora.storage
          1        ONLINE  ONLINE       mesdb0                   STABLE
    --------------------------------------------------------------------------------

    [grid@lucifer2 ~]$ crsctl stat res -t
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4000: Command Status failed, or completed with errors.
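Spotting the one resource whose Target is ONLINE but whose State is not can be scripted. The following is only a minimal sketch: the function name `offline_resources` is made up for illustration, and it assumes the two-line layout that `crsctl stat res -t -init` prints (a resource name line starting with `ora.`, followed by an indented instance line).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Oracle Clusterware).
# Reads `crsctl stat res -t -init` output on stdin and prints every
# resource whose Target is ONLINE while its State is something else.
offline_resources() {
  awk '
    # A resource name line starts at column 0 with "ora."
    /^ora\./ { name = $1; next }
    # An instance line: instance number, Target, State, ...
    name != "" && $1 ~ /^[0-9]+$/ {
      if ($2 == "ONLINE" && $3 != "ONLINE") print name
    }
  '
}
```

It could be used as `crsctl stat res -t -init | offline_resources`. Resources such as ora.diskmon, whose Target is already OFFLINE, are deliberately skipped because they are not expected to be running.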
Check the CRS alert.log (the log was written in a Chinese locale; the message text below is translated):

    2025-02-24 06:11:42.105 [ORAROOTAGENT(29459)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 29459
    2025-02-24 06:12:42.142 [ORAROOTAGENT(29459)]CRS-5818: Aborted command 'check' for resource 'ora.crsd'. Details at (:CRSAGF00113:) {0:15:2} in /oracle/app/grid/diag/crs/mesdb0/crs/trace/ohasd_orarootagent_root.trc.
    2025-02-24 06:13:20.260 [CRSD(30357)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 30357
    2025-02-24 06:13:22.541 [CRSD(30357)]CRS-1019: The OCR service exited on host mesdb0. Details in /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc
    2025-02-24T06:13:22.563713+08:00
    Errors in file /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc (incident=41):
    CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
    Incident details in : /oracle/app/grid/diag/crs/mesdb0/crs/incident/incdir_41/crsd_i41.trc
    2025-02-24 06:13:22.584 [CRSD(30357)]CRS-8505: Oracle Clusterware CRSD process with operating system process ID 30357 encountered internal error CRS-01019
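When the alert log is long, it can help to reduce it to a timeline of timestamp, process, and CRS code. The sketch below assumes the line layout seen in the excerpt above (`timestamp [PROCESS(pid)]CRS-nnnn: text`); the function name `crs_timeline` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads a Clusterware alert log on stdin and prints
# "timestamp process code" for every line that carries a CRS message code.
# Lines without the "[PROCESS(pid)]CRS-nnnn:" tag are ignored.
crs_timeline() {
  sed -nE 's/^([0-9-]+ [0-9:.]+) \[([A-Z]+)\(([0-9]+)\)\](CRS-[0-9]+):.*/\1 \2 \4/p'
}
```

For example, `crs_timeline < alert.log | tail -n 20` would show the last twenty coded events, assuming the alert log is in the current directory. Because only the prefix is matched, the filter works regardless of the language of the message text.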
Check the crsd log:

    2025-02-24 06:13:22.514 : OCRMSG:3187623680: prom_listen: Port str [a0f4-81a3-c06c-03aa]
    2025-02-24 06:13:22.514 : OCRSRV:3187623680: proath_listen: listening to remote requests at portstr [a0f4-81a3-c06c-03aa]
    2025-02-24 06:13:22.518 : OCRMSG:3168728832: prom_listen: Port str [ab1d-0688-2d30-7387]
    2025-02-24 06:13:22.518 : OCRSRV:3168728832: th_invalidate_cache: listening to cache_invalidation requests at portstr [ab1d-0688-2d30-7387]
    2025-02-24 06:13:22.522 : OCRMSG:3166627584: prom_listen: Port str [c71c-c1a3-dc88-994f]
    2025-02-24 06:13:22.522 : OCRSRV:3166627584: proath_listen: listening to remote rim requests at portstr [c71c-c1a3-dc88-994f]
    2025-02-24 06:13:22.533 : OCRMAS:3164526336: th_calc_av: Configured Active Patch Level [0]
    2025-02-24 06:13:22.533 : OCRMAS:3164526336: th_calc_av:5'' : Return persisted APL [0]
    OCRMAS:3164526336: th_calc_av:5': Return persisted AV [203424000] [12.2.0.1.0]
    2025-02-24 06:13:22.535 : OCRMAS:3164526336: th_master_prereg: Persistent upgrade state retrieved from OCR is [0].
    2025-02-24 06:13:22.537 : OCRMAS:3164526336: th_master_prereg: Persistent upgrade toversion buffer retrieved from OCR is [12.2.0.1.0]. Setting toversion to [203424000].
    2025-02-24 06:13:22.541 : CSSCLNT:3164526336: clssgsGroupJoin: member in use group(1/ocrlocal)
    2025-02-24 06:13:22.541 : default:3164526336: procr_reg_localgrp: Error [14] from clssgsreglocalgrp(). Return [23].
    2025-02-24 06:13:22.541 : default:3164526336: SLOS : [clsuSlosFormatDiag called with non-error slos.]
    2025-02-24 06:13:22.541 : OCRMAS:3164526336: th_master_register: Failed to register in OCRLOCAL group. Retval:[23]
    2025-02-24 06:13:22.541 : OCRAPI:3164526336: procr_ctx_set_invalid: ctx is in state [6].
    2025-02-24 06:13:22.541 : OCRAPI:3164526336: procr_ctx_set_invalid: ctx set to invalid
    Trace file /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc
    Oracle Database 12c Clusterware Release 12.2.0.1.0 - Production Copyright 1996, 2016 Oracle. All rights reserved.
    DDE: Flood control is not active
    2025-02-24T06:13:22.564565+08:00
    Incident 41 created, dump file: /oracle/app/grid/diag/crs/mesdb0/crs/incident/incdir_41/crsd_i41.trc
    CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []

    2025-02-24 06:13:22.706 : OCRAPI:3164526336: procr_ctx_set_invalid: Aborting...
    Trace file /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc
    Oracle Database 12c Clusterware Release 12.2.0.1.0 - Production Copyright 1996, 2016 Oracle. All rights reserved.
    default:2552033344: 1: clskec:has:CLSU:910 4 args[CLSD00302][mod=clsdadr.c][loc=(:CLSD00302:)][msg=clsdAdrInit: Trace file size and number of segments fetched from environemnt variable: ORA_DAEMON_TRACE_FILE_OPTIONS filesize=26214400,numsegments=10]
    CLSB:2552033344: Argument count (argc) for this daemon is 2
    CLSB:2552033344: Argument 0 is: /oracle/app/12.2.0/grid/bin/crsd.bin
    CLSB:2552033344: Argument 1 is: reboot
    2025-02-24 06:13:22.829 : CSSCLNT:2552033344: clsssinit: initialized context: (0x4edf930) flags 0x207
    2025-02-24 06:13:22.829 : CRSMAIN:2552033344: First attempt: init CSS context succeeded.
    2025-02-24 06:13:22.829 : CRSMAIN:2552033344: Start mode: normal
    2025-02-24 06:13:22.831 : CLSDMT:2343307008: PID for the Process [30402], connkey CRSD
    2025-02-24 06:13:23.745 : CRSMAIN:2552033344: CRS Daemon Starting
    2025-02-24 06:13:23.745 : CRSMAIN:2343307008: Process environment is not initialized yet!
    2025-02-24 06:13:23.746 : CRSD:2552033344: Logging level for Module: clsdadr 0
    2025-02-24 06:13:23.746 : CRSD:2552033344: Logging level for Module: clsdnreg 0
    2025-02-24 06:13:23.746 : CRSD:2552033344: Logging level for Module: clsdynam 0

The key lines are the failure to register in the OCRLOCAL group ("member in use group(1/ocrlocal)", "Failed to register in OCRLOCAL group. Retval:[23]"), after which crsd raises CRS-1019, aborts, and is started again.
Look at the incident trace dump:

    ----- Invocation Context Dump -----
    Address: 0x7f1a9c024340
    Phase: 3
    flags: 0x10E0000
    Incident ID: 41
    Error Descriptor: CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
    Error class: 0
    Problem Key
    Number of actions: 10

    ----- Incident Context Dump -----
    Address: 0x7f1abc9d99d0
    Incident ID: 41
    Problem Key: CRS 1019
    Error: CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
    [00]: dbgePostErrorDirectVaList_int [diag_dde]
    [01]: dbgePostErrorDirect [diag_dde]
    [02]: clsdAdrPostError []
    [03]: clsdadrpr_CreateIncidentCheck []
    [04]: clsdadrprAlert []
    [05]: clsd_alertprintft []
    [06]: proath_master_exit_helper []<-- Signaling
    [07]: proath_master_register []
    [08]: proath_master []
    [09]: start_thread []
    MD [00]: 'Client ProcId' ='crsd.bin@mesdb0.30357_139752810403584' (0x0)
    Impact 0:
    Impact 1:
    Impact 2:
    Impact 3:
    Derived Impact:
    ----- END Incident Context Dump -----
This looked very much like a bug. A search on MOS (My Oracle Support) turned up two closely matching notes:
crsd.bin Fail With Error CRS-1019 When ohasd Restarted (Doc ID 2291799.1)
Bug 24396050 - crsd.bin failed several times with error CRS-1019 (Doc ID 24396050.8)
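The symptoms described on MOS match the logs from this incident exactly, which confirms that this is a bug and that a patch is needed to fix it.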
Resolution

Download the patch required to fix the bug:

Patch 24396050: LNX64-12.2-CRS: CRSD.BIN FAILED SEVERAL TIMES WITH ERROR CRS-1019

Update OPatch

The patch README states: You must use the OPatch utility version 12.2.0.1.5 or later to apply this patch.
Check whether the current OPatch version meets the requirement:

    [grid@mesdb0 ~]$ cd $ORACLE_HOME/OPatch/
    [grid@mesdb0 OPatch]$ ./opatch version
    OPatch Version: 12.2.0.1.6

    OPatch succeeded.

Version 12.2.0.1.6 meets the requirement, so OPatch does not need to be updated.
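The version comparison can also be done mechanically rather than by eye. This is a small sketch, not something from the patch README: the helper names `opatch_version_from_output` and `version_ge` are hypothetical, and `sort -V` assumes GNU coreutils, which is what a Linux host like this one normally ships.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for checking the OPatch minimum version.

# Extract the version string from `opatch version` output read on stdin.
opatch_version_from_output() {
  sed -n 's/^OPatch Version: *//p'
}

# Succeed (exit 0) when dotted version $1 is greater than or equal to $2.
# sort -V orders version strings numerically field by field, so the
# smaller of the two ends up on the first line.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

A possible use: `version_ge "$($ORACLE_HOME/OPatch/opatch version | opatch_version_from_output)" 12.2.0.1.5 && echo "OPatch is new enough"`. A plain string comparison would get cases such as 12.2.0.1.10 versus 12.2.0.1.5 wrong, which is why a version-aware sort is used here.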
Unzip the patch

    unzip -q /soft/p24396050_122010_Linux-x86-64.zip -d /soft/
    chown -R oracle:oinstall /soft/24396050
Install the patch

    export GI_HOME=/oracle/app/12.2.0/grid
    $GI_HOME/OPatch/opatchauto apply /soft/24396050 -analyze
    $GI_HOME/OPatch/opatchauto apply /soft/24396050 -oh $GI_HOME
After installing the patch, I rebooted the system and verified that the cluster was back to normal.
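One way to confirm that the patch is actually registered in the Grid home inventory is to look for its number in the output of `opatch lspatches`. The sketch below assumes that command's usual `patch_number;description` line format; the function name `patch_is_listed` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when patch number $1 appears
# at the start of a line in `opatch lspatches` output read on stdin.
patch_is_listed() {
  grep -q "^$1;"
}
```

For instance, `$GI_HOME/OPatch/opatch lspatches | patch_is_listed 24396050 && echo "patch 24396050 is installed"`. Combined with another `crsctl stat res -t -init` run showing ora.crsd as ONLINE, this gives a quick post-patch check.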