CRS-4535:
Cannot communicate with Cluster Ready Services
Environment:
Enterprise Linux Enterprise Linux Server release 5.5
(Carthage)
Oracle Grid Infrastructure 11.2.0.1
Oracle database server 11.2.0.1
> crsctl check crs
CRS-4638: Oracle High
Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster
Synchronization Services is online
CRS-4533: Event
Manager is online
ocrcheck was successful from one node and fails in the
second node.
[root@atlracp01 crsd]#
ocrcheck
Status of Oracle
Cluster Registry is as follows :
Version : 3
Total space (kbytes) :
262120
Used space (kbytes) :
2392
Available space (kbytes) : 259728
ID : 1786641452
Device/File Name : /u01_CRS/ocr/ocr1
Device/File
integrity check succeeded
Device/File Name : /u02_CRS/ocr/ocr2
Device/File
integrity check succeeded
Device/File Name : /u03_CRS/ocr/ocr3
Device/File
integrity check succeeded
Device/File not configured
Device/File
not configured
Cluster registry integrity check succeeded
Logical corruption check succeeded
[root@atlracp02 ~]#
ocrcheck
PROT-602: Failed to
retrieve data from the cluster registry
PROC-26: Error while
accessing the physical storage Operating System error [No such file or
directory] [2]
So, the OCR itself is good and there is no corruption. I
checked the permissions, ownership, etc and they looked fine and the same in
both the nodes in the cluster. But, one of the cluster was definitely having
some issue.
So, do not attempt to restore OCR from backups.
Checking the crsd.log from
$GI_HOME/log/<host_name>/crsd
[root@atlracp02 crsd]#
tail -f crsd.log
2012-08-24
13:58:23.903: [
OCROSD][538518096]utopen:6m'': OCR location [/u01_CRS/ocr/ocr1] configured
is not a valid storage type. Rturn code [37].
2012-08-24
13:58:23.903: [
OCROSD][538518096]utopen:7:failed to open any OCR file/disk, errno=9, os
err string=Bad file descriptor
2012-08-24
13:58:23.903: [
OCRRAW][538518096]proprinit: Could not open raw device
2012-08-24
13:58:23.903: [ default][538518096]a_init:7!: Backend init unsuccessful : [26]
2012-08-24
13:58:24.908: [ OCROSD][538518096]NFS
file system /u01_CRS/ocr mounted with incorrect options
2012-08-24
13:58:24.909: [ OCROSD][538518096]WARNING:Expected
NFS mount options: wsize>=32768,rsize>=32768,hard,(noac | actimeo=0 |
acregmin=0,acregmax=0,acdirmin=0,acdirmax=0)
2012-08-24
13:58:24.909: [
OCROSD][538518096]utopen:6m'': OCR location [/u01_CRS/ocr/ocr1]
configured is not a valid storage type. Rturn code [37].
2012-08-24
13:58:24.909: [
OCROSD][538518096]utopen:7:failed to open any OCR file/disk, errno=9, os
err string=Bad file descriptor
2012-08-24
13:58:24.909: [
OCRRAW][538518096]proprinit: Could not open raw device
2012-08-24
13:58:24.909: [ default][538518096]a_init:7!: Backend init unsuccessful : [26]
2012-08-24
13:58:25.911: [ OCROSD][538518096]NFS
file system /u01_CRS/ocr mounted with incorrect options
2012-08-24
13:58:25.911: [ OCROSD][538518096]WARNING:Expected
NFS mount options: wsize>=32768,rsize>=32768,hard,(noac | actimeo=0 |
acregmin=0,acregmax=0,acdirmin=0,acdirmax=0)
2012-08-24
13:58:25.911: [
OCROSD][538518096]utopen:6m'': OCR location [/u01_CRS/ocr/ocr1]
configured is not a valid storage type. Rturn code [37].
2012-08-24
13:58:25.911: [
OCROSD][538518096]utopen:7:failed to open any OCR file/disk, errno=9, os
err string=Bad file descriptor
2012-08-24
13:58:25.911: [
OCRRAW][538518096]proprinit: Could not open raw device
2012-08-24
13:58:25.911: [ default][538518096]a_init:7!: Backend init unsuccessful : [26]
Checking mount options on the failing node.
[root@atlracp02 ~]#
cat /etc/fstab|grep -i ocr
atlprod01-node2:/vol/oracle_crs0labpp
/u01_CRS/ocr nfs
rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0 0 0
atlprod01-node1:/vol/oracle_crs1labpp
/u02_CRS/ocr nfs
rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0 0 0
atlprod01-node2:/vol/oracle_crs2labpp
/u03_CRS/ocr nfs rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0
0 0
[root@atlracp02 ~]# cat /proc/mounts |grep -i
ocr
atlprod01-node2:/vol/oracle_crs0labpp
/u01_CRS/ocr nfs
rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2
0 0
atlprod01-node1:/vol/oracle_crs1labpp
/u02_CRS/ocr nfs
rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node1
0 0
atlprod01-node2:/vol/oracle_crs2labpp
/u03_CRS/ocr nfs rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2
0 0
atlprod01-node1:/vol/oracle_crs1labpp/.snapshot
/u02_CRS/ocr/.snapshot nfs
rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node1
0 0
The fstab entries look good to me.. It is the same as the
node where i do not have any issues. After exhausting all other options/checks,
decided to unmount and mount the OCR volumes on the failing node. Note, this is
a Netapp NFS cluster file system.
[root@atlracp02 crsd]#
df -Ph|grep -i ocr
atlprod01-node2:/vol/oracle_crs0labpp 1.0G
4.7M 1020M 1% /u01_CRS/ocr
atlprod01-node1:/vol/oracle_crs1labpp 1.0G
4.7M 1020M 1% /u02_CRS/ocr
atlprod01-node2:/vol/oracle_crs2labpp 1.0G
4.7M 1020M 1% /u03_CRS/ocr
[root@atlracp02 crsd]#
umount /u01_CRS/ocr
[root@atlracp02 crsd]#
df -Ph|grep -i ocr
atlprod01-node1:/vol/oracle_crs1labpp 1.0G
4.7M 1020M 1% /u02_CRS/ocr
atlprod01-node2:/vol/oracle_crs2labpp 1.0G
4.7M 1020M 1% /u03_CRS/ocr
[root@atlracp02 crsd]#
mount /u01_CRS/ocr
[root@atlracp02 crsd]#
df -Ph|grep -i ocr
atlprod01-node1:/vol/oracle_crs1labpp 1.0G
4.7M 1020M 1% /u02_CRS/ocr
atlprod01-node2:/vol/oracle_crs2labpp 1.0G
4.7M 1020M 1% /u03_CRS/ocr
atlprod01-node2:/vol/oracle_crs0labpp 1.0G
4.7M 1020M 1% /u01_CRS/ocr
The /proc/mounts entry for the re-mounted volume looked
more aligned to my fstab entry.
[root@atlracp02 crsd]#
cat /proc/mounts |grep -i ocr
atlprod01-node1:/vol/oracle_crs1labpp
/u02_CRS/ocr nfs
rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node1
0 0
atlprod01-node2:/vol/oracle_crs2labpp
/u03_CRS/ocr nfs
rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2
0 0
atlprod01-node2:/vol/oracle_crs0labpp
/u01_CRS/ocr nfs
rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2
0
Proceeding with unmounting and mounting the remaining 2
OCR volumes.
[root@atlracp02 crsd]#
umount /u02_CRS/ocr
[root@atlracp02 crsd]#
mount /u02_CRS/ocr
[root@atlracp02 crsd]#
cat /proc/mounts |grep -i ocr
atlprod01-node2:/vol/oracle_crs2labpp
/u03_CRS/ocr nfs
rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2
0 0
atlprod01-node2:/vol/oracle_crs0labpp
/u01_CRS/ocr nfs
rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2
0 0
atlprod01-node1:/vol/oracle_crs1labpp
/u02_CRS/ocr nfs
rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node1
0 0
[root@atlracp02 crsd]#
umount /u03_CRS/ocr
umount: /u03_CRS/ocr:
device is busy
umount: /u03_CRS/ocr:
device is busy
[root@atlracp02 crsd]#
umount -f /u03_CRS/ocr
umount2: Device or
resource busy
umount: /u03_CRS/ocr:
device is busy
umount2: Device or
resource busy
umount: /u03_CRS/ocr:
device is busy
Tried to do a lazy unmount.
[root@atlracp02 crsd]#
umount -l /u03_CRS/ocr
It worked..
[root@atlracp02
crsd]# cat /proc/mounts |grep -i ocr
atlprod01-node2:/vol/oracle_crs0labpp
/u01_CRS/ocr nfs
rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2
0 0
atlprod01-node1:/vol/oracle_crs1labpp
/u02_CRS/ocr nfs rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node1
0 0
[root@atlracp02 crsd]#
mount /u03_CRS/ocr
[root@atlracp02 crsd]#
cat /proc/mounts |grep -i ocr
atlprod01-node2:/vol/oracle_crs0labpp
/u01_CRS/ocr nfs
rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2
0 0
atlprod01-node1:/vol/oracle_crs1labpp
/u02_CRS/ocr nfs rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node1
0 0
atlprod01-node2:/vol/oracle_crs2labpp
/u03_CRS/ocr nfs
rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2
0 0
Now, stopped and started crs and it came back fine.
crsctl check crs
CRS-4638: Oracle High
Availability Services is online
CRS-4537: Cluster
Ready Services is online
CRS-4529: Cluster
Synchronization Services is online
CRS-4533: Event
Manager is online
Thanks so much - that really helped me. Clusterware was down on one node because ocr mounts were not correct.
ReplyDeletePatrick
Hi...
ReplyDelete[root@rac1 crsd]# tail -f crsd.log
2018-07-04 15:00:22.377: [ OCRASM][1066164720]proprasmcl: asmhandle is NULL
2018-07-04 15:00:22.377: [ OCRRAW][1066164720]proprinit: Could not open raw device
2018-07-04 15:00:22.377: [ OCRASM][1066164720]proprasmcl: asmhandle is NULL
2018-07-04 15:00:22.378: [ OCRAPI][1066164720]a_init:16!: Backend init unsuccessful : [26]
2018-07-04 15:00:22.378: [ CRSOCR][1066164720] OCR context init failure. Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
] [7]
2018-07-04 15:00:22.378: [ CRSD][1066164720][PANIC] CRSD exiting: Could not init OCR, code: 26
2018-07-04 15:00:22.379: [ CRSD][1066164720] Done.
Hi..
ReplyDelete[root@rac1 crsd]# tail -f crsd.log
2018-07-04 15:00:22.377: [ OCRASM][1066164720]proprasmcl: asmhandle is NULL
2018-07-04 15:00:22.377: [ OCRRAW][1066164720]proprinit: Could not open raw device
2018-07-04 15:00:22.377: [ OCRASM][1066164720]proprasmcl: asmhandle is NULL
2018-07-04 15:00:22.378: [ OCRAPI][1066164720]a_init:16!: Backend init unsuccessful : [26]
2018-07-04 15:00:22.378: [ CRSOCR][1066164720] OCR context init failure. Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
] [7]
2018-07-04 15:00:22.378: [ CRSD][1066164720][PANIC] CRSD exiting: Could not init OCR, code: 26
2018-07-04 15:00:22.379: [ CRSD][1066164720] Done.