Monday, August 27, 2012

CRS-4535: Cannot communicate with Cluster Ready Services



Environment:
Enterprise Linux Enterprise Linux Server release 5.5 (Carthage)
Oracle Grid Infrastructure 11.2.0.1
Oracle database server 11.2.0.1


> crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

ocrcheck succeeded on the first node (atlracp01) but failed on the second (atlracp02):

[root@atlracp01 crsd]# ocrcheck
Status of Oracle Cluster Registry is as follows :
          Version                  :          3
          Total space (kbytes)     :     262120
          Used space (kbytes)      :       2392
          Available space (kbytes) :     259728
          ID                       : 1786641452
          Device/File Name         : /u01_CRS/ocr/ocr1
                                    Device/File integrity check succeeded
          Device/File Name         : /u02_CRS/ocr/ocr2
                                    Device/File integrity check succeeded
          Device/File Name         : /u03_CRS/ocr/ocr3
                                    Device/File integrity check succeeded

                                    Device/File not configured

                                    Device/File not configured

          Cluster registry integrity check succeeded

          Logical corruption check succeeded


[root@atlracp02 ~]# ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]


So the OCR itself is intact; there is no corruption. I checked permissions, ownership, etc., and they looked fine and identical on both nodes, yet one node was clearly having an issue. Since the registry is not corrupt, there is no reason to attempt an OCR restore from backup.

Checking crsd.log under $GI_HOME/log/<host_name>/crsd on the failing node:

[root@atlracp02 crsd]# tail -f crsd.log
2012-08-24 13:58:23.903: [  OCROSD][538518096]utopen:6m'': OCR location [/u01_CRS/ocr/ocr1] configured is not a valid storage type. Rturn code [37].
2012-08-24 13:58:23.903: [  OCROSD][538518096]utopen:7:failed to open any OCR file/disk, errno=9, os err string=Bad file descriptor
2012-08-24 13:58:23.903: [  OCRRAW][538518096]proprinit: Could not open raw device
2012-08-24 13:58:23.903: [ default][538518096]a_init:7!: Backend init unsuccessful : [26]
2012-08-24 13:58:24.908: [  OCROSD][538518096]NFS file system /u01_CRS/ocr mounted with incorrect options
2012-08-24 13:58:24.909: [  OCROSD][538518096]WARNING:Expected NFS mount options: wsize>=32768,rsize>=32768,hard,(noac | actimeo=0 | acregmin=0,acregmax=0,acdirmin=0,acdirmax=0)
2012-08-24 13:58:24.909: [  OCROSD][538518096]utopen:6m'': OCR location [/u01_CRS/ocr/ocr1] configured is not a valid storage type. Rturn code [37].
2012-08-24 13:58:24.909: [  OCROSD][538518096]utopen:7:failed to open any OCR file/disk, errno=9, os err string=Bad file descriptor
2012-08-24 13:58:24.909: [  OCRRAW][538518096]proprinit: Could not open raw device
2012-08-24 13:58:24.909: [ default][538518096]a_init:7!: Backend init unsuccessful : [26]
2012-08-24 13:58:25.911: [  OCROSD][538518096]NFS file system /u01_CRS/ocr mounted with incorrect options
2012-08-24 13:58:25.911: [  OCROSD][538518096]WARNING:Expected NFS mount options: wsize>=32768,rsize>=32768,hard,(noac | actimeo=0 | acregmin=0,acregmax=0,acdirmin=0,acdirmax=0)
2012-08-24 13:58:25.911: [  OCROSD][538518096]utopen:6m'': OCR location [/u01_CRS/ocr/ocr1] configured is not a valid storage type. Rturn code [37].
2012-08-24 13:58:25.911: [  OCROSD][538518096]utopen:7:failed to open any OCR file/disk, errno=9, os err string=Bad file descriptor
2012-08-24 13:58:25.911: [  OCRRAW][538518096]proprinit: Could not open raw device
2012-08-24 13:58:25.911: [ default][538518096]a_init:7!: Backend init unsuccessful : [26]


Checking mount options on the failing node.

[root@atlracp02 ~]# cat /etc/fstab|grep -i ocr
atlprod01-node2:/vol/oracle_crs0labpp /u01_CRS/ocr nfs rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0 0 0
atlprod01-node1:/vol/oracle_crs1labpp /u02_CRS/ocr nfs rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0 0 0
atlprod01-node2:/vol/oracle_crs2labpp /u03_CRS/ocr nfs rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0 0 0

[root@atlracp02 ~]# cat /proc/mounts |grep -i ocr
atlprod01-node2:/vol/oracle_crs0labpp /u01_CRS/ocr nfs rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2 0 0
atlprod01-node1:/vol/oracle_crs1labpp /u02_CRS/ocr nfs rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node1 0 0
atlprod01-node2:/vol/oracle_crs2labpp /u03_CRS/ocr nfs rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2 0 0
atlprod01-node1:/vol/oracle_crs1labpp/.snapshot /u02_CRS/ocr/.snapshot nfs rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node1 0 0

The fstab entries look correct and match the healthy node exactly. But compare them with /proc/mounts above: the live mounts show rsize=65536/wsize=65536 and none of the actimeo/acregmin options, so the actimeo=0 from fstab never took effect, which is exactly what the crsd.log warning is complaining about. After exhausting the other options and checks, I decided to unmount and remount the OCR volumes on the failing node. Note that this is a NetApp NFS cluster file system.
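The requirement from the crsd.log warning can be checked mechanically. Below is a minimal POSIX-shell sketch (`check_ocr_mount_opts` is a hypothetical helper, not an Oracle tool) that takes the option string from field 4 of a /proc/mounts line and reports whether it satisfies rsize>=32768, wsize>=32768, hard, and disabled attribute caching:

```shell
#!/bin/sh
# Hypothetical helper: validate an NFS option string (field 4 of /proc/mounts)
# against the options crsd expects for OCR volumes:
#   rsize>=32768, wsize>=32768, hard,
#   and noac | actimeo=0 | acregmin=0,acregmax=0,acdirmin=0,acdirmax=0
check_ocr_mount_opts() {
    toks=$(printf '%s\n' "$1" | tr ',' '\n')   # one option token per line
    rsize=$(printf '%s\n' "$toks" | sed -n 's/^rsize=//p')
    wsize=$(printf '%s\n' "$toks" | sed -n 's/^wsize=//p')
    verdict=OK
    [ "${rsize:-0}" -ge 32768 ] || verdict=BAD
    [ "${wsize:-0}" -ge 32768 ] || verdict=BAD
    printf '%s\n' "$toks" | grep -qx hard || verdict=BAD
    # attribute caching must be disabled one way or another
    if ! printf '%s\n' "$toks" | grep -Eqx 'noac|actimeo=0'; then
        for o in acregmin=0 acregmax=0 acdirmin=0 acdirmax=0; do
            printf '%s\n' "$toks" | grep -qx "$o" || verdict=BAD
        done
    fi
    echo "$verdict"
}
```

Fed the failing mount's option string (rsize=65536 with no actimeo/acregmin) it reports BAD; fed a string with rsize=32768 and all four ac* options set to 0 it reports OK.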


[root@atlracp02 crsd]# df -Ph|grep -i ocr
atlprod01-node2:/vol/oracle_crs0labpp  1.0G  4.7M 1020M   1% /u01_CRS/ocr
atlprod01-node1:/vol/oracle_crs1labpp  1.0G  4.7M 1020M   1% /u02_CRS/ocr
atlprod01-node2:/vol/oracle_crs2labpp  1.0G  4.7M 1020M   1% /u03_CRS/ocr

[root@atlracp02 crsd]# umount /u01_CRS/ocr

[root@atlracp02 crsd]# df -Ph|grep -i ocr
atlprod01-node1:/vol/oracle_crs1labpp  1.0G  4.7M 1020M   1% /u02_CRS/ocr
atlprod01-node2:/vol/oracle_crs2labpp  1.0G  4.7M 1020M   1% /u03_CRS/ocr

[root@atlracp02 crsd]# mount /u01_CRS/ocr

[root@atlracp02 crsd]# df -Ph|grep -i ocr
atlprod01-node1:/vol/oracle_crs1labpp  1.0G  4.7M 1020M   1% /u02_CRS/ocr
atlprod01-node2:/vol/oracle_crs2labpp  1.0G  4.7M 1020M   1% /u03_CRS/ocr
atlprod01-node2:/vol/oracle_crs0labpp  1.0G  4.7M 1020M   1% /u01_CRS/ocr

The /proc/mounts entry for the remounted volume now carries acregmin=0,acregmax=0,acdirmin=0,acdirmax=0, i.e. the actimeo=0 from the fstab entry is finally in effect.

[root@atlracp02 crsd]# cat /proc/mounts |grep -i ocr
atlprod01-node1:/vol/oracle_crs1labpp /u02_CRS/ocr nfs rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node1 0 0
atlprod01-node2:/vol/oracle_crs2labpp /u03_CRS/ocr nfs rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2 0 0
atlprod01-node2:/vol/oracle_crs0labpp /u01_CRS/ocr nfs rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2 0 0
               
Proceeding to unmount and remount the remaining two OCR volumes.

[root@atlracp02 crsd]# umount /u02_CRS/ocr
[root@atlracp02 crsd]# mount /u02_CRS/ocr
[root@atlracp02 crsd]# cat /proc/mounts |grep -i ocr
atlprod01-node2:/vol/oracle_crs2labpp /u03_CRS/ocr nfs rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2 0 0
atlprod01-node2:/vol/oracle_crs0labpp /u01_CRS/ocr nfs rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2 0 0
atlprod01-node1:/vol/oracle_crs1labpp /u02_CRS/ocr nfs rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node1 0 0


[root@atlracp02 crsd]# umount /u03_CRS/ocr
umount: /u03_CRS/ocr: device is busy
umount: /u03_CRS/ocr: device is busy

[root@atlracp02 crsd]# umount -f /u03_CRS/ocr
umount2: Device or resource busy
umount: /u03_CRS/ocr: device is busy
umount2: Device or resource busy
umount: /u03_CRS/ocr: device is busy
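Before resorting to a lazy unmount, it is worth finding out what is pinning the mount; `fuser -vm /u03_CRS/ocr` or `lsof /u03_CRS/ocr` will show the holders. Where neither tool is available, a crude stand-in (`holders` below is a hypothetical helper, not a standard utility) can walk /proc directly:

```shell
#!/bin/sh
# Hypothetical helper: print PIDs that hold an open file descriptor whose
# target path starts with the given prefix, by scanning /proc/<pid>/fd.
# A rough, unprivileged approximation of `fuser -m`: it misses mmaps, cwds,
# and fds of processes you lack permission to inspect.
holders() {
    prefix=$1
    for fd in /proc/[0-9]*/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            "$prefix"*) printf '%s\n' "$fd" | cut -d/ -f3 ;;   # PID component
        esac
    done | sort -un
}
```

For example, `holders /u03_CRS/ocr` would have listed the PIDs keeping the volume busy, which you can then inspect with `ps -fp <pid>` before deciding whether a lazy unmount is safe.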

Tried to do a lazy unmount.

[root@atlracp02 crsd]# umount -l /u03_CRS/ocr

It worked:

[root@atlracp02 crsd]#  cat /proc/mounts |grep -i ocr
atlprod01-node2:/vol/oracle_crs0labpp /u01_CRS/ocr nfs rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2 0 0
atlprod01-node1:/vol/oracle_crs1labpp /u02_CRS/ocr nfs rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node1 0 0

[root@atlracp02 crsd]# mount /u03_CRS/ocr

[root@atlracp02 crsd]# cat /proc/mounts |grep -i ocr
atlprod01-node2:/vol/oracle_crs0labpp /u01_CRS/ocr nfs rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2 0 0
atlprod01-node1:/vol/oracle_crs1labpp /u02_CRS/ocr nfs rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node1 0 0
atlprod01-node2:/vol/oracle_crs2labpp /u03_CRS/ocr nfs rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=atlprod01-node2 0 0
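The unmount/mount pairs above can be wrapped in a small helper (`remount` below is hypothetical, not an Oracle tool) that falls back to a lazy unmount when the volume is busy. Run it only as root, with Clusterware down on the node:

```shell
#!/bin/sh
# Hypothetical helper: remount one NFS volume so it picks its options back up
# from /etc/fstab, falling back to a lazy unmount on "device is busy".
remount() {
    mp=$1
    if ! umount "$mp" 2>/dev/null; then
        echo "$mp busy, trying lazy unmount" >&2
        umount -l "$mp"
    fi
    mount "$mp"   # re-reads the options from /etc/fstab
}

# usage (as root, Clusterware stopped on this node):
#   for mp in /u01_CRS/ocr /u02_CRS/ocr /u03_CRS/ocr; do remount "$mp"; done
```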


Now I stopped and started CRS on the failing node (crsctl stop crs, then crsctl start crs, as root), and it came back fine:

crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online



Comments:

  1. Thanks so much - that really helped me. Clusterware was down on one node because ocr mounts were not correct.

    Patrick

  2. Hi...

    [root@rac1 crsd]# tail -f crsd.log
    2018-07-04 15:00:22.377: [ OCRASM][1066164720]proprasmcl: asmhandle is NULL
    2018-07-04 15:00:22.377: [ OCRRAW][1066164720]proprinit: Could not open raw device
    2018-07-04 15:00:22.377: [ OCRASM][1066164720]proprasmcl: asmhandle is NULL
    2018-07-04 15:00:22.378: [ OCRAPI][1066164720]a_init:16!: Backend init unsuccessful : [26]
    2018-07-04 15:00:22.378: [ CRSOCR][1066164720] OCR context init failure. Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
    ORA-15077: could not locate ASM instance serving a required diskgroup
    ] [7]
    2018-07-04 15:00:22.378: [ CRSD][1066164720][PANIC] CRSD exiting: Could not init OCR, code: 26
    2018-07-04 15:00:22.379: [ CRSD][1066164720] Done.
