Monday, August 27, 2012

CRS-4530: Communications failure contacting Cluster Synchronization Services daemon




Environment:
Oracle Grid Infrastructure 11.2.0.1
Oracle database server 11.2.0.1

> crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager


> crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        OFFLINE OFFLINE
ora.crsd
      1        ONLINE  INTERMEDIATE raclr41
ora.cssd
      1        ONLINE  OFFLINE
ora.cssdmonitor
      1        ONLINE  ONLINE       raclr41
ora.ctssd
      1        ONLINE  OFFLINE
ora.diskmon
      1        OFFLINE OFFLINE
ora.drivers.acfs
      1        OFFLINE OFFLINE
ora.evmd
      1        ONLINE  OFFLINE
ora.gipcd
      1        ONLINE  ONLINE       raclr41
ora.gpnpd
      1        ONLINE  ONLINE       raclr41
ora.mdnsd
      1        ONLINE  ONLINE       raclr41
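
On a busy listing it is easy to miss which resources are meant to be up but are not. A minimal sketch, assuming the output layout shown above (a resource name line starting with "ora." followed by an instance line with TARGET and STATE columns); the function name offline_init_resources is hypothetical and not part of crsctl:

```shell
# Hypothetical helper: list resources whose TARGET is ONLINE but whose STATE is not.
# Reads "crsctl stat res -t -init" style output on stdin.
offline_init_resources() {
  awk '
    # Resource name line, e.g. "ora.cssd": remember it for the next instance line
    /^ora\./ { name = $1; next }
    # Instance line, e.g. "1  ONLINE  OFFLINE": fields are number, TARGET, STATE
    name != "" && $1 ~ /^[0-9]+$/ && $2 == "ONLINE" && $3 != "ONLINE" {
      print name, $3
    }
  '
}

# Example usage (as the Grid Infrastructure owner):
#   crsctl stat res -t -init | offline_init_resources
```

Against the listing above this would report ora.crsd, ora.cssd, ora.ctssd and ora.evmd, which matches the three CRS-45xx errors from crsctl check crs.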

I tried to start ora.cssd manually:

raclr41 | CRS | /home/oracle
> crsctl start res ora.cssd -init

The command hung and never returned. From another session, I checked the ocssd log (under $GI_HOME/log/<host_name>/cssd):

2012-08-15 14:05:50.103: [ GIPCNET][1120729408]gipcmodNetworkProcessConnect: slos op  :  sgipcnTcpConnect
2012-08-15 14:05:50.103: [ GIPCNET][1120729408]gipcmodNetworkProcessConnect: slos dep :  No route to host (113)
2012-08-15 14:05:50.103: [ GIPCNET][1120729408]gipcmodNetworkProcessConnect: slos loc :  connect
2012-08-15 14:05:50.103: [ GIPCNET][1120729408]gipcmodNetworkProcessConnect: slos info:  addr '192.168.1.110:29850'
2012-08-15 14:05:50.103: [    CSSD][1120729408]clssscSelect: conn complete ctx 0x2aaaac09bae0 endp 0xa66
2012-08-15 14:05:50.103: [    CSSD][1120729408]clssnmeventhndlr: node(1), endp(0xa66) failed, probe((nil)) ninf->endp (0x100000a66) CONNCOMPLETE
2012-08-15 14:05:50.103: [    CSSD][1120729408]clssnmDiscHelper: raclr40, node(1) connection failed, endp (0xa66), probe(0x100000000), ninf->endp 0xa66
2012-08-15 14:05:50.103: [    CSSD][1120729408]clssnmDiscHelper: node 1 clean up, endp (0xa66), init state 0, cur state 0
2012-08-15 14:05:50.103: [GIPCXCPT][1120729408]gipcInternalDissociate: obj 0x11588660 [0000000000000a66] { gipcEndpoint : localAddr 'gipc://raclr41:68bf-1bc8-a218-974f#192.168.1.111#13372', remoteAddr 'gipc://raclr40:nm_raclr#192.168.1.110#29850', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x8061a, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2012-08-15 14:05:50.103: [GIPCXCPT][1120729408]gipcDissociateF [clssnmDiscHelper : clssnm.c : 3301]: EXCEPTION[ ret gipcretFail (1) ]  failed to dissociate obj 0x11588660 [0000000000000a66] { gipcEndpoint : localAddr 'gipc://raclr41:68bf-1bc8-a218-974f#192.168.1.111#13372', remoteAddr 'gipc://raclr40:nm_raclr#192.168.1.110#29850', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x8061a, usrFlags 0x0 }, flags 0x0
2012-08-15 14:05:50.103: [    CSSD][1120729408]clssnmDiscEndp: gipcDestroy 0xa66
2012-08-15 14:05:50.111: [    CSSD][1108113728]clssnmvDHBValidateNCopy: node 1, raclr40, has a disk HB, but no network HB, DHB has rcfg 229086889, wrtcnt, 9907057, LATS 1513031694, lastSeqNo 9907057, uniqueness 1345052387, timestamp 1345053949/1513006814
2012-08-15 14:05:50.111: [    CSSD][1120729408]clssnmconnect: connecting to addr gipc://raclr40:nm_raclr#192.168.1.110#29850
2012-08-15 14:05:50.111: [    CSSD][1120729408]clssscConnect: endp 0xa72 - cookie 0x2aaaac09bae0 - addr gipc://raclr40:nm_raclr#192.168.1.110#29850
2012-08-15 14:05:50.111: [    CSSD][1120729408]clssnmconnect: connecting to node(1), endp(0xa72), flags 0x10002
2012-08-15 14:05:50.343: [    CSSD][1115998528]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2012-08-15 14:05:50.391: [    CSSD][1112844608]clssnmvDHBValidateNCopy: node 1, raclr40, has a disk HB, but no network HB, DHB has rcfg 229086889, wrtcnt, 9907057, LATS 1513031974, lastSeqNo 9907057, uniqueness 1345052387, timestamp 1345053949/1513006814
2012-08-15 14:05:50.391: [    CSSD][1103583552]clssnmvDHBValidateNCopy: node 1, raclr40, has a disk HB, but no network HB, DHB has rcfg 229086889, wrtcnt, 9907057, LATS 1513031974, lastSeqNo 9907057, uniqueness 1345052387, timestamp 1345053949/1513006814
2012-08-15 14:05:51.115: [    CSSD][1108113728]clssnmvDHBValidateNCopy: node 1, raclr40, has a disk HB, but no network HB, DHB has rcfg 229086889, wrtcnt, 9907058, LATS 1513032704, lastSeqNo 9907058, uniqueness 1345052387, timestamp 1345053950/1513007814
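
The two things worth pulling out of this log are the address the connect attempts target and the node that still has a disk heartbeat but no network heartbeat. A rough sketch of how to extract them, assuming the message formats shown in the excerpt above; both function names are hypothetical, and the address helper simply prints every address found on a "slos info" line, so it assumes those lines belong to failed connects as they do here:

```shell
# Hypothetical helper: nodes reported with a disk heartbeat but no network heartbeat.
# Reads ocssd log lines on stdin and prints each distinct node once.
no_network_hb_nodes() {
  sed -n 's/.*: node \([0-9][0-9]*\), \([^,]*\), has a disk HB, but no network HB.*/node \1 (\2)/p' |
    sort -u
}

# Hypothetical helper: distinct addresses named on GIPC "slos info" lines.
gipc_connect_addrs() {
  sed -n "s/.*slos info: *addr '\([^']*\)'.*/\1/p" | sort -u
}

# Example usage, with ocssd_log set to the log file under the cssd directory:
#   no_network_hb_nodes < "$ocssd_log"
#   gipc_connect_addrs  < "$ocssd_log"
```

For the excerpt above, the first helper points at node 1 (raclr40) and the second at 192.168.1.110:29850, which is the address to look up next.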


> cat /etc/hosts |grep 192.168.1.110
192.168.1.110   raclr40ic raclr40ic.imanheim.com


That is the interconnect IP of the other node (raclr40).

Next, I checked whether the interconnect was reachable.

> ping 192.168.1.110
PING 192.168.1.110 (192.168.1.110) 56(84) bytes of data.
From 192.168.1.111 icmp_seq=2 Destination Host Unreachable
From 192.168.1.111 icmp_seq=3 Destination Host Unreachable
From 192.168.1.111 icmp_seq=4 Destination Host Unreachable

--- 192.168.1.110 ping statistics ---
6 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4999ms
, pipe 3
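
The same ping check can be scripted, for example to run periodically against each private address. A small sketch, assuming Linux ping summary output in the format shown above; ping_loss_percent and check_interconnect are hypothetical names chosen for illustration:

```shell
# Hypothetical helper: extract the packet loss percentage from ping output on stdin.
ping_loss_percent() {
  sed -n 's/.* \([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical helper: ping an interconnect address and report whether it answers.
check_interconnect() {
  loss=$(ping -c 3 "$1" 2>/dev/null | ping_loss_percent)
  if [ "$loss" = "0" ]; then
    echo "$1 reachable"
  else
    echo "$1 unreachable or lossy (loss: ${loss:-unknown}%)"
    return 1
  fi
}

# Example usage with the interconnect IP from this post:
#   check_interconnect 192.168.1.110
```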

So the interconnect interface was down. I engaged the system administrators, and the interface was brought back online. That fixed the issue.

> crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....ER.lsnr ora....er.type ONLINE    ONLINE    raclr40
ora....N1.lsnr ora....er.type ONLINE    ONLINE    raclr40
ora....N2.lsnr ora....er.type ONLINE    ONLINE    raclr41
ora....N3.lsnr ora....er.type ONLINE    ONLINE    raclr41
ora.asm        ora.asm.type   OFFLINE   OFFLINE
ora....SM1.asm application    OFFLINE   OFFLINE
ora....18.lsnr application    ONLINE    ONLINE    raclr40
ora....418.gsd application    OFFLINE   OFFLINE
ora....418.ons application    ONLINE    ONLINE    raclr40
ora....418.vip ora....t1.type ONLINE    ONLINE    raclr40
ora....SM2.asm application    OFFLINE   OFFLINE
ora....19.lsnr application    ONLINE    ONLINE    raclr41
ora....419.gsd application    OFFLINE   OFFLINE
ora....419.ons application    ONLINE    ONLINE    raclr41
ora....419.vip ora....t1.type ONLINE    ONLINE    raclr41
ora.eons       ora.eons.type  ONLINE    ONLINE    raclr40
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    raclr40
ora.oc4j       ora.oc4j.type  OFFLINE   OFFLINE
ora.ons        ora.ons.type   ONLINE    ONLINE    raclr40
ora....ry.acfs ora....fs.type OFFLINE   OFFLINE
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    raclr40
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    raclr41
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    raclr41







4 comments:

  1. I had a similar problem for
    ora.crsd
    1 ONLINE OFFLINE
    and managed to start ora.crsd manually after reading this blog.

    Oracle up again. Thank you! :)

  2. Thanks for your clear documentation.
    I'm running 11.2.0.1.0 on OL 5.10.
    How do you make it run at boot?
    Every time I reboot this machine, I have to issue "crsctl start res ora.cssd -init" again.
    Maybe it's related to this error:

    [item1@mtp dbs]$ sqlplus / as sysdba

    SQL*Plus: Release 11.2.0.1.0 Production on Fri Jul 18 16:48:15 2014

    Copyright (c) 1982, 2009, Oracle. All rights reserved.

    Connected to an idle instance.

    SQL> startup
    ORA-00099: warning: no parameter file specified for ASM instance
    ORA-01031: insufficient privileges
    SQL> exit

    I found a reference to this issue here: https://community.oracle.com/message/9821617


  3. Great blog. It solved the issue.

  4. Great! Thank You
    crsctl start res ora.cssd -init

    Solved the issue
