5.18.2013

A tricky standby database situation

A tricky and interesting situation came up this morning while configuring a standby database of over 1.5 TB. While the database was being cloned to the DR site with the DUPLICATE ... FROM ACTIVE DATABASE command, which took more than a day and a half, a couple of new datafiles were added to the primary database. Once the cloning was over, the newly built DR database was almost 2 days behind the primary. I knew I could bring the primary and standby back in sync with the standby roll-forward method, but taking a fresh incremental backup for the roll-forward would have taken a long time, and I already had daily cumulative incremental backups on tape. Hence, I decided to make use of the existing backups. When I followed the roll-forward method, I ran into the following error:

RMAN> SWITCH DATABASE TO COPY;

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of switch to copy command at 05/18/2013 10:25:29
RMAN-06571: datafile 58 does not have recoverable copy


This was expected, because the datafile in question was added after the standby database creation had started.

Workaround:
I had to try an out-of-the-box variation of the roll-forward method:
  1. Re-create and restore the standby controlfile
  2. Restore missing datafiles on the standby
  3. Catalog standby database datafiles (diskgroup was different from primary)
  4. Recover the database
  5. Complete the rest of the standby configuration to bring it in sync
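
Steps 1 to 4 could look roughly like the RMAN sketch below. It is only an illustration and not the exact script that was run: the controlfile backup path and the '+DATA_DR' disk group and directory are made-up placeholders, datafile 58 is simply the one named in the error above, and the tape channel configuration is assumed to be in place already. The shell wrapper only writes the commands to a file for review; it does not connect to any database.

```shell
#!/bin/sh
# Write a hypothetical RMAN command file for review. Nothing is executed here.
cmdfile="${TMPDIR:-/tmp}/standby_rollforward.rman"

cat > "$cmdfile" <<'EOF'
# Step 1: replace the standby controlfile with a fresh one backed up on the primary
# (placeholder path; the backup is assumed to come from
#  BACKUP CURRENT CONTROLFILE FOR STANDBY on the primary)
SHUTDOWN IMMEDIATE;
STARTUP NOMOUNT;
RESTORE STANDBY CONTROLFILE FROM '/backup/stby_ctl.bkp';
ALTER DATABASE MOUNT;

# Step 2: restore the datafile that was added while the clone was running
# ('+DATA_DR' is a placeholder disk group name)
RUN {
  SET NEWNAME FOR DATAFILE 58 TO '+DATA_DR';
  RESTORE DATAFILE 58;
}

# Step 3: catalog the standby datafiles, then point the controlfile at them
# (placeholder directory under the placeholder disk group)
CATALOG START WITH '+DATA_DR/stby/datafile/' NOPROMPT;
SWITCH DATABASE TO COPY;

# Step 4: roll the standby forward from the incremental backups, without redo
RECOVER DATABASE NOREDO;
EOF

echo "RMAN commands written to $cmdfile"
echo "Review them, then run on the standby: rman target / cmdfile=$cmdfile"
```

Keeping the commands in a file makes it easier to review the sequence with a colleague before anything is run against the standby.
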
I will be writing a detailed article on this. Stay tuned for more.

Happy reading

Jaffar


5.14.2013

New Page - Data Guard

A quick update about the new page.

I have created a new page (tab), 'Data Guard', on my blog to share and discuss all the Data Guard related issues that we ran into during our extensive DR setup and testing. The objective is to record all the errors and issues from the Data Guard setup and how we resolved them. I will also be sharing the DR configuration procedure and the best practices that we used in our environment.

I appreciate your input. If you are interested in sharing or writing something on the subject, do write to me and I will put it on the page under your name.

Have a nice day,

Jaffar


5.11.2013

It's Data Guard time for the team yet again

Just a very quick update about my upcoming tasks and what I will be doing over the next 3 weeks.

It is indeed going to be a super busy rest of the month for the entire team, as Data Guard needs to be configured for over 41 RAC databases. We will be fully occupied for the next 3 weeks creating standby databases and configuring the DG setup in order to have a fully functional DR environment.

We did a similar exercise a few months ago to test the DR capabilities for database and application, and now it's time to have a permanent DR configuration. Therefore, expect a lot of blogging about DR stuff in the coming days on my blog.

Wish me luck, people.


Jaffar


5.03.2013

Things to be considered before/after the OS patch deployment

The objective of this write-up is to emphasize the importance of verifying patch compatibility and relinking the Oracle home after patching the underlying operating system (OS) in any Oracle environment. I would like to share an incident (a little story) that we encountered a few days ago in one of our non-production RAC environments, where the Clusterware stack didn't start after the OS patch deployment.

As part of the organization's patching policy, our HP-UX admin scheduled the latest quarterly HP-UX v11.3x OS patch deployment on all servers, and both a non-RAC and an Oracle RAC environment were patched as part of it. Though the patching went smoothly on both environments, we faced issues starting the cluster stack in the RAC environment. When we checked the cluster stack status, we noticed that the Cluster Synchronization Services daemon (cssd) was in the 'STARTING' state, as shown below:


$ ./crsctl stat res -init -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE      rac1                     
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE      rac1                     
ora.crsd
      1        ONLINE  OFFLINE      rac1                     
ora.cssd
      1        ONLINE  OFFLINE      rac1                     STARTING                
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1                      


The Oracle High Availability Services daemon (ohasd) started without any issues; however, crsd couldn't be started on any of the nodes after the patch deployment. Upon examining ocssd.log, we found that the cssd process was somehow unable to discover the voting disks, hence the crsd process couldn't start. The following messages appeared in ocssd.log:

CRS-1714:Unable to discover any voting files
2013-04-23 18:47:16.553: [ SKGFD][6]Discovery with str:/dev/rdsk/c0t5d5,/dev/rdsk/c0t5d4:

2013-04-23 18:47:16.553: [ SKGFD][6]UFS discovery with :/dev/rdsk/c0t5d5:
2013-04-23 18:47:16.559: [ SKGFD][6]Fetching UFS disk :/dev/rdsk/c0t5d5:
2013-04-23 18:47:16.559: [ SKGFD][6]OSS discovery with :/dev/rdsk/c0t5d5:
2013-04-23 18:47:16.559: [ SKGFD][6]Discovery advancing to nxt string :/dev/rdsk/c0t5d4:
2013-04-23 18:47:16.559: [ SKGFD][6]UFS discovery with :/dev/rdsk/c0t5d4:
2013-04-23 18:47:16.564: [ SKGFD][6]Fetching UFS disk :/dev/rdsk/c0t5d4:
2013-04-23 18:47:16.564: [ SKGFD][6]OSS discovery with :/dev/rdsk/c0t5d4:
2013-04-23 18:47:16.564: [ CSSD][6]clssnmvDiskVerify: Successful discovery of 0 disks
2013-04-23 18:47:16.564: [ CSSD][6]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2013-04-23 18:47:16.564: [ CSSD][6]clssnmvFindInitialConfigs: No voting files found
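
When several nodes are affected, it can help to check each ocssd.log for this signature quickly. The following is a minimal sketch, assuming the log lines look like the excerpt above; the function names are made up for illustration and are not part of any Oracle tool.

```shell
#!/bin/sh
# Hypothetical helpers for scanning an ocssd.log; the names are illustrative only.

# Succeeds (exit status 0) if the log shows that CSSD found no voting files.
voting_discovery_failed() {
    grep -E -q 'CRS-1714|No voting files found|Successful discovery of 0 disks' "$1"
}

# Prints each device CSSD tried to read, one per line, without duplicates.
probed_devices() {
    sed -n 's/.*Fetching UFS disk :\(.*\):$/\1/p' "$1" | sort -u
}

# When called with a readable log file as the first argument, print a short report.
if [ -n "$1" ] && [ -r "$1" ]; then
    if voting_discovery_failed "$1"; then
        echo "Voting file discovery failed. Devices probed:"
        probed_devices "$1"
    else
        echo "No voting file discovery failure found in $1"
    fi
fi
```

The list of probed devices is a handy starting point for the ownership, permission and dd checks on each node.
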

From the messages it was pretty clear that, for some reason, the voting disks (placed on shared storage) were inaccessible to the nodes. When we searched the internet and My Oracle Support (MOS) with the combination of error codes, all the links pointed to verifying the ownership and permissions on the voting disks. We found no issues with the ownership or permissions; we even dumped the disks with the dd command and found no corruption. After an hour of hard struggle, we got a little hope when we came across a MOS note (ID 1508899.1) that explained an incident close to ours.
According to the note, the issue is due to bug 14810756, and the workaround is to apply patch 14810756 or roll back the OS patch PHCO_43004. Applying the patch was not an option for us as we were not able to start up the cluster, so we checked with the OS admin whether PHCO_43004 was part of the bundle patch deployed on the HP-UX 11.3x platform. The OS admin confirmed that it was. We then requested the OS admin to roll back the patch to try our luck. After the patch was rolled back on one node, the cluster stack started successfully on that node. We did the same on the rest of the nodes and everything came back successfully.
The MOS note states that the issue is likely to happen during the execution of the rootupgrade.sh script as part of a cluster upgrade from 11.2.0.2 to 11.2.0.3 on the HP-UX 11.3x platform, when the voting disks are placed on disk/raw devices.
We fail to understand why HP didn't mention this behavior even though similar issues were recorded and addressed on the HP forums.
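
Checking whether a given patch is part of an installed bundle can be scripted so that it becomes a routine pre-check. Below is a minimal sketch: the function name is made up, it simply reads a patch listing on standard input, and it assumes a grep that supports the -w (whole word) option. On HP-UX the listing would typically come from swlist; please confirm the exact swlist options with your OS admin.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the patch ID given as $1 appears as a whole
# word in the patch listing read from standard input.
patch_in_listing() {
    grep -q -w -- "$1"
}

# Typical use on HP-UX (assumption: swlist prints the patch IDs in its output):
#   swlist -l patch | patch_in_listing PHCO_43004 && echo "PHCO_43004 is installed"
```
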

Conclusion:
The motive of this blog entry is to emphasize the importance of verifying the compatibility of a patch before deploying it in any environment.
Also, it is highly advisable to relink the binaries manually right after the OS patch deployment. The following demonstrates how to relink the binaries in an 11gR2 GI RAC environment:

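As the root user, unlock the CRS (ensure the cluster stack is not running on the server):
$GRID_HOME/crs/install/rootcrs.pl -unlock

As the Oracle software owner, with all instances running from this home shut down, relink the RDBMS binary with the RAC option on:
cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk rac_on ioracle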
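
For the Grid Infrastructure home itself, the unlock has to be followed by a relink and a re-lock. The sketch below prints the sequence as I recall it from Oracle's 11.2 Grid Infrastructure installation guide, so please verify it against the documentation for your exact version before use. The default GRID_HOME path is a made-up placeholder, and the script only prints the steps; it does not run them, since they have to be run by different users.

```shell
#!/bin/sh
# Prints the GI relink steps for review; it does not execute any of them.
GRID_HOME="${GRID_HOME:-/u01/app/11.2.0/grid}"   # hypothetical default path

print_gi_relink_plan() {
    cat <<EOF
# as root (cluster stack not running on this node)
cd $GRID_HOME/crs/install && perl rootcrs.pl -unlock
# as the Grid Infrastructure software owner
export ORACLE_HOME=$GRID_HOME
$GRID_HOME/bin/relink
# as root again
cd $GRID_HOME/rdbms/install && ./rootadd_rdbms.sh
cd $GRID_HOME/crs/install && perl rootcrs.pl -patch
EOF
}

print_gi_relink_plan
```

Printing the plan first gives the DBA and the OS admin a shared checklist, which matters here because the steps alternate between root and the software owner.
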


References:
  • How to Check Whether Oracle Binary/Instance is RAC Enabled and Relink Oracle Binary in RAC [ID 284785.1]
  • hp-ux: 11gR2 GI Fails to Start or rootupgrade.sh Fails with "clsfmt: Received unexpected error 4 from skgfifi for file" if PHCO_43004 is Applied [ID 1508899.1]



4.05.2013

Managing & Troubleshooting Cluster - 360 degrees -- upcoming webinar

My upcoming webinar, 'Managing & Troubleshooting Cluster - 360 degrees', sponsored and arranged by RedGate, is scheduled for 25th April 2013. The following topics will be covered during the course of the presentation:


  • What's new in 11gR2 Clusterware – Key new features at a glance
  • Oracle 11gR2 Clusterware software stack
  • Clusterware start up sequence
  • CRS logs & directory tree structure
  • Analyzing CRS logs
  • CRS logs rotation/retention policy
  • Troubleshooting Cluster start up issues
  • Debugging/Tracing CRS components
  • Tools & Utilities – how to pick the right one
  • Q&A
If you haven't enrolled yet, enroll now. Registration is limited!

See you at the webinar (of course virtually).

Jaffar