Thursday, June 15, 2023

RMAN - Validate plus Preview - the optimal combination?

For a long time, we have worked with VALIDATE and PREVIEW in RMAN to check the backups we created for completeness and consistency. There is a problem that neither of these commands covers the full spectrum of what you want to check. Reviewing this way of working, I (finally) improved the automation of these checks.

Basically, when using Preview, it lists all the files it needs to get to the specified point for restore/recover. On the other hand, Validate actually checks if files are available and usable. The drawback of Preview is that you don't check the files to be actually there. The drawback of Validate is that it only checks the files to restore, not any files needed after that to do a recovery. Only by using a combination of the two outputs, you can get a higher degree of confidence that you have all the files necessary and that they are actually usable.

I came across this statement (which can be augmented with a "summary" to shorten the output):

RMAN> RESTORE DATABASE VALIDATE PREVIEW;

It first selects all files needed to restore and recover, and then it checks the files with validate. That means that you don't need two commands and compare the output, but you can just use this single command. That works fine for Level 0 backups, archivelog backups and separate archives (those not yet in backup). Problem is that any Level 1 backups that are listed by Preview, are not actually validated.

Disclaimer: This syntax (validate and preview combined) is not documented like this, but it seems to do the trick. I will (cautiously) use this from now on and see if this holds up in the real world...

Checking the level 1 backups

With the output from the Preview, it is easy to write a script that extracts all the Level 1 backups and generate an RMAN script to validate all backupsets from the Level 1 backups. The output from that check plus the original output gives you a higher level of confidence in your backups. You now know that your Level 0, Level 1, archivelog backups and archivelogs are all present and usable in case of a restore/recover scenario.

Checking recoverability - check your SCNs

When using the "preview" command, somewhere at the end of the preview section output, it will say something like this:

recovery will be done up to SCN  2482924
Media recovery start SCN is      2482924
Recovery must be done beyond SCN 2484433 to clear datafile fuzziness

Here you can see that the "recovery up to SCN" is equal to the "Start SCN", while it should be higher than the "Recovery beyond SCN" to get to a consistent state. That's not good. However, this can easily be cleared by using an UNTIL TIME clause. It will do the same file checks, but the values displayed in the summary change. This makes it easier to (automatically) check if you are getting what you want.

RMAN> RESTORE DATABASE UNTIL TIME "sysdate" VALIDATE PREVIEW SUMMARY;

recovery will be done up to SCN  2641133
Media recovery start SCN is      2482924
Recovery must be done beyond SCN 2484433 to clear datafile fuzziness

Much better. It doesn't change the files it checks, but now the logging is clearer.

The ultimate RMAN check?

You could even add one final touch: check the logical sanity of your files:

RESTORE DATABASE UNTIL TIME "sysdate" VALIDATE CHECK LOGICAL PREVIEW SUMMARY;

Putting it all together - scripting your validation

A rather simple script can be made to give you the validation and some checks. Can probably be done much cleaner and nicer, but I didn't really focus on the scripting syntax yet. You can easily guess the contents of the ".rmn" script used here... 
Some assumptions apply, like that the ORACLE_SID has been set before the script is run. Be aware that some modifications will probably be needed before using this on your own environment.

This version has some basic checks:
  • RMAN error checking
  • SCN validation (check if restore/recover goes beyond the "clear fuzziness" SCN)
  • Level 1 backup validation 
#!/bin/sh
# -----------------------------------------------
# RMAN validate backups
# -----------------------------------------------
. /home/oracle/.bash_profile
export NLS_DATE_FORMAT="dd-mm-yyyy hh24:mi:ss"
export LOGFILE=rman_validate_preview.log
export BASEDIR=$(dirname "$0")
#
cd $BASEDIR
echo $(date +'%d-%m-%Y %H:%M:%S') " - starting backup validate" > $LOGFILE
echo $(date +'%d-%m-%Y %H:%M:%S') " - creating RMAN output files" >> $LOGFILE
#
# Call the rman script and write output to separate file
#
echo $(date +'%d-%m-%Y %H:%M:%S') " - RMAN validate preview" >> $LOGFILE
rman target / @validate_preview.rmn log=validate_preview.log
#
# Check level 1 backups (if any)
# Also check controlfile restore
# (comes in handy when no level 1 backups exist to prevent an empty run block)
#
LEVEL1CMDFILE=validate_level_1.rmn
echo "run {" > $LEVEL1CMDFILE
echo "restore controlfile validate;" >> $LEVEL1CMDFILE
awk '/ B  1 / {print "validate backupset " $1 ";"}' validate_preview.log >> $LEVEL1CMDFILE
echo "}" >> $LEVEL1CMDFILE
#
echo $(date +'%d-%m-%Y %H:%M:%S') " - RMAN validate controlfile and level 1 backups" >> $LOGFILE
rman target / @$LEVEL1CMDFILE log=validate_level_1.log
#
# check the output for errors
#
echo $(date +'%d-%m-%Y %H:%M:%S') " - Check output for errors" >> $LOGFILE
ERROR_COUNT=$(grep RMAN- validate_*.log | wc -l)
if [ $ERROR_COUNT -eq 0 ]
then
  echo "No errors found in validation" >> $LOGFILE
else
  echo "Errors found in validation" >> $LOGFILE
  echo "--------------------------------" >> $LOGFILE
  grep RMAN- validate_*.log         >> $LOGFILE
  echo "--------------------------------" >> $LOGFILE
fi
echo "######################################################" >> $LOGFILE
#
# Check the validate preview for SCN validity
#
echo $(date +'%d-%m-%Y %H:%M:%S') " - Compare SCNs for recovery" >> $LOGFILE
SCN_START=$(awk '/Media recovery start SCN is/ {print $6}' validate_preview.log)
SCN_RECOV=$(awk '/recovery will be done up to SCN/ {print $8}' validate_preview.log)
SCN_FUZZY=$(awk '/Recovery must be done beyond SCN/ {print $7}' validate_preview.log)
echo "Recovery start at SCN $SCN_START and ends at SCN $SCN_RECOV" >> $LOGFILE
if [ $SCN_RECOV -lt $SCN_FUZZY ]
then
  echo "ERROR: Recovery ends at $SCN_RECOV, but should be done beyond $SCN_FUZZY" >> $LOGFILE
else
  echo "Recovery ends at $SCN_RECOV and this is beyond $SCN_FUZZY, as it should be" >> $LOGFILE
fi
echo "######################################################" >> $LOGFILE
#
# Append all output to global logfile
#
echo $(date +'%d-%m-%Y %H:%M:%S') " - Detailed log output" >> $LOGFILE
echo "######################################################" >> $LOGFILE
cat validate_preview.log >> $LOGFILE
echo "######################################################" >> $LOGFILE
cat validate_level_1.log >> $LOGFILE
echo "######################################################" >> $LOGFILE
echo $(date +'%d-%m-%Y %H:%M:%S') " - End of report" >> $LOGFILE