Posts

Time is money!

     Your boss keeps talking about RPO (Recovery Point Objective)  and RTO (Recovery Time Objective).  Do you just nod your head like you know what he/she is talking about?  Maybe that scenario just happened and you are searching the internet for what these terms mean.  If so, welcome!  No one likes to think about disasters, but they happen all too often.  Planning for the worst and hoping for the best will keep your data safe and your job even safer. Let’s take some time and explore what RPO and RTO mean, why these things are important, and what you need to do next to be a hero DBA!

RTO (Recovery Time Objective)

     Recovery Time Objective is the amount of time in which your company expects you to have the database fully restored after a disaster.  That is, how much downtime is acceptable for disaster recovery or planned outages.  Each company is different, and most reference RTO in terms of nines. 

     The nines measure for a company that measures 365 days a year, 24 hours a day as follows:

5 9’s – 99.999% (this translates to about 5 minutes of acceptable downtime per year)
4 9’s – 99.99% (this translates at about 52.5 minutes per year and is much easier to achieve)
3 9’s – 99.9%  (this translates at about 8.75 hours per year)
2 9’s – 99% (translates to about 3.5 days a year)

     To decide what RTO is best for your company, you need to take into consideration your data needs.  Not all companies run on a 365/24 schedule.  Some companies only measure downtime between 8am-6pm Monday through Friday, or only on the weekends.   This will drastically change the translation of the 9’s.  Another thing to think about is whether the measured downtown includes time for maintenance or patching, times when the database must be offline.  If maintenance time is eliminated from consideration, meeting the higher 9’s is much easier.

     If your company insists on an RTO of 5 Nines and does not take into consideration maintenance or patching, then you must speak with the persons in charge to discuss the RPO.  It is possible to adhere to the strict 5 minutes of downtime, but the point at which you are able to recover, will definitely be restricted.

RPO (Recovery Point Objective)

     Recovery Point Objective is the level of data or work that is acceptable to lose in the event of a disaster.   Ideally, companies will want ZERO data or work loss.  While that IS achievable, it will all depend on  valid backups and the extent of damage the database suffered at the point of disaster.  

     An RPO of 15 minutes means that the data and work must be recoverable to a point within 15 minutes of the disaster, meaning that it is expected that only 15 minutes of work or data may be lost.  Stop right here and think about your backup  plans and recovery models.  Restoring a database that is in simple recovery model should not take as long as a restoring one in full recovery model. It is important to remember (from previous blog posts), the recovery model dictates how much data you can recover. It is also important to remember, the ability to recover ANY data at all is fully dependent on having valid backups.

Run Book

     Another term you might hear is “Run Book”.  A Run Book is a physical or digital collection of information that is needed to restart the database in case of disaster.  There are many items that should be included in the runbook.  Some of the essential items one should consider having in the runbook are:

  • Server level info, configuration, purpose, etc.
  • List of all databases and applications using them
  • List of agent jobs and proper response to a failure
  • Disaster Recovery process with all contacts, RPO/RTO, etc. required to bring it back (based on level of issue)
  • Security
  • Backup schedules

     When considering a run book, think about what someone would need if they were new to the company and the only person available to restart the database.  What information would that person need?  Making sure your run book is up to date on a regular basis is certainly a great idea!

Preparing for disaster

     Keep in mind that if you prepare for the worst, you will be less likely to be caught off-guard with a manager breathing down your neck asking “WHEN WILL WE BE BACK UP AND RUNNING?!?!”  Do you have any idea how long it will take to restore your database?  If your answer is “no,” I would suggest doing a restore to see how long this takes.  Further, I would suggest making it a habit to perform drills so that you and your team know what to do in the event of a disaster, and exactly how long it takes to get your company back up and running. Having a solid backup schedule, validating those backups, and keeping your company’s expectations in mind, you will be ready to handle any data disaster that may be thrown your way.  

      You have your SQL Server Backup Plan and your Database Recovery Model set.  How do you know if your Backups are good?  TEST!  Validating SQL Server Backups will ensure that you are in a good place when it is time to bring your database back from the dead!  

Don’t assume that your Backups are solid and let them sit on a shelf.  Corrupt backups are recoverable, but worthless.

 There are several methods for validating your Backups.

    • RESTORE –  The most effective way to validate that your backups are good is to run a test Restore.  If your Restore is successful, you have a solid backup.  Make sure to run a test restore on your Full, Differential, Point in Time, and Transaction Logs!   “Bonus points” if you automate refreshing non-production.
  • Backup with CHECKSUM It may not be realistic to run regular test restores on every single database, this is where CHECKSUM is your friend.  CHECKSUM is part of a backup operation which will instruct SQL Server to test each page being backed up with its corresponding checksum, making sure that no corruption has occurred during the read/write process.  If a bad checksum is found, the backup will fail.  If the backup completes successfully, there are no broken page checksums.
    • BEWARE though, this does not ensure that the database is corruption free, CHECKSUM only verifies that we are not backing up an already-corrupt database. (Later in this post we discuss checking data for corruption.) If it seems like too much trouble to write a CHECKSUM script every time you want to perform a backup, keep in mind that these can be automated as SQL Agent Jobs!  A sample T-SQL script for using CHECKSUM is as follows:


Backup Database TestDB
To Disk='G:DBABackupsTestDBFull_MMDDYYYY.bak'
With CheckSum;

    • VERIFY – It is not wise to rely solely on CHECKSUM, a good addition is to use RESTORE VERIFYONLY.  This will verify the backup header, and also that the backup file is readable.  Note that much like CHECKSUM, this will check to see if there are errors during the read/write process of the backup; however, it will not verify that the data itself is valid or not corrupt.  Despite the name “RESTORE VERIFONLY”, it does not actually restore the data.   VERIFY too can be automated to perform each time your backups utilizing CHECKSUM run. 
  • CHECKSUM on Restore –  Databases where BACKUP WITH CHECKSUM have been performed can then be additionally verified as part of the restore process.  This will check data pages contained in the backup file and compare it against the CHECKSUM used during the backup. Additionally, if available, the page checksum can be verified as well. If they match, you have a winner… 
    More Details on CHECKSUM and BACKUP CHECKSUM


Restore Database TestDB;
From Disk='G:DBABackupsTestDBFull_MMDDYYYY.bak'
With VerifyOnly;

Data Validation Prior to Taking Backups

    Keep in mind that if your data is corrupt prior to a backup, SQL Server can BACKUP that CORRUPTED DATA.  The validation methods mentioned above guard you against corruption occurring during backups, not against corrupted data within the backup.  For data validation prior to backups being run, it is suggested that DBCC CHECKDB be performed on each database on a regular basis.

  • DBCC CHECKDB –  SQL Server is very forgiving and will usually backup and restore corrupted data.  A best practice is to run a DBCC CHECKDB on your data to check for potential corruption.  Running CHECKDB regularly against your production databases will detect corruption quickly.  Thus providing a better chance to recover valid data from a backup, or being able to repair the corruption. CHECKDB will check the logical and physical integrity of the database by running these three primary checks*:
      • CHECKALLOC – checks the consistency of the database;
      • CHECKTABLE – checks the pages and structures of the table or indexed view; and
    • CHECKCATALOG – checks catalog consistency. 

 Automate Validation Steps  

    Corruption can happen at any time, most of the time it is related to a hardware issue.  Automating the steps necessary to validate your data and backups will help ensure you have the best practices in place to efficiently recover from catastrophic data loss.  Being able to backup and restore is not as important as being able to recover with valid data.   Despite the above keys for validation, the only true way to verify that your backups are valid is to actually restore the database.  It bears repeating: corrupt backups are recoverable, but worthless.  

*A full list of DBCC CHECKDB checks can be found here.

Last week we covered five reasons why log shipping should be used.  I got a great question that I thought should be answered here in a short blog post.  The question is “how do you configure transactional log shipping to include compression when taking the transactional log backups?”

The Missing Question

First, a better question is “should we be compressing our SQL Server Backups?” In a second we will address log shipping, but first I wanted to just focus on compressing backups. The answer is “it depends” (typical DBA response).  There is CPU overhead for utilizing compression. We will not focus too much on this here as you can enable resource governor to limit CPU usage for your backups.

Today, we will focus on capacity and misleading storage saving results due to enabling backup compression.  Looking at a traditional DBA role, you might have exposure to view your server, drives, and the size of the backups. Therefore, taking a compressed backup leads you to likely see less storage being used for the backup file compared to other backups of the same database.  This is typically the goal for backup compression to reduce storage space consumed by backups.  

Another question you should be asking yourself is, “What storage system is being used for your SQL Server backups?” For example, storage access networks (SAN) might have its own compression and native SQL Server backup compression might hurt the impact of the SAN’s compression, which could cause more RAW storage to be used. Therefore, there really isn’t a silver bullet to always use SQL Server backup compression in every environment.  You need to understand if there is any compression built into the storage used for your backups, and understand how backup compression impacts the storage system before deciding to utilize backup compression.

Compress Log Shipping Backup

Okay, now that we got our disclaimer common real-world mistake out of the way.  Here is how you would enable backup compression for transactional log shipping.

You can access Log Shipping Settings from Database Properties

Log Shipping Backup Settings on Primary

Log Shipping Backup Settings on Primary

Here are Log Shipping backup compression options.

Here are Log Shipping backup compression options.

Enabling Compression with T-SQL Backups

Enabling backup compression with T-SQL is as simple as just adding the compression option.  The following is a common example of backing up a database with compression.

[code language=”sql”] BACKUP DATABASE [LSDemo] TO DISK = N’c:\somewhere\LSDemo.bak’ WITH COMPRESSION [/code]

Enable Compression by Default

If all of your backups for an instance of SQL Server go to storage where utilizing native SQL Server compression provides improvement, then you can consider enabling compression by default. Therefore, all backups taken on this instance of SQL Server would then use backup compression by default unless another option was given during the T-SQL execution of the backup command.

[code lang=”sql”]
EXEC sys.sp_configure N’backup compression default’, N’1′
GO
RECONFIGURE WITH OVERRIDE
GO
[/code]