High Availability and Disaster Recovery Concepts
High Availability
To mitigate these risks, business-critical applications often rely on redundant hardware to provide fault tolerance. If the primary system fails, then the application can automatically fail over to the redundant system. This is the underlying principle of high availability (HA).
Disaster Recovery
In these situations, some applications may rely on restoring the latest backup to recover as much data as possible. However, more critical applications may require a redundant server to hold a synchronized copy of the data in a secondary location. This is the underpinning concept of disaster recovery (DR).
The amount of time that a solution is available to end users is known as the level of availability, or uptime.
Table 1-1 details the amount of acceptable downtime per week, per month, and per year for each level of availability.
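The arithmetic behind acceptable downtime can be sketched in a few lines of Python. This is an illustration only: the function name `allowed_downtime` and the period lengths (a 365-day year and a month of one-twelfth of a year) are assumptions, so the results may differ slightly from the figures in Table 1-1 if it uses other period lengths.

```python
from datetime import timedelta

# Assumed period lengths: a 365-day year and a month of one-twelfth of that year.
PERIODS = {
    "week": timedelta(weeks=1),
    "month": timedelta(days=365) / 12,
    "year": timedelta(days=365),
}


def allowed_downtime(availability_pct: float) -> dict[str, timedelta]:
    """Return the maximum downtime per period for an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_pct / 100
    return {name: length * unavailable_fraction for name, length in PERIODS.items()}


# Print the downtime budget for some commonly quoted levels of availability.
for level in (99.0, 99.9, 99.99, 99.999):
    budget = allowed_downtime(level)
    print(level, {name: str(value) for name, value in budget.items()})
```

Each additional 9 reduces the downtime budget by a factor of ten, which is why the higher levels of availability are so much harder to achieve.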
Actual Availability
Now that we understand how to calculate the level of availability required by an application, we should also understand how to calculate the actual availability of an application. We can do this by discovering the MTBF (Mean Time Between Failures) and MTTR (Mean Time to Recover) metrics.
The MTBF metric describes the average length of time between failures.
MTTR can actually have two different meanings:
- Mean Time to Recover or
- Mean Time to Repair.
When there is no HA or DR in place, the MTTR metric describes the average length of time it takes for something that is broken to be repaired. This could be an extended period, as we may need to wait for a service engineer to replace faulty hardware.
When thinking about HA and DR, however, we use MTTR to mean Mean Time to Recover, which records the duration of the outage. We can calculate the MTTR by taking the sum of the total downtime within our period and dividing it by the number of failures.
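As a rough sketch of how these metrics combine, the following Python computes MTTR as described above (total downtime divided by the number of failures). Two parts are assumptions rather than something stated here: MTBF is taken as total uptime divided by the number of failures, and actual availability uses the commonly quoted formula MTBF / (MTBF + MTTR). The function name and the sample figures are hypothetical.

```python
def actual_availability(period_hours: float, outage_hours: list[float]) -> dict[str, float]:
    """Compute MTBF, MTTR, and availability for one measurement period."""
    failures = len(outage_hours)
    if failures == 0:
        raise ValueError("at least one failure is needed to compute MTBF and MTTR")
    total_downtime = sum(outage_hours)
    # MTTR: total downtime in the period divided by the number of failures.
    mttr = total_downtime / failures
    # MTBF (assumed definition): total uptime divided by the number of failures.
    mtbf = (period_hours - total_downtime) / failures
    # Commonly used availability formula, expressed as a percentage.
    availability_pct = mtbf / (mtbf + mttr) * 100
    return {"mtbf": mtbf, "mttr": mttr, "availability_pct": availability_pct}


# Hypothetical figures: one week (168 hours) containing three outages.
result = actual_availability(168, [0.5, 1.0, 1.5])
print(result)
```

With these definitions, the availability figure is simply uptime divided by total time, so it can be compared directly with the level of availability the application requires.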
Service-Level Agreements and Service-Level Objectives
When a third-party provider is responsible for managing servers, the contract usually includes service-level agreements (SLAs). These SLAs define many parameters, including how much downtime is acceptable, the maximum length of time a server can be down in the event of failure, and how much data loss is acceptable if failure occurs. Normally, there are financial penalties for the provider if these SLAs are not met.
However, in internal scenarios, it is much more common for IT departments to negotiate service-level objectives (SLOs) with the business teams, as opposed to SLAs. SLOs are very similar in nature to SLAs, but their use implies that the business does not impose financial penalties on the IT department if they are not met.
Proactive Maintenance
It is important to remember that downtime is not only caused by failure but also by proactive maintenance. For example, if you need to patch the operating system, or SQL Server itself, with the latest service pack, then you must have some downtime during installation.
In this situation, high availability is essential for many business-critical applications, not to protect against unplanned downtime, but to avoid prolonged outages during planned maintenance.
Recovery Point Objective and Recovery Time Objective
The recovery point objective (RPO) of an application indicates how much data loss is acceptable in the event of a failure.
For a data warehouse that supports a reporting application, for example, this may be an extended period, such as 24 hours, given that it may only be updated once per day by an ETL process and all other activity is read-only reporting. For highly transactional systems, however, such as an OLTP database supporting trading platforms or web applications, the RPO will be zero. An RPO of zero means that no data loss is acceptable.
Applications may have different RPOs for high availability and for disaster recovery. For example, for reasons of cost or application performance, an RPO of zero may be required for a failover within the site. If the same application fails over to a DR data center, however, five or ten minutes of data loss may be acceptable. This is because of differences in the technologies used to implement intra-site availability and inter-site recovery.
The recovery time objective (RTO) for an application specifies the maximum amount of time an application can be down before recovery is complete and users can reconnect. When calculating the achievable RTO for an application, you need to consider many aspects. For example, it may take less than a minute for a cluster to fail over from one node to another and for the SQL Server service to come back up; however, it may take far longer for the databases to recover. The time it takes for databases to recover depends on many factors, including the size of the databases, the number of databases within an instance, and how many transactions were in flight when the failover occurred. This is because all uncommitted transactions need to be rolled back.
If you must restore a database, the worst-case scenario is that the achievable point of recovery may be the time of the last backup. This means that you must factor a hard business requirement for a specific RPO into your backup strategy.
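One way to sanity-check a design against these objectives is sketched below. It is a simplification under stated assumptions: worst-case data loss is taken to be the full interval between backups, and the achievable recovery time is taken to be failover time plus database recovery time. All function names and figures are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses everything since the last one."""
    return backup_interval


def estimated_recovery_time(failover: timedelta, database_recovery: timedelta) -> timedelta:
    """Users can reconnect only after failover and database recovery both finish."""
    return failover + database_recovery


def meets_objectives(backup_interval: timedelta, failover: timedelta,
                     database_recovery: timedelta, rpo: timedelta, rto: timedelta) -> bool:
    """Check the assumed worst cases against the RPO and the RTO."""
    return (worst_case_data_loss(backup_interval) <= rpo
            and estimated_recovery_time(failover, database_recovery) <= rto)


# Hypothetical design: backups every 15 minutes against a 10-minute RPO.
ok = meets_objectives(
    backup_interval=timedelta(minutes=15),
    failover=timedelta(seconds=45),
    database_recovery=timedelta(minutes=4),
    rpo=timedelta(minutes=10),
    rto=timedelta(minutes=5),
)
print(ok)
```

In this hypothetical case the recovery time fits within the RTO, but the backup interval is longer than the RPO, so the backup schedule would need to be tightened.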
Cost of Downtime
It is never possible to guarantee zero downtime, and once you begin to explain the costs associated with the different levels of availability, it starts to get easier to negotiate a mutually acceptable level of service.
The key factor in deciding how many 9s you should try to achieve is the cost of downtime. Two categories of cost are associated with downtime: tangible costs and intangible costs. An example of a tangible cost is revenue lost because the sales staff cannot take orders.
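To make the tangible side concrete, the sketch below estimates yearly lost revenue at several levels of availability. It assumes a hypothetical flat revenue of 1,000 per hour, a 365-day year, and that the whole downtime allowance is used while orders would otherwise be taken. The function name is also hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumed 365-day year


def annual_tangible_cost(availability_pct: float, revenue_per_hour: float) -> float:
    """Estimate yearly lost revenue if the full downtime allowance is consumed."""
    downtime_hours = HOURS_PER_YEAR * (1 - availability_pct / 100)
    return downtime_hours * revenue_per_hour


# Hypothetical revenue figure, used only for illustration.
for level in (99.0, 99.9, 99.99):
    cost = annual_tangible_cost(level, revenue_per_hour=1000)
    print(f"{level}% availability: about {cost:,.0f} in lost revenue per year")
```

Comparing figures like these with the cost of implementing each level of availability gives both sides something concrete to negotiate over.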
Intangible costs are more difficult to quantify but can be far more expensive. For example, if a customer is unable to place an order with your company, they may place their order with a rival company and never return. Other intangible costs can include loss of staff morale, which leads to higher staff turnover, or even loss of company reputation. Because intangible costs, by their very nature, can only be estimated, the industry rule of thumb is to multiply the tangible costs by three and use this figure to represent your intangible costs.
Classification of Standby Servers
There are three classes of standby solution.