Wiki source code of High Availability and Disaster Recovery Concepts

Last modified by Billie D on 2021/02/12 14:22
{{box cssClass="floatinginfobox" title="**Contents**"}}
{{toc /}}
{{/box}}

[[image:About Us.WebHome@myitguide_small_01.jpg]]

= High Availability =

To mitigate these risks, business-critical applications often rely on redundant hardware to provide fault tolerance. If the primary system fails, then the application can __automatically fail over__ to the redundant system. This is the underlying principle of **high availability** (**HA**).

= Disaster Recovery =

In these situations, some applications may rely on __restoring the latest backup to recover__ as much data as possible. However, more critical applications may require a redundant server to hold a synchronized copy of the data in a secondary location. This is the underpinning concept of **disaster recovery** (**DR**).  


The amount of time that a solution is available to end users is known as the level of availability, or uptime.

Table 1-1 details the amount of acceptable downtime per week, per month, and per year for each level of availability.

[[attach:LevelsofAvailability.sql||target="_blank"]]

[[image:SQLDA001.JPG||height="421" width="1030"]]
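The arithmetic behind a table of this kind can be sketched in a few lines. This is a minimal illustration, not a reproduction of Table 1-1 or of the attached script: the function name, the 365-day year, and the "average month" of one twelfth of a year are assumptions made for the example, so the figures may differ slightly from the table.

```python
# Allowed downtime for a given level of availability (percentage uptime).
# Assumptions for this sketch: a 365-day year and an average month of 1/12 year.
MINUTES_PER_WEEK = 7 * 24 * 60
MINUTES_PER_YEAR = 365 * 24 * 60
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12


def allowed_downtime_minutes(availability_pct: float) -> dict:
    """Return the acceptable downtime in minutes per week, month, and year."""
    downtime_fraction = 1 - availability_pct / 100
    return {
        "week": downtime_fraction * MINUTES_PER_WEEK,
        "month": downtime_fraction * MINUTES_PER_MONTH,
        "year": downtime_fraction * MINUTES_PER_YEAR,
    }


if __name__ == "__main__":
    for level in (99.0, 99.9, 99.99, 99.999):
        d = allowed_downtime_minutes(level)
        print(f"{level}%: {d['week']:.2f} min/week, "
              f"{d['month']:.2f} min/month, {d['year']:.2f} min/year")
```

Each extra 9 divides the acceptable downtime by ten, which is why every step up in availability is considerably harder to achieve than the previous one.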


= Actual Availability =

Now that we understand how to calculate the level of availability required by an application, we should also understand how to calculate the actual availability of an application. We can do this by discovering the **MTBF** (**Mean Time Between Failures**) and **MTTR** (**Mean Time to Recover**) metrics.

The **MTBF** metric describes the average length of time between failures.

**MTTR** can actually have two different meanings:

* Mean Time to Recover
* Mean Time to Repair

When there is __no HA or DR in place__, the MTTR metric describes the average length of time it takes for something that is broken to be **repaired**. This could potentially be an extended period of time, as we may need to wait for a service engineer to replace faulty hardware.

When thinking about __HA and DR__, however, we use the MTTR metric to mean Mean Time to **Recover**, which records the duration of the outage. We can calculate the MTTR by taking the sum of the total downtime within our period and dividing it by the number of failures.
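The two metrics can be combined into an actual availability figure. The sketch below follows the MTTR calculation described above. The MTBF calculation (uptime divided by the number of failures) and the formula availability = MTBF / (MTBF + MTTR) are common conventions assumed for this example rather than taken from this page, and the function names and sample outage durations are hypothetical.

```python
from typing import Sequence


def mttr_minutes(outage_minutes: Sequence[float]) -> float:
    """Mean Time to Recover: total downtime divided by the number of failures."""
    return sum(outage_minutes) / len(outage_minutes)


def mtbf_minutes(period_minutes: float, outage_minutes: Sequence[float]) -> float:
    """Mean Time Between Failures, assumed here to be uptime / number of failures."""
    uptime = period_minutes - sum(outage_minutes)
    return uptime / len(outage_minutes)


def actual_availability_pct(mtbf: float, mttr: float) -> float:
    """Actual availability as a percentage, using MTBF / (MTBF + MTTR)."""
    return 100 * mtbf / (mtbf + mttr)


if __name__ == "__main__":
    period = 30 * 24 * 60          # a hypothetical 30-day period, in minutes
    outages = [10.0, 20.0, 30.0]   # hypothetical outage durations, in minutes
    mttr = mttr_minutes(outages)
    mtbf = mtbf_minutes(period, outages)
    print(f"MTTR: {mttr:.1f} min, MTBF: {mtbf:.1f} min, "
          f"availability: {actual_availability_pct(mtbf, mttr):.3f}%")
```

Comparing this result with the required level of availability shows whether the current solution meets the target for the period measured.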

= Service-Level Agreements and Service-Level Objectives =

When a third-party provider is responsible for managing servers, the contract usually includes **service-level agreements** (**SLA**s). These SLAs define many parameters, including __how much downtime is acceptable__, the __maximum length of time__ a server can be down in the event of failure, and __how much data loss is acceptable__ if failure occurs. Normally, there are financial penalties for the provider if these SLAs are not met.

In internal scenarios, however, it is much more common for IT departments to negotiate **service-level objectives** (**SLO**s) with the business teams, as opposed to SLAs. SLOs are very similar in nature to SLAs, but their use implies that the business does not impose financial penalties on the IT department if they are not met.

= Proactive Maintenance =

It is important to remember that downtime is caused not only by failure but also by proactive maintenance. For example, if you need to patch the operating system, or SQL Server itself, with the **latest service pack**, then you must have some downtime during installation.

In this situation, high availability is essential for many business-critical applications – not to protect against unplanned downtime, but to avoid prolonged outages during planned maintenance.

= Recovery Point Objective and Recovery Time Objective =

The **recovery point objective** (**RPO**) of an application indicates __how much data loss is acceptable__ in the event of a failure. 

For a __data warehouse__ that supports a __reporting application__, for example, this may be an __extended period, such as 24 hours__, given that it may only be __updated once per day by an ETL process__ and all other activity is read-only reporting. For highly transactional systems, however, such as an __OLTP database__ supporting trading platforms or web applications, the __RPO will be zero__. An RPO of zero means that no data loss is acceptable.

Applications may have different RPOs for high availability and for disaster recovery. For example, for reasons of cost or application performance, an RPO of zero may be required for a failover within the site. If the same application fails over to a DR data center, however, five or ten minutes of data loss may be acceptable. This is because of differences in the technologies used to implement intra-site availability and inter-site recovery.

The **recovery time objective** (**RTO**) of an application specifies the __maximum amount of time an application can be down__ before recovery is complete and users can reconnect. When calculating the achievable RTO for an application, you need to consider many aspects.

For example, it may take less than a minute for a cluster to fail over from one node to another and for the SQL Server service to come back up; however, it may take far longer for the databases to recover. The time it takes for databases to recover depends on many factors, including the size of the databases, the number of databases within an instance, and how many transactions were in flight when the failover occurred. This is because all uncommitted transactions need to be rolled back.

If you must restore a database, the worst-case scenario is that the achievable point of recovery may be the time of the last backup. This means that you must factor a hard business requirement for a specific RPO into your backup strategy.
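The relationship between backup frequency and RPO can be expressed as a simple check. This is a minimal sketch under the worst-case assumption stated above, namely that a restore recovers only to the time of the last backup, so the maximum data loss equals the interval between backups. The function name and the sample intervals are hypothetical.

```python
from datetime import timedelta


def backup_interval_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Return True if the worst-case data loss (one full backup interval)
    does not exceed the recovery point objective."""
    return backup_interval <= rpo


if __name__ == "__main__":
    # Hypothetical example: backups every 15 minutes against a 1-hour RPO.
    print(backup_interval_meets_rpo(timedelta(minutes=15), timedelta(hours=1)))
    # Hypothetical example: a daily backup against the same 1-hour RPO.
    print(backup_interval_meets_rpo(timedelta(hours=24), timedelta(hours=1)))
```

Under this assumption, no backup interval greater than zero can satisfy an RPO of zero, which is one reason such applications rely on a synchronized copy of the data rather than on restores alone.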

= Cost of Downtime =

It is __never possible to guarantee zero downtime__, and once you begin to explain the costs associated with the different levels of availability, it becomes easier to negotiate a mutually acceptable level of service.

The key factor in deciding how many 9s you should try to achieve is the cost of downtime. //Two categories of cost are associated with downtime//: tangible costs and intangible costs. **Tangible costs** include, for example, lost revenue because the sales staff cannot take orders.

**Intangible costs** are more difficult to quantify but can be far more expensive. For example, if a customer is unable to place an order with your company, they may place their order with a rival company and never return. Other intangible costs can include loss of staff morale, which leads to higher staff turnover, or even loss of company reputation. Because intangible costs, by their very nature, can only be estimated, the industry rule of thumb is to multiply the tangible costs by three and use this figure to represent your intangible costs.
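The rule of thumb above can be turned into a rough estimate. This is a minimal sketch: the function name, the hourly revenue figure, and the outage length are hypothetical, and adding the tangible and intangible figures into a single total is an assumption made for the example.

```python
def estimated_downtime_cost(tangible_cost: float,
                            intangible_multiplier: float = 3.0) -> dict:
    """Estimate the cost of an outage from its tangible cost.

    Intangible costs are estimated as tangible cost multiplied by three,
    following the rule of thumb; the total is the sum of the two.
    """
    intangible_cost = tangible_cost * intangible_multiplier
    return {
        "tangible": tangible_cost,
        "intangible": intangible_cost,
        "total": tangible_cost + intangible_cost,
    }


if __name__ == "__main__":
    lost_revenue_per_hour = 10_000.0   # hypothetical figure
    outage_hours = 2.0                 # hypothetical outage length
    cost = estimated_downtime_cost(lost_revenue_per_hour * outage_hours)
    print(cost)
```

An estimate of this kind, set against the cost of each additional 9, gives the business a concrete basis for choosing a level of availability.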

= Classification of Standby Servers =

There are three classes of standby solution.

[[image:SQLDA002.JPG||height="589" width="996"]]