Solution Architecture Best Practice: Using System Availability and Recovery Metrics

Before embarking on an IT project that introduces a new software package or expands an existing one, business leaders need to know the impact of such an initiative on revenues, labor costs, and capital budgets. A solution architecture design document (aka SAD) can help as long as it is part of an overall business impact or disaster recovery planning process. When drafting a solution architecture design document, metrics such as system availability, recovery time objective (RTO) and recovery point objective (RPO) can help determine the runtime characteristics the business wants to achieve. Non-technical business leaders and subject matter experts may not necessarily care about “the nines” (99.999% availability, for instance), but they do care about the revenue the company loses per hour, minute and second when the system (hardware and software as a whole) is offline, and about the labor costs of workers standing idle or having to resort to manual business process steps. Conversely, IT operations team members don’t necessarily care about these costs, but care more about the nines. For many, though, arriving at the right set of nines to assign to an IT project that introduces or expands a system is not exactly straightforward.

I’m offering an approach to help you assign a set of nines to your system availability objective. By “system,” I am referring to the combination of hardware and software. The following table provides industry-standard mappings of “nines” to acceptable down times for different availabilities for a given one-year period.

Availability    Approximate downtime per year
90%             40 days
99%             4 days
99.9%           9 hours
99.99%          50 minutes
99.999%         5 minutes
99.9999%        30 seconds
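The downtime figures in the table above follow from simple arithmetic: the unavailable fraction of the year multiplied by the hours in a year. Here is a minimal sketch of that calculation, assuming a 365-day year; the function name allowed_downtime is my own and not part of any standard.

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap, 365-day year


def allowed_downtime(availability_percent: float) -> timedelta:
    """Return the maximum downtime per year permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return timedelta(hours=HOURS_PER_YEAR * unavailable_fraction)


if __name__ == "__main__":
    # Print the exact downtime budget for each set of nines in the table.
    for nines in (90, 99, 99.9, 99.99, 99.999, 99.9999):
        print(f"{nines}% -> {allowed_downtime(nines)}")
```

The table rounds each result to one significant figure. For example, 90% availability works out to 36.5 days per year, which the table shows as 40 days.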

How do we know which of the sets of nines is applicable? It depends on the business subject matter experts, and in turn, they may rely on the operations team to supply data. But where neither the business SMEs nor the operations team have such numbers, a good rule of thumb is to first have the business SMEs tally the Line of Business (LOB) revenue per hour, minute or second of any given business process that would be impacted if the system in question went down. Have them do the same for labor costs per hour, minute and second. Don’t worry about downtimes just yet; we only want to know how much money is generated by the business process per hour/min/sec, and then the labor cost (or overhead costs, operating costs) per hour/min/sec.

Next, identify the cost of maintaining each of the sets of nines (the greater the number of nines, the greater the maintenance cost).

Finally, if the loss of revenues per hour/min/sec noticeably exceeds the cost of maintaining the desired nines, then it might be advisable to absorb the maintenance costs. In the absence of revenue figures, the project’s maintenance budget can be used instead, but with caution: the budget is almost always smaller than the company’s revenues for the impacted business process, so it may not reflect the revenue lost when a system goes down.

Labor costs should be used in a separate metric to identify the amount of money a company pays its employees when the system is unavailable. To recap, we have three system availability decision metrics to use from a business standpoint to help us arrive at a decision on which of the nines to choose:

Availability Decision Per Revenue
  1. Tally revenue generated per hour, minute and second
  2. Identify the cost of maintaining each of the sets of nines
  3. Availability Decision Ratio (ADR) = Revenues (R) / Cost of Nines (CoN), where a number greater than 1 indicates that the chosen set of nines is doable

Availability Decision Per Labor Costs
  Similar to Availability Decision Per Revenue above, except you use Labor Costs (LC) instead of Revenues

Availability Decision Per Maintenance Budget
  Similar to Availability Decision Per Revenue above, except you use Maintenance Budget (MB) instead of Revenues
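The ratio above can be sketched in a few lines. This is only an illustration: the dollar figures are hypothetical, the function names are my own, and I am assuming the numerator and the cost of nines are expressed over the same period (per hour here), since the ratio is only meaningful when the two sides use the same unit.

```python
def availability_decision_ratio(amount: float, cost_of_nines: float) -> float:
    """ADR = amount / CoN, where amount is R, LC, or MB.

    Assumption: both values cover the same period (e.g. dollars per hour).
    """
    if cost_of_nines <= 0:
        raise ValueError("cost of nines must be positive")
    return amount / cost_of_nines


def is_doable(adr: float) -> bool:
    """A ratio greater than 1 indicates the chosen set of nines is doable."""
    return adr > 1


if __name__ == "__main__":
    # Hypothetical figures, all in dollars per hour.
    revenue_per_hour = 12_000.0
    cost_of_nines_per_hour = {"99.9%": 400.0, "99.99%": 2_500.0, "99.999%": 15_000.0}

    for nines, cost in cost_of_nines_per_hour.items():
        adr = availability_decision_ratio(revenue_per_hour, cost)
        print(f"{nines}: ADR = {adr:.2f}, doable = {is_doable(adr)}")
```

The same function serves all three metrics: pass labor costs or the maintenance budget in place of revenue to get the other two ratios.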

Regarding recovery metrics, an article on Wikipedia does a great job in explaining them. I provide a snippet below, and invite you to go to http://en.wikipedia.org/wiki/Recovery_point_objective to read the rest. I have highlighted some sentences to call your attention to important principles.

The recovery time objective (RTO) is the duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.[1] It can include the time for trying to fix the problem without a recovery, the recovery itself, testing, and the communication to the users. Decision time for users representative is not included. RTO is spoken of as a complement of RPO (or Recovery point objective) with the two metrics describing the limits of acceptable or “tolerable” ITSC performance in terms of time lost(RTO) from normal business process functioning, and in terms of data lost or not backed-up during that period of time(RPO) respectively. The rule in setting an RTO should be that the RTO is the longest period of time the business can do without the IT Service in question.

A “recovery point objective” or “RPO”, is defined by business continuity planning. It is the maximum tolerable period in which data might be lost from an IT service due to a major incident.[1] The RPO gives systems designers a limit to work to. For instance, if the RPO is set to 4 hours, then in practice, offsite mirrored backups must be continuously maintained- a daily offsite backup on tape will not suffice. Care must be taken to avoid two common mistakes around the use and definition of RPO. Firstly, BC Staff use business impact analysis to determine RPO for each service – RPO is not determined by the existent backup regime. Secondly, when any level of preparation of offsite data is required, rather than at the time the backups are offsited- the period during which data is lost very often starts near the time of the beginning of the work to prepare backups which are eventually offsited.

How RTO and RPO values affect computer system design

The RTO and RPO form part of the first specification for any IT Service. The RTO and the RPO have a very significant effect on the design of computer services and for this reason must be considered in concert with all the other major system design criteria.

When assessing the abilities of system designs to meet RPO criteria, for practical reasons, the RPO capability in a proposed design is tied to the times backups are sent offsite- if for instance offsiting is on tape and only daily (still quite common), then 49 or better, 73 hours is the best RPO the proposed system can deliver, so as to cover for tape hardware problems (tape failure is still too frequent, one bad tape can write off a whole daily synchronisation point). Another example- if a service is to be properly set up to restart from any point (data is capable of synchronisation at all times) and offsiting is via synchronous copies to an offsite mirror data storage device, then the RPO capability of the proposed service is to all intents and purposes 0 hours- although it is normal to allow an hour for RPO in this circumstance to cover off any unforeseen difficulty.

If the RTO and RPO can be set to be more than 73 hours then daily backups to tapes (or other transportable media), that are then couriered on a daily basis to an offsite location, comfortably covers backup needs at a relatively low cost. Recovery can be enacted at a predetermined site. Very often this site will be one belonging to a specialist recovery company who can more cheaply provide serviced floor space and hardware as required in recovery because it manages the risks to its clients and carefully shares (or “syndicates”) hardware between them, according to these risks.

If the RTO is set to 4 hours and the RPO to 1 hour, then a mirror copy of production data must be continuously maintained at the recovery site and close to dedicated recovery hardware must be available at the recovery site- hardware that is always capable of being pressed into service within 30 minutes or so. These shorter RTO and RPO settings demand a fundamentally different hardware design- which is for instance, relatively much more expensive than tape backup designs.
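The excerpt's rule for daily tape backups can be expressed as a simple check. This is a rough sketch under my own naming; the only threshold it encodes is the 73-hour figure quoted above. The excerpt does not say which design applies between its two examples (more than 73 hours, and a 4-hour RTO with a 1-hour RPO), so the sketch does not guess.

```python
DAILY_TAPE_THRESHOLD_HOURS = 73  # figure taken from the quoted excerpt


def daily_tape_backup_suffices(rto_hours: float, rpo_hours: float) -> bool:
    """Return True when both objectives exceed the 73-hour threshold,
    the case in which daily couriered tape backups cover the need."""
    return (
        rto_hours > DAILY_TAPE_THRESHOLD_HOURS
        and rpo_hours > DAILY_TAPE_THRESHOLD_HOURS
    )


if __name__ == "__main__":
    # Hypothetical objectives for two services.
    print(daily_tape_backup_suffices(rto_hours=96, rpo_hours=96))
    print(daily_tape_backup_suffices(rto_hours=4, rpo_hours=1))
```

When the check fails, the excerpt points toward a costlier design, such as the continuously maintained mirror it describes for a 4-hour RTO and 1-hour RPO.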

