Which disaster recovery method offers almost zero probability of downtime?

Downtime Costs – These include any lost productivity caused by failure (i.e., downtime) of the messaging systems. These include both scheduled and unscheduled downtime…. For the purposes of this study, we assume that unscheduled downtime affects 25% of the total user population, whereas scheduled downtime affects only the messaging IT staff.

Assuming most readers do not purchase the full study, this is the extent of background information provided by Radicati for enterprises considering $20 to $70 of downtime for each user per year. The entire Microsoft or IBM email solution, by Radicati's estimate, costs each organization about $279 to $285 TCO per user, per year. No wonder TCO is confusing, because the brief explanation describes a potential downtime cost in the range of 7% to 25% of the total solution cost. (Incidentally, if we ever implement a system where downtime costs that much, someone is losing his or her head over it!)
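For reference, the arithmetic behind that range is straightforward: $20 out of roughly $280 per user per year is about 7%, and $70 out of roughly $280 is about 25%.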

In fact, Radicati seems to agree. The same document later provides a review of service provider environment systems. That study gives the following definition for downtime:

Downtime Costs – These include time spent by full-time administrators dealing with system failures (i.e., unscheduled downtime) as well as scheduled downtime. We assume that both scheduled and unscheduled downtime affects all full-time messaging administrators. We do not attempt to measure the effect on the subscribers, though here the impact of higher downtime probably translates into higher subscriber attrition.

The dichotomy here is stunning: In an enterprise, you must pay for a user's downtime, but as a service provider you may summarily dismiss this expectation (although you may lose subscribers if service is poor). In reality, few organizations find themselves accepting either the black or the white explanation; most find themselves considering a shade of gray. In BCO, you must address the second definition of downtime just listed (“full-time messaging administrators”); in other words, people and process time. Unlike “lost productivity,” these variables can be easily measured by recording how long it takes to restore service.


URL: https://www.sciencedirect.com/science/article/pii/B9781555582890500112

Disaster Recovery

Scott R. Ellis, Lauren Collins, in Computer and Information Security Handbook (Third Edition), 2013

Steps in the Risk Process

There are five steps a company should consider in order to come out ahead when a disaster hits, avoid risk, and protect its data. The following checklist (see checklist: An Agenda for Action for Risk Assessment) lists these steps, from assessment to planning, architecting, specifying, and implementing a full-bodied DR solution.

Downtime presents serious consequences for businesses, no matter what their function may be. It is difficult, if not impossible, to recoup lost revenue and rebuild a corporate reputation that is damaged by an outage.

An Agenda for Action for Risk Assessment

Steps in risk assessment (check all tasks completed):

_______1.

Discover the potential threats:

_______a.

Environmental (tornado, hurricane, flood, earthquake, fire, landslide, epidemic).

_______b.

Organized or deliberate disruption (terrorism, war, arson).

_______c.

Loss of utilities or services (electrical power failure, petroleum shortage, communications services breakdown).

_______d.

Equipment or system failure (internal power failure, air conditioning failure, production line failure, equipment failure).

_______e.

Security threat (leak of sensitive information, loss of records or data, cyber-crime).

_______f.

Supplementary emergency situations (workplace violence, public transportation disruption, health and safety hazard).

_______2.

Determine requirements:

_______a.

Prioritize processes

_______b.

Determine recovery objectives

_______c.

Plan for common incidents

_______d.

Communicate the plan

_______e.

Choose individuals who will test the plan regularly and act in the event of a disaster

_______3.

Understand DR options:

_______a.

Determine how far to get the data out of the data center.

_______b.

Will the data center be accessible at the time of the disaster?

_______c.

Determine the process to back up and/or replicate data off-site.

_______d.

Determine the process to recreate an environment off-site.

_______4.

Audit providers:

_______a.

Compare list of providers with internal list of requirements.

_______b.

Understand range of data protection solutions offered.

_______c.

Assess proximity (power grid/communications and contingencies).

_______d.

Review data center hardening features and their DR contingencies.

_______5.

Record findings, implement/test, and revise if/as necessary:

_______a.

Documentation is the heart of your plan.

_______b.

Test and adjust plan as necessary, and record findings.

_______c.

As the environment changes and business needs change, revise the plan and test again.

While professionals cannot expect to avoid every downtime event, the majority of system downtime is caused by preventable failures. Distinguishing between planned and unplanned system downtime calls for different procedures, since the two present very different paths when bringing systems back up.

While both planned and unplanned downtime can be stressful, planned downtime merely has to be finished on time; unplanned downtime is the worst. It can be good for teaching troubleshooting techniques to junior IT staff, but it can be very frustrating to the workforce. The authors see fewer techs with troubleshooting experience and more senior techs launching their own consultancies. The biggest cost is the effect on customers and the impression that a severe outage makes on the organization's customers and clients. Planning, having people on staff with good troubleshooting skills, and documenting how the issue was found and fixed will help resolve the issue faster next time.


URL: https://www.sciencedirect.com/science/article/pii/B9780128038437000363

Library management systems

Stuart Ferguson, Rodney Hebels, in Computers for Librarians (Third Edition), 2003

Down-time

Down-time is any period when the computer on which a system is based is not operating. This may be because of a serious hardware or software problem or simply because of power failure. For such cases, libraries require a back-up system. This may be manual, for example, the provision of transaction sheets on which staff write details of loans and returns (that is, borrower and item numbers), for later keying into the system. Many library management systems, however, provide automated back-up: for example, a portable unit that can store transaction data in machine-readable form until the main system is operating again. The data can then be uploaded to the main circulations control system in the correct sequence.

This should not be confused with another kind of back-up, which is the copying of data and software in case these are lost: for instance, from a magnetic hard disk to magnetic tape. This kind of back-up is not peculiar to circulations control subsystems and is discussed in more general terms in Chapter 5.


URL: https://www.sciencedirect.com/science/article/pii/B9781876938604500100

Nagios 3

Max Schubert, in Nagios 3 Enterprise Network Monitoring, 2008

Migrating from Nagios 2 to 3

If you have a current installation of Nagios 2, you can install Nagios 3 and leverage your existing configuration without having to retune your deployment for your network. Although possible, this is not recommended, as you will miss out on many of the enhancements in Nagios 3.

There are several important points to consider prior to upgrading your Nagios 2 installation to Nagios 3. The service_reaper_frequency variable in the main configuration file has been renamed to check_result_reaper_frequency. This option allows you to control the frequency in seconds of check result reaper events. Reaper events process the results from host and service checks that have finished executing. These events constitute the core of the monitoring logic in Nagios.
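In practice, the rename is a one-line change in the main configuration file. The following nagios.cfg fragment is only an illustration; the value of 10 seconds is an example, not necessarily your current setting:

# Nagios 2 (old directive name)
service_reaper_frequency=10

# Nagios 3 (new directive name)
check_result_reaper_frequency=10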

The $NOTIFICATIONNUMBER$ macro has been deprecated in favor of the new $HOSTNOTIFICATIONNUMBER$ and $SERVICENOTIFICATIONNUMBER$ macros. The $HOSTNOTIFICATIONNUMBER$ macro is the current notification number for the host. The notification number increases by one each time a new notification is sent out for the host, with the exception of acknowledgments, which do not cause the notification number to increase. The $SERVICENOTIFICATIONNUMBER$ macro is the current notification number for the service. The notification number increases by one each time a new notification is sent out for the service, with the exception of acknowledgments, which do not cause the notification number to increase.
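If you reference the old macro in your object configuration, you will need to switch to the new names. As an illustration only (not taken from the chapter), a service notification command might embed the new macro as shown below; the mail command, subject line, and the other macros are typical of Nagios sample configurations, so verify them against your own setup:

define command{
    command_name    notify-service-by-email
    command_line    /usr/bin/printf "%b" "Notification #$SERVICENOTIFICATIONNUMBER$: $SERVICEDESC$ on $HOSTNAME$ is $SERVICESTATE$\n" | /bin/mail -s "$SERVICESTATE$ alert for $HOSTNAME$/$SERVICEDESC$" $CONTACTEMAIL$
}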

Several directives, options, variables, and definitions have also been removed or deprecated and should no longer be used in Nagios 3. The parallelize directive in service definitions is now deprecated and no longer used, as all service checks are run in parallel. The aggregate_status_updates option has been removed. All status file updates are now aggregated at a minimum interval of one second. Extended host and extended service definitions have been deprecated. They are still read and processed by Nagios 3, but it is recommended that you move the directives found in these definitions to your host and service definitions, respectively.

The downtime_file file variable in the main configuration file is no longer supported, as scheduled downtime entries are now saved in the retention file. The comment_file file variable in the main configuration file is no longer supported, as comments are now saved in the retention file.

Tip

To preserve existing downtime entries and existing comments, stop Nagios 2 and append the contents of your old downtime and comment files to the retention file.
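A minimal sketch of that tip, assuming a default source install under /usr/local/nagios (check your Nagios 2 nagios.cfg for the actual downtime_file, comment_file, and state_retention_file paths before running anything like this):

# /sbin/service nagios stop

# cat /usr/local/nagios/var/downtime.dat >> /usr/local/nagios/var/retention.dat

# cat /usr/local/nagios/var/comments.dat >> /usr/local/nagios/var/retention.dat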

Upgrading Using Nagios 3 Source Code

One way to upgrade your Nagios 2 deployment to Nagios 3 is to download the latest source code from the Nagios project's SourceForge.net page. The downloaded archive can be obtained using any Internet-connected system and transferred to your Nagios server, or it can be downloaded directly to your Nagios server using the wget command:

# wget http://osdn.dl.sourceforge.net/sourceforge/nagios/nagios-3.tar.gz

Depending on the current Nagios release, or the Nagios release you wish to download, you will have to adjust the filename accordingly. Once downloaded, you need to extract the files from the archive and install the Nagios software. If your server does not have the necessary development and dependent packages installed, the installation may not complete or operate as expected. At the time of this writing, regardless of your operating system type, the following dependencies must be installed prior to installing Nagios 3: the Apache HTTP server, the GCC compiler and development libraries specific to your distribution, and the GD graphics library.

Note

SourceForge.net is a source code repository and acts as a centralized location for software developers to control and manage open source software development. The Nagios project page on SourceForge is located at http://sourceforge.net/projects/nagios/

The Apache HTTP server is required to provide a Web interface to manage your Nagios deployment. Some operating system distributions recommend certain versions of the Apache HTTP server over others. For example, when installing Nagios on an Ubuntu Linux or openSUSE distribution, Apache2 is recommended. Some older Linux distributions may not have the capability to run the Apache2 release, and you may be forced to install Apache 1.3 instead.

The GNU Compiler Collection (GCC) is a set of compilers used to compile the raw Nagios code into a working application. Without the development libraries GCC relies on to build the application, the Nagios compile, and subsequent installation, will fail.

Tip

If your Unix, Linux, or BSD operating system has a package management utility installed, you usually need only specify that the GCC and development “tools” packages be installed. The package management utility is usually smart enough to automatically resolve any dependency issues for you.

The GD graphics library is an open source code library for the dynamic creation of images by programmers. Nagios uses the GD graphics library to generate the graphical representations of your collected data so it is easy to work with.
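On a package-managed system, these dependencies can usually be satisfied in a single step. As a rough sketch only, on a Debian- or Ubuntu-family distribution something like the following would typically cover them (exact package names vary by distribution and release, so check your own package repository):

# apt-get install build-essential apache2 libgd-dev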

With the dependencies satisfied, and the Nagios archive downloaded, all that remains is to extract the archive and install it using the following commands:

# tar xzf nagios-3.tar.gz

# cd nagios-3

# ./configure --with-command-group=nagcmd

# make all

# make install

# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

# /sbin/service nagios restart

If there are no errors generated during the compilation or installation, your Nagios installation has succeeded. If, for some reason, you do receive errors, please review the exceptions for hints on how to resolve the issue and try the installation again.

If you alpha- or beta-tested the Nagios pre-released code, you need not worry about starting your Nagios deployment from scratch. Using the same source code installation process you can upgrade your pre-released Nagios deployment to the generally available final release, or any subsequent release, without losing your configuration information.

Generally speaking, this means that when a development release of Nagios is released you will have the ability to update from your final release, to several development releases, and eventually, to the final release of the new Nagios code.

If this is a production server, however, it is probably a good idea not to install pre-released Nagios code as there may be instabilities and vulnerabilities in the development version of Nagios.

Upgrading from an RPM Installation

The team behind Nagios releases the latest and greatest code in the form of compressed source code archives. Package-based releases for various operating systems—such as RPM for Red Hat distributions or DEB files for Debian distributions—are developed by members of the Nagios community and are usually driven by community demand.

To upgrade from your package-based Nagios 2 release to the source-based Nagios 3, you need to:

1

Back up your Nagios 2 configuration, retention, and log files. See the Backing up Your Nagios 2 Configuration Files section earlier in this chapter; a minimal command sketch also follows this list.

2

Uninstall the Nagios 2 package using the package management tools specific to your operating system distribution. For example, if using a Red Hat based Linux distribution, you could use the rpm -e command to uninstall the Nagios 2 package.

3

Install Nagios 3 from source. See the Upgrading Using Nagios 3 Source Code section earlier in this chapter.

4

Restore your Nagios 2 configuration, retention, and log files.

5

Verify your Nagios 3 configuration. Since we have copied an archived version of your Nagios 2 files, we should verify that there are no conflicting configuration issues by using the command:

# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Tip

If there is an error in your configuration file, the error generated by the nagios -v command will point you to the line in the configuration file that appears to be causing the problem. If a warning is encountered, the check will still pass, as warnings are typically recommendations rather than issues.

6

Start your Nagios 3 server. Now that you have verified that your configuration file will work with your new Nagios 3 installation, run the following command to start the server:

# /sbin/service nagios restart
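For steps 1 and 4, a minimal backup-and-restore sketch might look like the following. The paths are assumptions: packaged Nagios 2 installs commonly keep configuration under /etc/nagios and state and log files under /var/log/nagios, but verify your distribution's actual layout first:

# tar czf /root/nagios2-backup.tar.gz /etc/nagios /var/log/nagios

After installing Nagios 3 from source, extract the archive and copy the configuration, retention, and log files into the new /usr/local/nagios/etc and /usr/local/nagios/var directories before running the verification command in step 5.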

Converting Nagios Legacy Perl Plug-ins

The Nagios software employs plug-ins to perform checks on managed hosts and services. These plug-ins may be either compiled executables or human-readable scripts written in Perl or any of the Unix shells. For Perl-based plug-ins, Nagios provides the option of having the plug-ins interpreted via embedded Perl for Nagios (ePN).

If your Nagios installation is not using ePN, there is nothing you need to do to use your plug-ins with Nagios 3. If, however, you have Perl plug-ins that you wrote for Nagios 2 running under ePN, you will need to modify your plug-ins to specify that they wish to use ePN, or set the variable use_embedded_perl_implicitly to 1 in the nagios.cfg configuration file. Add one of the following lines within the first 10 lines of your Perl plug-in to instruct Nagios either to execute the plug-in with ePN or to execute it by calling an external Perl interpreter:

# Use embedded Perl for Nagios (ePN)

# nagios: +epn

or

# Do NOT use ePN; use the Perl interpreter outside of Nagios

# nagios: -epn


URL: https://www.sciencedirect.com/science/article/pii/B9781597492676000012

Implementation and Transition to Operations

Christopher Longhurst, Christopher Sharp, in Practical Guide to Clinical Computing Systems (Second Edition), 2015

7.3 Business Resumption

After a downtime, the transition back to standard operations requires planning such that systems are restored, standard activities are resumed, and data captured during downtime is managed via archive or back-dated entry into the EHR. Operational decisions must be reached regarding which data may be captured and stored longitudinally via non-discrete forms, such as paper, and which shall be abstracted into the EHR via back-entry. Back-entry of discrete data can support longitudinal review, tracking and trending, clinical decision support, and other workflow and analytics needs. However, it comes with a risk of errors of commission in data entry, as well as errors of omission where the process is not uniformly adopted. Therefore, policy and procedure must clarify these steps, and staff must be trained, the procedures practiced, and the knowledge sustained as a preparedness measure.


URL: https://www.sciencedirect.com/science/article/pii/B9780124202177000079

Tools for Your Toolbox

Kelly C. Bourne, in Application Administrators Handbook, 2014

26.2.12 Restart the application if a critical service goes down

If the application downtime needs to be minimized, you could write a script to restart it if a critical process or service stops. When the monitoring tool, either something you wrote or a third-party tool, notices that a critical service has stopped, then it would restart the application. If your application involves more than a single service, it’s probably a good precaution to stop all of the processes or services and then restart them all to make sure they are started up in the correct sequence. If the vendor has provided scripts to start and stop the application, that definitely reduces the amount of work you have to do.
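As a rough illustration, such a watchdog can be as simple as the following shell sketch. The service names, restart order, and polling interval are placeholders; substitute whatever your application and any vendor-provided start/stop scripts actually use:

#!/bin/sh
# Watch one critical service; if it stops, bounce the whole stack in order.
CRITICAL="myapp-core"
SERVICES="myapp-db myapp-core myapp-web"   # start-up order matters

while true; do
    if ! service "$CRITICAL" status >/dev/null 2>&1; then
        # Stop everything, then start again in the documented sequence.
        for s in $SERVICES; do service "$s" stop; done
        for s in $SERVICES; do service "$s" start; done
    fi
    sleep 60
done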


URL: https://www.sciencedirect.com/science/article/pii/B9780123985453000261

Dependability Architecture

Bruce Powel Douglass Ph.D., in Real-Time UML Workshop for Embedded Systems (Second Edition), 2014

6.1 Overview

Smart systems are automating many processes that used to be reserved for the monitoring and intervention of highly trained personnel. This adds tremendous benefits in terms of cost and capability, but can we depend upon these systems? Dependability refers to the confidence with which we can entrust our lives to automated systems. Dependability has three primary aspects: safety, reliability, and security.

Reliability is a measure of the “uptime” or “availability” of a system – specifically, it is the probability that a computation will successfully complete before the system fails. It is normally estimated with mean time between failure (MTBF) or a related measure known as availability. MTBF is a statistical estimate of the probability of failure, and applies to stochastic failure modes.
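As a point of reference (the relation below is the standard steady-state availability formula, not something stated in this excerpt): availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair. A system with an MTBF of 1,000 hours and an MTTR of 1 hour therefore has an availability of roughly 0.999, or “three nines.”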

Reducing the system downtime increases reliability by improving the MTBF. Redundancy is one design approach that increases availability because if one component fails, another takes its place. Of course, redundancy only improves reliability when the failures of the redundant components are independent.1 The situation in which a single failure can bring down multiple components is called a common-mode failure. One example of a common-mode failure is running software for both the primary and secondary processing on the same CPU – should the processor fail, then both components will fail. In reliability analysis, great care must be taken to avoid common-mode failures or to provide additional redundancy in the event that an element common to all redundant components fails.

The reliability of a component does not depend upon what happens after the system fails. That is, regardless of the impact of the failure, the reliability of the system remains the same. Clearly the primary concern relative to the reliability of a system is the availability of its functions to the user.

Safety is very different from reliability, but a great deal of analysis affects both safety and reliability. A safe system is one that does not incur too much risk of loss, either to persons or equipment. A hazard is an undesirable event or condition that can occur during system operation. Risk is a quantitative measure of how dangerous a system is and is usually specified as

Risk = Hazard severity × Hazard likelihood

The failure of a jet engine is unlikely, but the consequences can be very severe. Overall, the risk of flying in a plane is tolerable because even though it is unlikely that you would survive a crash from 30,000 feet, such an incident is extremely unlikely. At the other end of the spectrum, there are events that are common, but are of lesser concern. A battery-operated radio has a hazard of electric shock, but the risk is acceptable because even though the likelihood of the hazard manifesting is relatively high, its severity is low.2

Faults come in two flavors. Errors are systematic faults introduced in analysis, design, implementation, or deployment. By “systematic,” we mean that the error is always present, even though it may not always be manifest. In contrast, failures are random errors that occur when something breaks. Hardware exhibits both errors and failures, but software exhibits only errors. The distinction between error and failure is important because different design patterns optimize the system against these concerns differently.

The key to managing both safety and reliability is redundancy. Redundancy improves reliability because it allows the system to continue to work in the presence of faults. Simply, the redundant system elements can take over the functionality of faulty ones and continue to provide system functionality. For improving safety, additional elements are needed to monitor the system to ensure that it is operating properly and possibly other elements are needed to either shut down the system in a safe way or take over the required functionality. The goal of redundancy used for safety is different – the concern is not about continuing to provide functionality, but instead to ensure that there is no loss (to either persons or equipment).

The example I like to use to demonstrate the difference is the handgun versus my ancient Plymouth station wagon. The handgun is a highly reliable piece of equipment – most of them fire when dirty or even under water. It is, however, patently not very safe, since even in the absence of a fault, you can (and people do) shoot yourself in the foot. On the other hand, my enormous 1972-vintage station wagon (affectionately referred to as “The Hulk”) is the safest automobile on the planet. It has a fail-safe state3 (“OFF”) and it spends all of its time in that state. So while the vehicle is very safe, it is not at all reliable.

As with the other architectural dimensions, safety and reliability are achieved through the application of architectural design patterns.4 All design patterns have costs and benefits, and selecting good safety patterns requires balancing the design concerns, such as

Development cost

Recurring (manufacturing) cost

Level of safety needed

Level of reliability needed

Coverage of systematic faults (errors)

Coverage of random faults (failures)

Complexity

Resource demand

Ease of certification against relevant standards

In general, safety and reliability patterns can be categorized into either homogenous or heterogeneous patterns. The former creates exact replicas of the architectural elements to provide redundant processing, and adds glue logic to determine when and under what circumstances the replicas run. The latter patterns use different implementations, designs, or approaches to provide redundant processing. These systems can be further subdivided into lightweight or heavyweight patterns. Lightweight patterns use fewer resources but may not be able to provide the full functionality or fidelity of the primary system elements. Heavyweight redundancy replicates the full functionality but at a greater cost.

Security is a bit different from reliability or safety but intersects with both. Security of information is called information assurance, but security is a broader issue with embedded devices. Certainly, security in the IT sense – managing the wired and wireless connections to prevent intrusions – is important, but it is not the only concern. In cyberphysical systems, we must concern ourselves with more mundane threats (such as someone walking off with the device) and we must concern ourselves with the severity of the potential outcomes, as a breach of security at a nuclear power plant may compromise its safety. The solutions are likely to include a mixture of standard IT approaches and physical system security measures.

We will discuss details of these different aspects in more detail in the upcoming problems.


URL: https://www.sciencedirect.com/science/article/pii/B9780124077812000064

Recovery

Pierre Bijaoui, Juergen Hasslauer, in Designing Storage for Exchange 2007 SP1, 2008

Basic Recovery Rules

We recommend analyzing previous downtimes of your current Exchange infrastructure to understand potential failures and the weak points in your current environment. This is a prerequisite for designing your new infrastructure and defining the recovery strategy of your new solution.

You should document the configuration of your environment and create a change log to be able to verify why certain settings have been modified. An up-to-date documentation of the setup is very critical for recovery. It is not possible to back up everything and re-create everything by restoring all backup sets. For example, you might not have the same server hardware to recover an Exchange server. In this case, it is nearly impossible to successfully perform a system state restore of the Windows operating system. Documentation of the configuration is also required for other components like the backup library or the setup of the storage network switches.

Frequent verification of your recovery procedure to identify whether a configuration change affected your ability to recover your systems and data is critical. Do not spend thousands of dollars on a leading-edge, geographically dispersed Exchange cluster deployment including data replication, but forget to document the necessary steps to perform a site failover in case a data center outage occurs. Finally, do not forget the infrastructure services that Exchange is relying on. Verifying the failover of your Exchange mailbox cluster by unplugging the power cord from the active cluster node is not sufficient. You have to unplug the power cord of your Active Directory (AD) servers, Hub Transport (HT) server, Client Access Servers (CAS), storage array, network switches, and so on.

Running recovery fire drills is difficult in a production environment. Therefore, you should have a test environment that you also use for validation of service packs before you install them on your live systems.


URL: https://www.sciencedirect.com/science/article/pii/B9781555583088000090

Design Your Operations for Six Sigma Manufacture

M. Joseph Gordon Jr., in Six Sigma Quality for Business and Manufacture, 2002

Severity

P-FMEA: typical results are process down time or high defect rates being produced.

M-FMEA: internal down time may be short, or it may require extensive time that affects the production schedule if a replacement part cannot be interchanged to keep the process on-line. Also consider:

1.

Available parts in stock or on special order, with known supplier lead-time to deliver. Cost of critical spares in your inventory with established frequency of failure. Each equipment supplier should have estimated MTBF (mean time between failures) rates for their parts.

2.

Installation time for the repair to be completed after receipt of the part. This could also identify individual technicians' repair times to effect the repair, which may make a difference in how soon the system is running again. Also, the slower technician may require more intensive or special training.

3.

Any special installation tools (supplier specified) and testing, when needed, to verify that the part or fix was correctly installed. List the services of other maintenance support personnel when required and their equipment requirements.

Each company must establish its own severity rating system similar to Table 1. The rating should consider in-house stocked repair parts and lead-time penalties for non-stocked special parts. A down-time maximum limit can be established with the highest rating that may occur if special parts are required. From this maximum, the severity rating decreases as repair time decreases.

Table 1. Down Time Severity Ratings.

Duration of the Breakdown (in Days)

Equal to or more than    Less than    Severity Rating
5                        –            10
4                        5            9
3                        4            8
2                        3            5
1                        2            2
0                        1            1

Adapted from reference [2]

To keep ratings at their lowest, the following proactive actions would be necessary, depending on the situation.

1.

Carry replacement parts in duplicate; when one is used, an immediate reorder is initiated. On-hand inventory depends on the number of machines using similar parts and on MTBF rates for replacement.

2.

Bar code all replacement parts; at issue, they are scanned out of stock along with the repair technician's badge bar code. The transaction also alerts purchasing that a part has been taken out of stock and should be replaced at an interval previously established based on the MTBF records. This automatically tracks which inventoried parts were used and by which technicians, makes record keeping simpler, and ensures that used parts are replaced.

3.

Periodic maintenance, inspection, and verification if a high wear or failure rate per item is identified.

4.

Black box (pre-assembled) replacement modules for rapid replacement, typically analog or digital machine control units.

Only in the most sensitive customer relationships would stocking the customer's order in advance be justified, as for JIT using JIC (just-in-case) inventory methods, which are expensive and not always the best use of the company's money. This could be very expensive if and when an engineering change order comes in and the JIC inventory is no longer usable. This is upper management's decision, based on each individual customer's requirements for on-time delivery of their production units.


URL: https://www.sciencedirect.com/science/article/pii/B9780444510471500057

High Availability

In Virtualization for Security, 2009

Understanding High Availability

Before any discussion of how to provide for high availability can occur, you must first understand what high availability is and distinguish between planned and unplanned downtime. Planned downtime is downtime that has been scheduled and is expected in the environment. It is typically caused by system maintenance that, while disruptive to the overall system, usually can't be avoided. Reasons for planned downtime range from applying patches or configuration changes that may require a reboot to upgrading or replacing hardware. One benefit of planned downtime is that it can be more easily managed in order to minimize disruption. In many cases, as we will examine, virtualization can actually provide for zero downtime.

On the other hand, unplanned downtime is just the opposite of planned downtime. Unplanned downtime typically results from things like power outages, hardware failures, software crashes, network connectivity failures, security breaches, and operating system failures. While unplanned downtime cannot be easily predicted, it can be more easily recovered from in a virtual environment than in a physical environment.

Providing High Availability for Planned Downtime

While virtualization cannot provide for zero downtime high availability in all circumstances, it can provide for zero downtime in many circumstances. If you recall from previous chapters, the virtual machine is isolated from the underlying hardware and operating system by the hypervisor. This isolation allows the virtual machine to operate completely independently of the underlying hardware. Consequently, the actual host is effectively irrelevant. As long as there is a host available with the capacity necessary for the virtual machine to operate, the virtual machine can be run there.

This capability creates the first scenario for providing high availability in a virtual environment. Ultimately what matters is that the applications running in the virtual machine remain available. If you can do that, you have achieved zero downtime. So in circumstances where the downtime is planned and does not require the virtual machine itself to be rebooted or taken offline in any manner, providing high availability is as simple as relocating the virtual machine to another host, performing whatever maintenance is required, then bringing the host back online. This is depicted in Figure 11.1.


Figure 11.1. How High Availability Works

Virtual machines are running on both HOSTA and HOSTB. If HOSTA needs to be taken down for maintenance, VM1 can be moved to HOSTB while the relevant maintenance is performed on HOSTA. HOSTA can then be brought back online returning the overall virtualization environment to its original capacity. In many cases the migration of the virtual machine between hosts can be done with no downtime using “live migration” technologies such as VMware VMotion. VMotion requires shared storage to be configured for all of the hosts, which in turn allows the virtual machine to be migrated between hosts while it remains online and operational. In many cases the migration occurs without any indication that the virtual machine has been moved.

Obviously, if the virtual machine itself requires maintenance (for example, applying patches to the virtual machine that require a reboot), you cannot have zero downtime. However, if you need to replace or upgrade hardware on the host, apply patches to the host, or reconfigure the network on the host, virtualization can truly provide for zero downtime maintenance.

Providing High Availability for Unplanned Downtime

In a perfect world we would anticipate potential downtime and plan accordingly to minimize or eliminate the impact to the user community. Of course we do not live in a perfect world, which means that sooner or later an unplanned and unexpected outage is going to occur. In this circumstance, while virtualization cannot typically prevent the downtime, it can frequently minimize the amount of time the systems are down.

Similar to the planned downtime scenario, because the virtual machines are independent of the underlying hardware, if a failure occurs on a given host most virtualization vendors provide a mechanism such as VMware High Availability Clustering to automatically identify the host failure and bring the virtual machines online on a different host. While this will not prevent the downtime, because of the automation that virtualization provides the amount of downtime is typically a fraction of what it would be on a physical system.

Which of these disaster recovery approaches has the lowest downtime?

Multi-site Active/Active is the most reliable DR solution; it is the only strategy that can guarantee near-zero downtime and near-zero data loss.

Which disaster recovery approach would result in the least downtime during a disaster recovery event?

A pilot light approach minimizes the ongoing cost of disaster recovery by minimizing the active resources, and simplifies recovery at the time of a disaster because the core infrastructure requirements are all in place.

Which among the following is used for disaster recovery in AWS?

Disaster recovery services like AWS Elastic Disaster Recovery can move a company's computer processing and critical business operations to its own cloud services in the event of a disaster.

What is RPO and RTO in AWS?

Resiliency can be defined in terms of metrics called RTO (Recovery Time Objective) and RPO (Recovery Point Objective). RTO is a measure of how quickly can your application recover after an outage and RPO is a measure of the maximum amount of data loss that your application can tolerate.