Cutting RTO from 5 Hours to 45 Minutes: What the Textbooks Don't Tell You

Five hours. That was the documented recovery time objective for our most critical systems when I joined the DR redesign project at a leading bank in Bangladesh. And I want to be straight with you: five hours was not the real problem. The real problem was that five hours assumed everything went perfectly during a recovery.

In practice, our drills regularly ran longer. Runbook assumptions did not match the current state of the environment. Access credentials for the DR site had expired. Key contacts had changed. The person with authority to formally declare a failover was not always reachable. In one drill, the first 90 minutes were spent figuring out who had up-to-date access credentials. Those 90 minutes were not a technology problem. They were a governance problem.

Eighteen months later, we were running consistent 45-minute recoveries for 60-plus critical applications across 187 branches, with a 95% drill success rate. And when a flood impacted one of our primary sites in 2024, the team executed the runbook under genuinely difficult conditions and hit the recovery target with zero data loss.

Here is what actually drove the improvement.

According to the Veeam 2024 Data Protection Trends Report, fewer than one in three organisations believe they could recover from even a small crisis within a week. The Uptime Institute's 2024 Resiliency Survey found that 48% of outages are caused by procedural failures rather than technical ones. The technology is usually not the weak link.

The five-hour RTO was a documentation problem, not a technical one

The first thing we did was run a proper dependency mapping exercise. We walked through every critical application and mapped its infrastructure dependencies, its replication state, the manual steps required for a failover, and the decision authorities at each step. This took about three weeks and was done with the relevant engineers and application owners, not just the documentation team.
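One way to make the output of a dependency mapping exercise usable during a failover is to store it as a graph and derive a recovery order from it. The sketch below is a minimal illustration, not the tooling we used: the application names and their dependencies are hypothetical, and it relies only on Python's standard graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what must be up before it.
DEPENDENCIES = {
    "core-banking": {"database-cluster", "identity-service"},
    "payment-gateway": {"core-banking", "network-dr-link"},
    "database-cluster": {"storage-replication"},
    "identity-service": set(),
    "storage-replication": set(),
    "network-dr-link": set(),
}


def recovery_waves(dependencies):
    """Group components into waves; items in one wave can be recovered in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the map contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(recovery_waves(DEPENDENCIES), start=1):
        print(f"Wave {number}: {', '.join(wave)}")
```

A cycle error here is useful in its own right: it usually means two systems each assume the other is recovered first, which is exactly the kind of gap the mapping exercise is meant to surface.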

What we found was consistent with what Uptime Institute's research describes. Roughly 40% of our total recovery time was people waiting for information. Where is the runbook? Who approves the failover? Is the DR environment in the state we think it is? Has the replication been verified recently?
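To put a number on waiting time, it helps to tag each step in a drill timeline as either blocked or productive and compute the share. The following is a rough sketch; the step names and durations are invented for illustration and are not taken from our drill logs.

```python
from dataclasses import dataclass


@dataclass
class DrillStep:
    description: str
    minutes: int
    waiting: bool  # True when the team was blocked on information or approval


def waiting_share(steps):
    """Return the fraction of total drill time spent waiting (0.0 to 1.0)."""
    total = sum(step.minutes for step in steps)
    if total == 0:
        return 0.0
    waiting = sum(step.minutes for step in steps if step.waiting)
    return waiting / total


# Illustrative timeline only; these durations are invented for the example.
EXAMPLE_DRILL = [
    DrillStep("Locate the current runbook", 25, waiting=True),
    DrillStep("Reach someone authorised to approve failover", 50, waiting=True),
    DrillStep("Confirm replication status", 45, waiting=True),
    DrillStep("Promote database replica", 70, waiting=False),
    DrillStep("Bring up application tier", 80, waiting=False),
    DrillStep("Run smoke tests", 30, waiting=False),
]

if __name__ == "__main__":
    print(f"Waiting share: {waiting_share(EXAMPLE_DRILL):.0%}")
```

Tracking this figure from drill to drill shows whether governance fixes are working independently of any infrastructure change.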

We addressed those questions without touching a single server. Updated runbooks. Pre-authorised decision trees with named alternates for every critical role. A tested communication tree with at least two contacts at every node. The result was a significant reduction in recovery time before we changed any infrastructure.
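A pre-authorised contact tree can be kept as data and checked automatically, so a missing alternate is caught before a drill rather than during one. This is a minimal sketch under assumed names: the roles and contact identifiers are placeholders, and the two-contact minimum mirrors the rule described above.

```python
from dataclasses import dataclass


@dataclass
class Role:
    name: str
    contacts: list  # ordered: primary first, then named alternates


def resolve_contact(role, unavailable):
    """Return the first reachable contact for a role, or None if nobody is."""
    for person in role.contacts:
        if person not in unavailable:
            return person
    return None


def roles_missing_alternates(roles, minimum_contacts=2):
    """List roles that have fewer distinct contacts than the required minimum."""
    return [r.name for r in roles if len(set(r.contacts)) < minimum_contacts]


# Hypothetical roles and placeholder contact identifiers.
ROLES = [
    Role("DR coordinator", ["coordinator-primary", "coordinator-alternate"]),
    Role("Failover approver", ["approver-primary", "approver-alternate"]),
    Role("Database lead", ["dba-primary"]),
]

if __name__ == "__main__":
    print(resolve_contact(ROLES[0], unavailable={"coordinator-primary"}))
    print(roles_missing_alternates(ROLES))
```

Running the second check on a schedule, against the live staff directory, is one way to stop the tree from going stale as people change roles.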

RPO is the conversation that most teams avoid

RTO gets the attention because it is easy to measure. RPO is harder because it forces a business decision that most organisations would rather defer.

How much data are you willing to lose? The technical answer is whatever your replication lag is. The business answer involves regulatory obligations, customer commitments, revenue exposure, and risk appetite. Getting those two answers to align is political work. It requires the right people in the room and a willingness to quantify the cost of different loss scenarios.

At the bank, I had to bring this conversation to the CRO directly. We mapped the gap between what our current replication architecture could guarantee and what Bangladesh Bank's ICT guidelines required. Once that gap was visible in those terms, the investment decision was not difficult. We got RPO under one hour. That required infrastructure work and additional spend. The business case was made in about twenty minutes.
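The gap itself is simple arithmetic: worst observed replication lag minus the RPO target. The sketch below shows one way to compute it per system. The one-hour target comes from the figure above; the system names and lag samples are hypothetical, and a system with no measurements is flagged rather than assumed healthy.

```python
from datetime import timedelta

RPO_TARGET = timedelta(hours=1)  # "under one hour", so lag must stay below this


def rpo_gaps(lag_samples, target=RPO_TARGET):
    """Map each non-compliant system to its shortfall against the RPO target.

    A value of None means no lag measurements exist, so no RPO can be claimed.
    Systems whose worst observed lag is below the target are omitted.
    """
    gaps = {}
    for system, samples in lag_samples.items():
        if not samples:
            gaps[system] = None
            continue
        worst = max(samples)
        if worst >= target:
            gaps[system] = worst - target
    return gaps


# Hypothetical replication lag observations, invented for the example.
LAG_SAMPLES = {
    "core-banking": [timedelta(minutes=m) for m in (4, 12, 9)],
    "customer-portal": [timedelta(minutes=m) for m in (35, 95, 50)],
    "reporting": [],
}

if __name__ == "__main__":
    for system, gap in rpo_gaps(LAG_SAMPLES).items():
        print(system, "unverified" if gap is None else f"over target by {gap}")
```

Using the worst observed lag rather than the average is a deliberate choice: an RPO is a guarantee, and the average says nothing about the bad night.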

DR drills are only useful if they are uncomfortable

Most DR drills are theatre. You notify the relevant people in advance, run through a prepared sequence, and declare success. A real incident never comes with advance notice, a clear sequence, everyone available, and an environment exactly as documented.

We ran two types of drills. The first was the planned variety, useful for testing updated runbooks and training new team members. The second type was announced with four hours' notice, on a day we had not pre-signalled, with a complication built in: the primary DR coordinator was marked as "unavailable" and could not be used. These drills were harder. They were also more honest.

The Veeam 2024 data shows only 13% of organisations can successfully orchestrate recovery during an actual DR situation. That number is low, and it reflects the gap between drill success and real-world capability. Planned, comfortable drills measure the former. Short-notice, constrained drills get closer to the latter.

Understanding recovery tiers: a practical framework

Not all systems need the same recovery targets. One of the most useful structural decisions we made was formally classifying applications into recovery tiers and setting RTO and RPO targets accordingly. This focused investment where it mattered and gave the team clear priorities during an actual event.

Recovery Tier	RTO Target	RPO Target	Typical Systems
Tier 1: Mission critical	Under 1 hour	Near-zero (minutes)	Core banking, payment processing, fraud systems
Tier 2: Business critical	1 to 4 hours	Under 1 hour	Customer portals, branch systems, CRM
Tier 3: Important	4 to 24 hours	Under 4 hours	Reporting, internal portals, HR systems
Tier 4: Non-critical	24 to 72 hours	Under 24 hours	Archives, development environments, test systems

Tier classifications should reflect what the business actually cannot function without, not what IT considers technically important. The distinction matters. Application owners will often argue for Tier 1 status. The right question to ask them is: what is the business impact of this system being unavailable for four hours? That anchors the conversation in operational reality rather than technical preference.
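Encoding the tier table as data makes it possible to score every drill result against the right targets automatically. The following is a sketch under stated assumptions: the upper bounds are taken from the table above and treated as inclusive ceilings, and the five-minute figure for Tier 1 RPO is an assumed stand-in for "near-zero (minutes)".

```python
from datetime import timedelta
from typing import NamedTuple


class TierTargets(NamedTuple):
    label: str
    rto: timedelta  # upper bound on recovery time
    rpo: timedelta  # upper bound on data loss window


TIERS = {
    # Tier 1 RPO of 5 minutes is an assumption standing in for "near-zero".
    1: TierTargets("Mission critical", timedelta(hours=1), timedelta(minutes=5)),
    2: TierTargets("Business critical", timedelta(hours=4), timedelta(hours=1)),
    3: TierTargets("Important", timedelta(hours=24), timedelta(hours=4)),
    4: TierTargets("Non-critical", timedelta(hours=72), timedelta(hours=24)),
}


def check_drill(tier, measured_rto, measured_rpo):
    """Compare measured drill results with the targets for the given tier."""
    targets = TIERS[tier]
    return {
        "rto_met": measured_rto <= targets.rto,
        "rpo_met": measured_rpo <= targets.rpo,
    }


if __name__ == "__main__":
    # A 45-minute recovery with 2 minutes of data loss, scored as Tier 1.
    print(check_drill(1, timedelta(minutes=45), timedelta(minutes=2)))
```

Keeping the targets in one place also means a tier reclassification is a one-line change that every subsequent drill report picks up.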

At scale, manual anything is a liability

A lot of DR guidance is written for organisations with dozens of systems. The bank had 600-plus servers and over 6,500 managed assets. At that scale, manual state verification during a recovery event is too slow and too error-prone.

We automated the discovery and state-verification steps of the failover process using ManageEngine. This meant the team could verify replication health and DR environment readiness in minutes rather than hours. The initial asset discovery also surfaced configuration drift we had not been aware of. Addressing that drift, before an event rather than during one, was one of the more valuable side effects of the automation project.
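Whatever tool performs the discovery, the drift check reduces to comparing the documented state of an asset with its observed state. The sketch below is tool-agnostic and does not use any ManageEngine API; the attribute names and values are hypothetical, and both inputs are assumed to arrive as flat dictionaries.

```python
MISSING = "<missing>"  # marker for an attribute present on only one side


def find_drift(documented, observed):
    """Return {attribute: (documented_value, observed_value)} for every mismatch."""
    drift = {}
    for key in sorted(documented.keys() | observed.keys()):
        expected = documented.get(key, MISSING)
        actual = observed.get(key, MISSING)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


# Hypothetical asset attributes, invented for the example.
DOCUMENTED = {"os_version": "8.6", "db_version": "19.3", "replication": "enabled"}
OBSERVED = {"os_version": "8.6", "db_version": "19.5", "firewall_profile": "legacy"}

if __name__ == "__main__":
    for attribute, (expected, actual) in find_drift(DOCUMENTED, OBSERVED).items():
        print(f"{attribute}: documented={expected} observed={actual}")
```

Attributes that appear on only one side are reported too, since an undocumented setting on a DR host is as much a risk as a changed one.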

The flood was the real test

In 2024, flooding affected one of our primary sites. The team executed the failover runbook under conditions that were genuinely stressful and time-pressured. We hit the RTO target. There was no data loss.

I would not recommend a natural disaster as a training methodology. But that activation validated everything the redesign had been built around. The team knew the process worked not because we had run drills, but because they had done it for real under pressure.

Where to start if your RTO is still measured in hours

Start with the dependency mapping exercise. Do it in person, with the people who actually operate the systems, not just the people who wrote the original documentation. The gaps it surfaces will tell you exactly where to focus.

If you find that most of your recovery time is people waiting for information rather than systems rebuilding, fix that first. It costs nothing in infrastructure budget and can significantly reduce your effective RTO before you spend a dollar on technology.

References

Veeam Software, Data Protection Trends Report 2024 (survey of 1,200 IT leaders, January 2024)
Uptime Institute, 2024 Annual Outage Analysis
IT Tool Kit, RTO vs RPO: Complete Guide to Recovery Objectives 2025