Mean time to recovery (MTTR) is one of the four DORA metrics.
It measures how quickly you restore service after a production failure — the metric that answers "when something breaks, how fast are we back?"
What is MTTR?
MTTR stands for mean time to recovery: the average time it takes to restore service after a failure in production. Together with change failure rate, it forms the stability half of the DORA metrics; deployment frequency and lead time for changes form the throughput half.
The clock starts when a failure begins affecting users and stops when normal service is restored. The "mean" is taken across incidents over your measurement window.
MTTR benchmarks
From the 2024 State of DevOps Report:
- Elite: Less than one hour
- High: Less than one day
- Medium: One day to one week
- Low: More than one week
Notice that throughput and stability are not in tension. Elite teams deploy more often and recover faster — because small, frequent changes are easier to diagnose and roll back than large, infrequent ones.
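As a rough illustration, the benchmark bands listed above can be turned into a small lookup. This is only a sketch: the function name `mttr_tier` is made up for this example, and the boundaries simply follow the wording of the list (exactly one hour is not "less than one hour", so it lands in High).

```python
from datetime import timedelta


def mttr_tier(mttr: timedelta) -> str:
    """Map an MTTR value to the benchmark band it falls in.

    Hypothetical helper; the thresholds mirror the list above.
    """
    if mttr < timedelta(hours=1):
        return "Elite"   # less than one hour
    if mttr < timedelta(days=1):
        return "High"    # less than one day
    if mttr <= timedelta(weeks=1):
        return "Medium"  # one day to one week
    return "Low"         # more than one week


if __name__ == "__main__":
    for sample in (timedelta(minutes=45), timedelta(hours=6), timedelta(days=3)):
        print(sample, "->", mttr_tier(sample))
```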
A note on the name
"MTTR" is ambiguous in the wider industry — it gets expanded as mean time to recovery, repair, respond, or resolve, which are genuinely different intervals. DORA uses recovery: time to restore service, not time to ship a permanent fix. Pick one definition and hold it constant, or your trend line is meaningless.
The MTTR formula
MTTR is a simple average. Over your measurement window, take the total time your service spent in a failed state and divide it by the number of incidents:
MTTR = total downtime ÷ number of incidents
If you had four incidents in a month totalling 6 hours of degraded service, your MTTR is 6 ÷ 4 = 1.5 hours. The clock for each incident starts when the failure begins affecting users and stops when normal service is restored, so the quality of your measurement depends on recording both timestamps accurately, and the start is the harder of the two. Teams that reconstruct incident start times from memory after the fact tend to understate MTTR, because the time they remember is usually when someone noticed the failure, not when users were first affected.
What shortens MTTR
- Fast detection. You can't recover from what you haven't noticed. The largest chunk of MTTR is often the gap between failure and someone realizing it.
- Clear ownership. An incident with no obvious owner sits while people figure out whose problem it is.
- Easy rollback. If reverting is a one-click operation, recovery is fast. If it's a manual scramble, it isn't.
- Context at hand. The commit, the deploy, and the failing check in one place beats hunting across four dashboards.
How automated detection cuts MTTR
The detection gap is where an engineering intelligence platform helps most. When CI fails on main or a deployment fails, Deviera opens a structured incident immediately — tagged to the commit author, with the failing run attached — and posts it to Slack. There's no waiting for someone to notice. When the next build goes green, it auto-resolves the incident and records the recovery time, so MTTR is measured automatically rather than reconstructed after the fact.
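The open-on-red, resolve-on-green idea described above can be sketched generically. This is not Deviera's implementation or API: `Run`, `Incident`, and `incidents_from_runs` are hypothetical names, and the sketch assumes that a failing run on main is an acceptable proxy for the start of an incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Run:
    finished_at: datetime
    passed: bool
    commit_author: str


@dataclass
class Incident:
    opened_at: datetime
    owner: str  # author of the commit whose run first failed
    resolved_at: Optional[datetime] = None

    @property
    def recovery_time(self) -> Optional[timedelta]:
        """Time from open to resolve, or None while still open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.opened_at


def incidents_from_runs(runs: list[Run]) -> list[Incident]:
    """Open an incident on the first failing run, close it on the next green run."""
    incidents: list[Incident] = []
    current: Optional[Incident] = None
    for run in sorted(runs, key=lambda r: r.finished_at):
        if not run.passed and current is None:
            current = Incident(opened_at=run.finished_at, owner=run.commit_author)
            incidents.append(current)
        elif run.passed and current is not None:
            current.resolved_at = run.finished_at
            current = None
    return incidents
```

Consecutive failing runs are folded into one incident here, so a string of red builds counts as a single outage rather than several short ones. An incident that is still open at the end of the window has no recovery time yet and would need to be excluded from, or handled separately in, the average.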
For a step-by-step plan to move MTTR alongside the other three DORA keys, see how to reduce MTTR and our 4-week DORA roadmap.
