Mean time to recovery (MTTR) is the average time a system takes to recover fully from a failure. It covers the entire outage, from the moment the failure occurs until the system is fully operational again, including the time spent on testing, repair, restoration, and resolution. MTTR varies from system to system.
Imagine a person who hurt his ankle. In his case, the recovery time runs from the moment he injured his ankle until it heals fully and he can walk again without feeling any pain. MTTR is the average of such recovery times over many incidents.
How Do You Calculate Mean Time to Recovery?
To find a system’s MTTR, add up the downtime of all incidents in the period you are measuring, then divide the total by the number of incidents. Record each incident separately so that its downtime is not mixed with that of the others.
Let us say two incidents occurred within a day, and the system was down for 1 hour in total before repairs began. Divide 1 hour by 2 to get the average per incident, which is 30 minutes. Suppose each incident then needed a 15-minute reboot before the system was running fully again. That time counts as downtime, too, so add 15 minutes to 30 minutes. The system’s MTTR is 45 minutes.
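The calculation above can be sketched in Python. This is only a minimal illustration: it assumes each incident is logged as a pair of timestamps (when the failure began and when the system was fully running again), and the function name mean_time_to_recovery and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

def mean_time_to_recovery(incidents):
    """Average time from failure to full recovery.

    `incidents` is a list of (failed_at, recovered_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the downtime of every incident, then divide by the incident count.
    total = sum((recovered - failed for failed, recovered in incidents), timedelta())
    return total / len(incidents)

# Hypothetical data: two incidents in one day, each 45 minutes from
# failure to full recovery (30 minutes of outage plus a 15-minute reboot).
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 45)),
    (datetime(2024, 1, 1, 15, 0), datetime(2024, 1, 1, 15, 45)),
]
mttr = mean_time_to_recovery(incidents)
print(mttr)  # 0:45:00
```

Keeping one record per incident is what makes the division meaningful. If overlapping outages were merged into a single record, the average would be skewed.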
Does Calculating Mean Time to Recovery Have Limitations?
MTTR measures the speed of a full recovery process. It tells you whether recovery is as fast as you want it to be. If you are a product developer, you can also use it to compare your product with those of competitors. That should give you an idea of how resilient your software is to downtime, for instance.
MTTR is, however, a high-level metric. It tells you that something in your recovery process is slow, but not where the problem lies. To find that out, you will need more data. And that is not surprising since a lot of things occur between system failure and recovery.
MTTR can be affected by a device’s alert system. There may be a delay between a failure and when an alert is issued. Sometimes, warnings may take longer to get to the right person.
Diagnostics can affect MTTR, too. The longer it takes to determine what the problem is, the longer the recovery takes. Note down the diagnostic processes that require improvement.
Repairs are another factor. Is your maintenance team as effective as it could be? If it is not, what could be causing the delay?
You will, of course, need to look beyond MTTR to answer those questions. But MTTR can provide a good starting point.
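One way to look beyond the single number is to split each incident into the phases discussed above (alerting, diagnosis, and repair) and average each phase separately. The sketch below assumes hypothetical timestamp fields (failed_at, alerted_at, diagnosed_at, recovered_at) that your incident log may or may not record. It is an illustration, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    failed_at: datetime     # when the failure actually began
    alerted_at: datetime    # when the right person was notified
    diagnosed_at: datetime  # when the cause was identified
    recovered_at: datetime  # when the system was fully running again

def phase_averages(incidents):
    """Return the average duration of each recovery phase."""
    n = len(incidents)
    zero = timedelta()
    return {
        "alerting": sum((i.alerted_at - i.failed_at for i in incidents), zero) / n,
        "diagnosis": sum((i.diagnosed_at - i.alerted_at for i in incidents), zero) / n,
        "repair": sum((i.recovered_at - i.diagnosed_at for i in incidents), zero) / n,
    }

# Hypothetical incident log with two entries.
log = [
    Incident(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 10),
             datetime(2024, 1, 1, 9, 30), datetime(2024, 1, 1, 9, 45)),
    Incident(datetime(2024, 1, 1, 15, 0), datetime(2024, 1, 1, 15, 20),
             datetime(2024, 1, 1, 15, 35), datetime(2024, 1, 1, 15, 45)),
]
averages = phase_averages(log)
slowest = max(averages, key=averages.get)  # phase with the largest average
for phase, duration in averages.items():
    print(phase, duration)
print("slowest phase:", slowest)
```

The three phase averages add up to the overall MTTR, so the largest one shows where improvement would pay off most.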
What Makes Up Mean Time to Recovery?
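MTTR is not a single metric. The acronym can stand for several related measurements, including mean time to repair, mean time to restore, and mean time to resolve, which differ in where the clock starts and stops. Mean time between failures (MTBF) is usually tracked alongside them, but it measures the uptime between failures rather than recovery time, so it is not added to the downtime.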
Mean Time between Failures
Mean time between failures (MTBF) refers to the average amount of time between repairable failures. It tracks the availability and reliability of a product. The higher the MTBF, the more reliable a system is.
As such, most vendors strive to keep the MTBF as high as possible. For some hardware components, vendors quote MTBF figures of hundreds of thousands of hours.
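MTBF is commonly computed as total operating time divided by the number of failures in that time. The sketch below assumes that definition. The function name and the sample figures are hypothetical.

```python
def mean_time_between_failures(operating_hours, failure_count):
    """Average operating hours between repairable failures."""
    if failure_count <= 0:
        raise ValueError("failure_count must be positive")
    return operating_hours / failure_count

# Hypothetical example: 8,760 operating hours (about one year) with 4 failures.
mtbf_hours = mean_time_between_failures(operating_hours=8_760, failure_count=4)
print(mtbf_hours)  # 2190.0
```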
Mean Time to Repair
Mean time to repair refers to the average amount of time it takes to repair a system technically or mechanically. It includes the repair and any testing time. You should not stop counting time until the system functions fully again.
Mean Time to Restore
Mean time to restore refers to the average time it takes to recover from a device failure. It covers the whole outage, starting from the moment the failure occurs.
Mean Time to Resolve
Mean time to resolve refers to the average amount of time it takes to resolve a failure fully. It includes the time spent detecting the failure, diagnosing the problem, repairing the issue, and ensuring the problem will not happen again.
Because it includes preventing a recurrence, this metric pushes whoever fixes the problem to improve system performance in the long term. It has a lot to do with customer satisfaction, so vendors should pay attention to it.
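The difference between the three time-to-recover measurements comes down to where the clock starts and stops. The sketch below shows one possible reading for a single incident. The timestamp names and the exact boundaries are assumptions made for illustration, and teams often define them differently.

```python
from datetime import datetime

# Hypothetical timeline of one incident.
failed_at = datetime(2024, 1, 1, 9, 0)           # outage begins
repair_started_at = datetime(2024, 1, 1, 9, 20)  # repair work begins
restored_at = datetime(2024, 1, 1, 9, 45)        # system fully working again
resolved_at = datetime(2024, 1, 2, 9, 0)         # recurrence-prevention fix done

# Assumed boundaries for each measurement:
time_to_repair = restored_at - repair_started_at  # repair and testing only
time_to_restore = restored_at - failed_at         # the whole outage
time_to_resolve = resolved_at - failed_at         # outage plus prevention work

print("repair: ", time_to_repair)
print("restore:", time_to_restore)
print("resolve:", time_to_resolve)
```

Each value here is for one incident. The corresponding mean is the average of that value over all incidents in the period.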
Staying always on is a must. Glitches and downtime come with real consequences. They can result in missed deadlines, late payments, and project delays, which affect organizations’ bottom line.