How to Conduct a Root Cause Analysis

Sadha Moodley
2 min read · Oct 16, 2023
Image source: https://upskillnation.com/root-cause-analysis/

A Root Cause Analysis (RCA) is the process of identifying the underlying cause of a problem. The idea is to find the source of the problem so that interventions address it rather than its symptoms. For example, a service might be reaching capacity because a downstream service it depends on is slow. Adding more servers addresses the symptom. That can be a good short-term intervention, but the longer-term plan should focus on improving the speed of the dependency or making your service resilient to downstream slowdowns.

The root cause analysis should ideally be run as a session in which stakeholders who understand the system walk through the events together. It starts with documenting everything that happened and the steps taken to restore service, followed by a deep dive into what went wrong and why. The principles below help ensure that the root cause analysis is effective:

Important Principles

  • Identify initiatives that can solve the root cause.
  • Understand why something happened.
  • Document the response as well, so that we can see whether it was optimal and whether the team needs to improve how it responds to incidents.
  • Ensure that the RCA is blameless. The focus should be on the problem and why it occurred, not on people, so that everyone is willing to commit to the RCA process.
  • Document the issue in as much detail as possible so that the record can be used to design interventions that prevent the problem from recurring.

The 5 Whys Technique

The 5 Whys method is an approach where you document the problem and then ask why it occurred. At each step you check whether the answer is the actual root cause. If it is not, you ask why again, and repeat until you reach the root cause. Five is a guideline rather than a fixed count: stop when the answer is something you can act on to prevent the problem.

Image source: https://cx-journey.com
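To make the loop described above concrete, here is a minimal Python sketch of how a 5 Whys chain could be recorded during an RCA session. The `FiveWhys` class and its method names are hypothetical, and the answers in the example are invented to match the capacity scenario from the introduction. Deciding whether an answer is the root cause remains a judgment call for the people in the session; the code only records that decision.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FiveWhys:
    """Records the chain of 'why?' answers for one problem statement."""

    problem: str
    answers: list[str] = field(default_factory=list)
    root_cause_found: bool = False

    def ask_why(self, answer: str, is_root_cause: bool = False) -> None:
        """Record the answer to the next 'why?'.

        is_root_cause is the team's judgment, not something computed here.
        """
        if self.root_cause_found:
            raise ValueError("root cause already identified; stop asking why")
        self.answers.append(answer)
        self.root_cause_found = is_root_cause

    @property
    def root_cause(self) -> Optional[str]:
        """The last answer if the team marked it as the root cause, else None."""
        return self.answers[-1] if self.root_cause_found else None

    def render(self) -> str:
        """Format the chain as indented text for the RCA document."""
        lines = [f"Problem: {self.problem}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"{'  ' * depth}Why #{depth}: {answer}")
        if self.root_cause_found:
            lines.append(f"Root cause: {self.root_cause}")
        return "\n".join(lines)


# Illustrative (invented) answers for the capacity example.
rca = FiveWhys("The service reached capacity")
rca.ask_why("Requests were held open waiting on a downstream service")
rca.ask_why("The downstream service was responding slowly")
rca.ask_why("Calls to it had no timeout or circuit breaker", is_root_cause=True)
print(rca.render())
```

Keeping the intermediate answers, not just the final one, is useful because each level often suggests its own intervention.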

Document Planned Interventions

After finding the root cause, we should document the interventions we could implement to prevent the issue from happening again. This should include both short-term interventions (such as adding more servers) and long-term interventions (such as implementing circuit breakers or fixing dependencies).
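As an illustration of the circuit breaker intervention mentioned above, here is a minimal, single-threaded Python sketch. The class name, the thresholds, and the injectable `clock` parameter are all assumptions made for this example. A production service would more likely use an existing resilience library and would need to consider thread safety.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the downstream call is skipped."""


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable so the logic can be tested
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast instead of tying up capacity on a slow dependency.
                raise CircuitOpenError("downstream call skipped: circuit is open")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a hypothetical client function, and handle `CircuitOpenError` by returning a fallback or a fast error.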
