Chaos engineering is the practice of deliberately putting a system through a series of failure tests to build up its resilience to turbulence or unexpected conditions. You can think of it as a stress test that shows how much wear and tear your system can take.
It brings to mind how astronauts train for a trip to space, where a lot of things can go wrong. Astronauts in training are subjected to harsh turbulence simulations to ensure they can withstand the shaking and pressure of liftoff and respond to emergencies such as an unexpected meteor shower.
What Is Chaos Engineering For?
Chaos engineering helps ensure that software and hardware continue to work despite the hiccups they may encounter along the way. That’s especially important today, when companies can’t function without computers. Component failures can disrupt operations and halt production.
Chaos engineering can reduce the occurrence of infrastructure, network, and application failures.
Common Chaos Engineering Techniques
To address the failures mentioned above, organizations employ chaos engineering tactics that include:
- Server shutdown: This technique involves turning off servers one by one to see how the system holds up.
- Latency injection: This tactic deliberately delays communication between components, mimicking what happens when power or connection speed fluctuates, to determine whether the system will hold its own.
- Resource exhaustion: This approach deliberately uses up the system’s resources, such as CPU, memory, or disk space, to check that the system will not shut down entirely but instead keep working until the process is completed.
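As a rough illustration of the server shutdown technique, the Python sketch below models a tiny in-memory cluster and turns its servers off one by one, checking after each shutdown whether a request still gets served. The `Server`, `Cluster`, and `shutdown_experiment` names are hypothetical and exist only for this example; a real experiment would target actual hosts through whatever tooling the organization uses.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Server:
    name: str
    up: bool = True

    def handle(self, request: str) -> str:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


@dataclass
class Cluster:
    servers: List[Server]

    def handle(self, request: str) -> str:
        # Try each server in order and fail over to the next one on error.
        for server in self.servers:
            try:
                return server.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy server left")


def shutdown_experiment(cluster: Cluster) -> List[Tuple[str, bool]]:
    """Turn servers off one at a time and record whether requests still succeed."""
    results = []
    for server in cluster.servers:
        server.up = False  # shutdowns accumulate; servers are not restarted
        try:
            cluster.handle("GET /health")
            survived = True
        except RuntimeError:
            survived = False
        results.append((server.name, survived))
    return results


cluster = Cluster([Server("a"), Server("b"), Server("c")])
report = shutdown_experiment(cluster)
```

In this toy setup, `report` shows that requests keep succeeding while at least one server is up and fail only once the last server has been turned off.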
Steps Involved in Chaos Engineering
Identifying system weaknesses effectively requires performing a number of steps, namely:
- Define the system’s steady state (how it behaves when it works as it should). Tie this state to a measurable output, such as the system turning out the desired number of outputs, that indicates it’s behaving normally.
- Set up two systems (a control system and an experimental one) and hypothesize that both will continue working at the identified steady state.
- Simulate unexpected events that reflect real-world conditions on the experimental system. Examples would be a server crash, malfunctioning hard drives, and a disrupted network connection.
- Try to disprove the hypothesis by looking for differences in steady state between the control and experimental systems.
The fewer differences you spot, the better. Few differences mean your system withstood the failures you simulated. Note that all the problems found should be addressed to improve the system’s resilience.
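The steps above can be sketched in Python as a small simulation. This is only an illustration under stated assumptions: the steady-state metric is taken to be the share of successful requests, the "service" is a hypothetical function that retries a request a fixed number of times, and names such as `run_experiment`, `fault_rate`, and `tolerance` are invented for the example.

```python
import random


def handle_request(fault_rate: float, attempts: int, rng: random.Random) -> bool:
    """Hypothetical service call: succeeds if any attempt avoids the injected fault."""
    return any(rng.random() >= fault_rate for _ in range(attempts))


def success_ratio(requests: int, fault_rate: float, attempts: int,
                  rng: random.Random) -> float:
    """Steady-state metric assumed for this sketch: share of successful requests."""
    ok = sum(handle_request(fault_rate, attempts, rng) for _ in range(requests))
    return ok / requests


def run_experiment(fault_rate: float, attempts: int, tolerance: float = 0.02,
                   requests: int = 10_000, seed: int = 7) -> dict:
    rng = random.Random(seed)
    # Control system: no faults injected, so it stays at the steady state.
    control = success_ratio(requests, 0.0, attempts, rng)
    # Experimental system: the same service with faults injected.
    experimental = success_ratio(requests, fault_rate, attempts, rng)
    # Try to disprove the hypothesis by comparing the two systems.
    gap = control - experimental
    return {
        "control": control,
        "experimental": experimental,
        "hypothesis_holds": gap <= tolerance,
    }


with_retries = run_experiment(fault_rate=0.1, attempts=3)
without_retries = run_experiment(fault_rate=0.1, attempts=1)
```

With three attempts per request, the simulated faults are almost entirely absorbed and the hypothesis holds; with a single attempt, roughly one request in ten fails and the experiment exposes the weakness.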
Popular Chaos Engineering Aids
Fortunately for organizations that want to test the resilience of their infrastructure, network, and applications, others have already paved the way. Various software tools are now available to make chaos engineering more manageable, including the ones below.
As the name suggests, Latency Monkey creates artificial delays in client-to-server communication. That simulates service degradation or downtime so the team can check that upstream services continue to respond properly. The delays are useful for testing the tolerance and resilience of new features and services without making their dependencies unavailable to the rest of the system.
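To make the idea of an artificial delay concrete, here is a minimal Python sketch. It is not how Latency Monkey itself is implemented; the `inject_latency` decorator, the `fetch_profile` function, and the fallback logic are all hypothetical names made up for this example.

```python
import functools
import random
import time


def inject_latency(delay_seconds: float, probability: float = 1.0, rng=None):
    """Decorator that sleeps before the wrapped call with the given probability."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                time.sleep(delay_seconds)  # the injected delay
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_seconds=0.05)
def fetch_profile(user_id: int) -> dict:
    """Hypothetical downstream call; stands in for a real client request."""
    return {"id": user_id, "name": "example"}


def profile_or_fallback(user_id: int, budget_seconds: float) -> dict:
    """Caller-side check: serve a fallback when the call exceeds its time budget.

    Note: this measures elapsed time after the call returns. It is not a
    true timeout and does not cancel a slow call.
    """
    start = time.monotonic()
    result = fetch_profile(user_id)
    elapsed = time.monotonic() - start
    if elapsed > budget_seconds:
        return {"id": user_id, "name": "fallback", "degraded": True}
    return result


slow = profile_or_fallback(1, budget_seconds=0.01)
fast_enough = profile_or_fallback(1, budget_seconds=5.0)
```

With a 50-millisecond injected delay, the call misses a 10-millisecond budget and the caller serves the fallback, while a generous budget lets the normal response through.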
Conformity Monkey automatically finds instances in a system that don’t conform to the engineering team’s best practices. It flags issues before they cause problems, allowing the team to fix what’s wrong.
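A best-practice check of this kind can be pictured as a set of rules applied to every instance. The Python sketch below is an assumption-laden illustration, not the actual tool: the `Instance` fields and the two rules in `RULES` are hypothetical examples of what a team might consider best practice.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Instance:
    instance_id: str
    in_autoscaling_group: bool
    has_health_check: bool


# Hypothetical best-practice rules; a real team would define its own.
RULES: Dict[str, Callable[[Instance], bool]] = {
    "belongs to an auto-scaling group": lambda i: i.in_autoscaling_group,
    "has a health check": lambda i: i.has_health_check,
}


def find_nonconforming(instances: List[Instance]) -> Dict[str, List[str]]:
    """Map each nonconforming instance ID to the names of the rules it breaks."""
    report = {}
    for instance in instances:
        broken = [name for name, rule in RULES.items() if not rule(instance)]
        if broken:
            report[instance.instance_id] = broken
    return report


fleet = [
    Instance("i-001", in_autoscaling_group=True, has_health_check=True),
    Instance("i-002", in_autoscaling_group=False, has_health_check=True),
    Instance("i-003", in_autoscaling_group=False, has_health_check=False),
]
violations = find_nonconforming(fleet)
```

Only the instances that break at least one rule appear in `violations`, each paired with the rules it fails, which gives the team a concrete list of what to fix.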
As its name makes clear, Janitor Monkey is supposed to clean up the system. It finds “trash” (unused resources) in a user’s cloud environment and deletes it.
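The cleanup idea can be sketched in a few lines of Python. The definition of "trash" used here (a resource that is unattached and has been idle for more than 30 days) is an assumption made for the example, as are the `Resource` fields and the `find_trash` function; the real tool’s criteria may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Resource:
    resource_id: str
    attached: bool
    last_used: datetime


def find_trash(resources: List[Resource], now: datetime,
               max_idle: timedelta = timedelta(days=30)) -> List[str]:
    """Return IDs of resources that are unattached and idle longer than max_idle."""
    return [
        r.resource_id
        for r in resources
        if not r.attached and now - r.last_used > max_idle
    ]


now = datetime(2024, 6, 1)
inventory = [
    Resource("vol-1", attached=True, last_used=datetime(2024, 1, 1)),
    Resource("vol-2", attached=False, last_used=datetime(2024, 1, 1)),
    Resource("vol-3", attached=False, last_used=datetime(2024, 5, 25)),
]
trash = find_trash(inventory, now)
```

Listing candidates before deleting anything, as this sketch does, leaves room for a review step so that a resource flagged by mistake is not removed.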
You may have noticed that the tools’ names are pretty similar. That’s because the above-mentioned tools were created and are used by a single company: Netflix. The organization is a well-known adopter of chaos engineering because its streaming service needs to be always on. It needs to ensure that even if one of its systems fails, its service won’t get interrupted and leave subscribers angry.
Organizations that rely on computers for their operations (practically any kind of business today) and those that need to keep their services available at all times (streaming services and utility providers, for example) can benefit from chaos engineering.