Chaos Testing: Finding Reliability Gaps Before Users Do

January 19, 2026 · 3 min read · Testing Guides

Chaos testing deliberately breaks things to see how your app handles breakage. Network drops, service failures, corrupted data, latency spikes, memory pressure. If the app degrades gracefully under stress, it is reliable. If it cascades, you have a production incident waiting.

What chaos tests

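Chaos testing is different from stress testing, which pushes the system to maximum load. Chaos testing injects arbitrary failures such as network drops, service failures, corrupted data, latency spikes, and memory pressure.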

The question is not "does the system handle X?" but "does the system fail in a way that surprises us?"

Why it matters

Production incidents usually come from combinations of conditions: your service is healthy, a dependency is flaky, and the retry logic has a bug. Chaos testing surfaces those combinations.
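To make the "flaky dependency plus retry logic" combination concrete, here is a minimal sketch of application-level fault injection. Every name in it (make_flaky, call_with_retry, fetch_profile, run_experiment) is hypothetical and chosen for illustration; it is not taken from any particular chaos tool. It wraps a dependency so a fraction of calls fail, then measures how often the caller still fails after retrying.

```python
import random


class DependencyError(Exception):
    """Raised when the injected fault fires."""


def make_flaky(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise DependencyError."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyError("injected failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, max_attempts):
    """Call func up to max_attempts times; re-raise the last error if all fail."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return func()
        except DependencyError as err:
            last_error = err
    raise last_error


def fetch_profile():
    """Stand-in for a real dependency call (hypothetical)."""
    return {"user": "demo"}


def run_experiment(failure_rate, max_attempts, calls=1000, seed=42):
    """Return the fraction of calls that still fail after retries."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    flaky = make_flaky(fetch_profile, failure_rate, rng)
    failures = 0
    for _ in range(calls):
        try:
            call_with_retry(flaky, max_attempts)
        except DependencyError:
            failures += 1
    return failures / calls
```

Comparing run_experiment(0.5, 1) with run_experiment(0.5, 3) shows how much the retry policy hides a flaky dependency, and raising failure_rate toward 1.0 shows the point at which retries stop helping.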

Principles

1. Production is the best test environment

Chaos on staging is useful; chaos on production (carefully) is revealing. Netflix started there.

2. Start small, increase blast radius

Run the first experiment against one pod in one availability zone (AZ). If the system holds, move to two pods, and keep widening from there.
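The ramp described above can be sketched as a small planner plus a loop that stops widening as soon as a health check fails. This is an illustrative sketch only: plan_ramp, run_ramp, the doubling schedule, and the 25% cap are assumptions, and inject and is_healthy stand in for whatever your chaos tool and monitoring actually provide.

```python
def plan_ramp(total_pods, max_fraction=0.25):
    """Return pod counts to target in order: 1, 2, 4, ... up to a capped share."""
    # Never plan to hit more than max_fraction of the fleet (assumed cap),
    # but always allow at least the single-pod starting step.
    limit = max(1, int(total_pods * max_fraction))
    steps = []
    count = 1
    while count <= limit:
        steps.append(count)
        count *= 2
    return steps


def run_ramp(steps, inject, is_healthy):
    """Inject each step in turn; stop at the first step that leaves the system unhealthy."""
    completed = []
    for count in steps:
        inject(count)          # hypothetical hook into your chaos tool
        if not is_healthy():   # hypothetical hook into your monitoring
            break
        completed.append(count)
    return completed
```

For a 40-pod fleet, plan_ramp(40) yields the steps 1, 2, 4, 8. The list returned by run_ramp records how far the blast radius grew before the system stopped coping.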

3. Have a game day

Set aside dedicated time for engineers to run chaos scenarios and observe the results. It teaches the team how the system really behaves.

4. Always have a stop button

One command kills all chaos. No "we'll wait it out" in the middle of a bad run.
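One way to get a single stop button is to make every experiment register a rollback when it starts, so that one call can revert all of them. The sketch below assumes that design; ChaosController and its methods are hypothetical names, and the rollbacks are plain callables standing in for real cleanup such as removing a network rule.

```python
import threading


class ChaosController:
    """Tracks active experiments and reverts all of them on stop_all()."""

    def __init__(self):
        self._stop = threading.Event()
        self._lock = threading.Lock()
        self._rollbacks = []

    @property
    def stopped(self):
        return self._stop.is_set()

    def register(self, name, rollback):
        """Record an experiment's rollback; refuse new experiments once stopped."""
        with self._lock:
            if self._stop.is_set():
                raise RuntimeError("chaos is stopped; refusing to start " + name)
            self._rollbacks.append((name, rollback))

    def stop_all(self):
        """Block new experiments, then revert active ones, newest first."""
        self._stop.set()
        with self._lock:
            pending, self._rollbacks = self._rollbacks, []
        reverted = []
        for name, rollback in reversed(pending):
            rollback()
            reverted.append(name)
        return reverted
```

Reverting newest first mirrors how layered changes are usually undone, and setting the stop flag before rolling back keeps a new experiment from starting in the middle of the cleanup.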

Tools

Infrastructure

Network

Mobile-specific

Application-level

Scenarios

Network

Dependencies

State

Application

What to observe

How SUSA does chaos

SUSA's network_tester runs scripted chaos per exploration:

Each scenario verifies how the app degrades and recovers. Reports flag the flows that failed under chaos.


susatest-agent test myapp.apk --network packet_loss --steps 100
susatest-agent test myapp.apk --network network_recovery --steps 100

For backend chaos, pair SUSA with infrastructure chaos tools. SUSA drives the client; chaos tools inject failures on the server side.

Starting chaos testing

  1. Define blast radius. Which users does a chaos run affect?
  2. Define stop conditions. At what error rate do you abort?
  3. Choose an initial scenario. Keep it simple: one dependency slow for 5 minutes.
  4. Run in staging first. Build confidence.
  5. Run in production with on-call engineers standing by. This is the real test.
  6. Review findings. What surprised you?
  7. Fix or mitigate, then repeat quarterly.

Common findings

Anti-patterns

"Chaos testing" = fire drills

Running chaos without observability in place, alerts configured, or a game plan.

Chaos without hypothesis

"Let's see what breaks" without a specific scenario or expected behavior.

Chaos without stop criteria

Runs keep going after the system is clearly broken. They should abort fast.

Chaos without follow-through

Findings noted, never fixed. Waste.

Chaos testing is about building confidence through deliberate failure. Practice under controlled conditions; handle real incidents better.
