Chaos Testing: Finding Reliability Gaps Before Users Do
Chaos testing deliberately breaks things to see how your app handles breakage. Network drops, service failures, corrupted data, latency spikes, memory pressure. If the app degrades gracefully under stress, it is reliable. If it cascades, you have a production incident waiting.
What chaos tests
Unlike stress testing (maximum load), chaos testing injects arbitrary failures:
- Network partition between app and server
- Dependency service down (Redis, DB, auth provider)
- Latency injection (500ms on API calls)
- CPU / memory pressure
- Disk full
- Clock skew
- Packet loss / corruption
The question is not "does the system handle X?" but "does the system fail in a way that surprises us?"
Why it matters
Production incidents come from combinations: your service healthy, dependency flaky, retry logic buggy. Chaos testing surfaces the combinations.
Principles
1. Production is the best test environment
Chaos on staging is useful; chaos on production (carefully) is revealing. Netflix started there.
2. Start small, increase blast radius
First chaos: one pod in one AZ. If fine, two pods. Etc.
3. Have a game day
Dedicated time where engineers run chaos scenarios and observe. Teaches the system's real behavior.
4. Always have a stop button
One command kills all chaos. No "we'll wait it out" in the middle of a bad run.
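A minimal sketch of a kill switch, assuming a shared flag (a file here) that every injector polls before acting; the path and function names are hypothetical:

```python
import os
import random
import time

STOP_FILE = "/tmp/chaos.stop"  # hypothetical shared kill-switch location

def chaos_enabled() -> bool:
    """Chaos runs only while the stop file is absent."""
    return not os.path.exists(STOP_FILE)

def maybe_inject_latency(max_delay_s: float = 0.5) -> float:
    """Inject a random delay, unless the kill switch is set."""
    if not chaos_enabled():
        return 0.0
    delay = random.uniform(0.0, max_delay_s)
    time.sleep(delay)
    return delay
```

With this layout, `touch /tmp/chaos.stop` halts every injector with one command.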
Tools
Infrastructure
- Gremlin (commercial)
- Chaos Monkey (Netflix, open)
- Litmus (Kubernetes-native, open)
- Chaos Toolkit (flexible)
- AWS Fault Injection Simulator
Network
- tc (traffic control, Linux)
- Chaoskube
- Network Link Conditioner (macOS)
Mobile-specific
- Rooted device + custom iptables
- Network proxy with fault injection (Charles, mitmproxy)
Application-level
- Feature flag flips
- Chaos HTTP interceptors (random 500s, random 3s latency)
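A chaos HTTP interceptor can be a thin wrapper around a request handler. A pure-Python sketch; `ChaosInterceptor` and the handler shape (request in, `(status, body)` out) are hypothetical:

```python
import random
import time

class ChaosInterceptor:
    """Wraps a request handler; returns random 500s and injects latency.

    `rng` is injectable so tests can make the chaos deterministic.
    """

    def __init__(self, handler, error_rate=0.05, slow_rate=0.05,
                 delay_s=3.0, rng=None):
        self.handler = handler
        self.error_rate = error_rate
        self.slow_rate = slow_rate
        self.delay_s = delay_s
        self.rng = rng or random.Random()

    def __call__(self, request):
        if self.rng.random() < self.slow_rate:   # random latency spike
            time.sleep(self.delay_s)
        if self.rng.random() < self.error_rate:  # random server error
            return 500, "chaos: injected failure"
        return self.handler(request)
```

Setting `error_rate=1.0` during a game day forces every request through the failure path.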
Scenarios
Network
- Full outage (offline simulation)
- High latency (2s+ per request)
- Packet loss (20% drop)
- Slow bandwidth (3G speeds)
- Flapping (online / offline toggle every 30s)
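On Linux, several of these network scenarios can be reproduced with tc's netem qdisc. A sketch that builds and applies the command; it requires root, and the interface name and rates below are placeholders:

```python
import subprocess

def netem_command(interface: str, *, delay_ms: int = 0,
                  loss_pct: float = 0.0, rate_kbit: int = 0) -> list[str]:
    """Build a `tc` command applying netem impairments to `interface`."""
    cmd = ["tc", "qdisc", "add", "dev", interface, "root", "netem"]
    if delay_ms:
        cmd += ["delay", f"{delay_ms}ms"]      # high latency
    if loss_pct:
        cmd += ["loss", f"{loss_pct}%"]        # packet loss
    if rate_kbit:
        cmd += ["rate", f"{rate_kbit}kbit"]    # slow bandwidth
    return cmd

def apply_netem(interface: str, **impairments) -> None:
    """Apply impairments (needs root).

    Undo with: tc qdisc del dev <interface> root
    """
    subprocess.run(netem_command(interface, **impairments), check=True)
```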
Dependencies
- DB down
- Auth service down
- Cache down (all requests go to DB)
- CDN down
- One partner API down
State
- Full disk
- Corrupted database
- Clock skew (30 minutes off)
- Memory exhaustion
Application
- Random 500 errors from API (5% of requests)
- Random 500ms latency on API
- Specific user account in locked state
- Feature flag flipped unexpectedly
What to observe
- Did the app crash?
- Did the app freeze?
- Did user see a clear error?
- Did user lose data?
- Did the app recover when conditions normalized?
- Did monitoring detect and alert?
- Did automated remediation trigger?
How SUSA does chaos
SUSA's network_tester runs scripted chaos per exploration:
- 2G slowness
- High latency
- Packet loss
- Offline
- Recovery (offline → online)
Each scenario verifies the app's degradation and recovery. Reports flag flows that failed under chaos.
```
susatest-agent test myapp.apk --network packet_loss --steps 100
susatest-agent test myapp.apk --network network_recovery --steps 100
```
For backend chaos, pair SUSA with infrastructure chaos tools. SUSA drives the client; chaos tools inject failures on the server side.
Starting chaos testing
- Define blast radius. Which users are affected by the run?
- Define stop conditions. At what error rate do we abort?
- Choose initial scenario. Simple: one dependency slow for 5 min.
- Run in staging first. Build confidence.
- Run in production with on-call. Real test.
- Review findings. What surprised?
- Fix / mitigate. Repeat quarterly.
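The stop-conditions step above can be encoded directly as a rolling error-rate check. A sketch with illustrative numbers; the class name is hypothetical:

```python
from collections import deque

class StopCondition:
    """Abort a chaos run when the rolling error rate exceeds a threshold."""

    def __init__(self, threshold: float = 0.25, window: int = 100):
        self.threshold = threshold
        self.results = deque(maxlen=window)  # recent request outcomes

    def record(self, success: bool) -> None:
        self.results.append(success)

    def should_abort(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return False  # not enough samples to judge yet
        errors = self.results.count(False)
        return errors / len(self.results) > self.threshold
```

Wiring `should_abort()` to the kill switch makes the abort automatic instead of a judgment call mid-incident.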
Common findings
- Retries exacerbate outages. One retry per client × 10k clients = 10k extra requests while a downstream is slow.
- Timeouts stacked. Client timeout 30s; app-level 60s; network 120s. Requests pile up during degradation.
- No circuit breaker. Every request hits failing dependency; no backoff.
- Partial failure not handled. Some data loads, some 500s; UI shows blank for failed data.
- Cascading failures. One slow service backs up upstream service's thread pool.
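The first and third findings share a mitigation: stop hammering a failing dependency. A minimal circuit breaker sketch (the threshold and cooldown values are illustrative):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; reject calls until
    `cooldown_s` passes, then allow one trial call (half-open)."""

    def __init__(self, threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock       # injectable for deterministic tests
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True          # closed: normal operation
        if self.clock() - self.opened_at >= self.cooldown_s:
            return True          # half-open: let one trial through
        return False             # open: shed load instead of piling on

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
```

Callers check `allow()` before each request and `record()` the outcome; while the breaker is open, they fail fast with a cached or degraded response.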
Anti-patterns
Chaos as an unprepared fire drill
Running chaos without observability in place, alerts configured, or a game plan.
Chaos without hypothesis
"Let's see what breaks" without a specific scenario or expected behavior.
Chaos without stop criteria
Runs keep going after system is clearly broken. Should bail fast.
Chaos without follow-through
Findings noted, never fixed. Waste.
Chaos testing is about building confidence through deliberate failure. Practice under controlled conditions; handle real incidents better.
Test Your App Autonomously
Upload your APK or URL. SUSA explores like 10 real users — finds bugs, accessibility violations, and security issues. No scripts.
Try SUSA Free