Chaos Testing: Finding Reliability Gaps Before Users Do
Chaos testing deliberately breaks things to see how your app handles breakage. Network drops, service failures, corrupted data, latency spikes, memory pressure. If the app degrades gracefully under stress, it is reliable. If it cascades, you have a production incident waiting.
What chaos tests
Unlike stress testing (maximum load), chaos testing injects arbitrary failures:
- Network partition between app and server
- Dependency service down (Redis, DB, auth provider)
- Latency injection (500ms on API calls)
- CPU / memory pressure
- Disk full
- Clock skew
- Packet loss / corruption
The question is not "does the system handle X?" but "does the system fail in a way that surprises us?"
Why it matters
Production incidents come from combinations: your service healthy, dependency flaky, retry logic buggy. Chaos testing surfaces the combinations.
Principles
1. Production is the best test environment
Chaos on staging is useful; chaos on production (carefully) is revealing. Netflix started there.
2. Start small, increase blast radius
First chaos: one pod in one AZ. If fine, two pods. Etc.
3. Have a game day
Dedicated time where engineers run chaos scenarios and observe. Teaches the system's real behavior.
4. Always have a stop button
One command kills all chaos. No "we'll wait it out" in the middle of a bad run.
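A minimal sketch of a kill switch, assuming a shared flag (a file here) that every injector polls before acting; the path and function names are hypothetical:

```python
import os
import random
import time

STOP_FILE = "/tmp/chaos.stop"  # hypothetical shared kill-switch location

def chaos_enabled() -> bool:
    """Chaos runs only while the stop file is absent."""
    return not os.path.exists(STOP_FILE)

def maybe_inject_latency(max_delay_s: float = 0.5) -> float:
    """Inject a random delay, unless the kill switch is set."""
    if not chaos_enabled():
        return 0.0
    delay = random.uniform(0.0, max_delay_s)
    time.sleep(delay)
    return delay
```

With this layout, `touch /tmp/chaos.stop` halts every injector with one command.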
Tools
Infrastructure
- Gremlin (commercial)
- Chaos Monkey (Netflix, open)
- Litmus (Kubernetes-native, open)
- Chaos Toolkit (flexible)
- AWS Fault Injection Simulator
Network
- tc (traffic control, Linux)
- Chaoskube
- Network Link Conditioner (macOS)
Mobile-specific
- Rooted device + custom iptables
- Network proxy with fault injection (Charles, mitmproxy)
Application-level
- Feature flag flips
- Chaos HTTP interceptors (random 500s, random 3s latency)
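A chaos HTTP interceptor can be a thin wrapper around a request handler. A pure-Python sketch; `ChaosInterceptor` and the handler shape (request in, `(status, body)` out) are hypothetical:

```python
import random
import time

class ChaosInterceptor:
    """Wraps a request handler; returns random 500s and injects latency.

    `rng` is injectable so tests can make the chaos deterministic.
    """

    def __init__(self, handler, error_rate=0.05, slow_rate=0.05,
                 delay_s=3.0, rng=None):
        self.handler = handler
        self.error_rate = error_rate
        self.slow_rate = slow_rate
        self.delay_s = delay_s
        self.rng = rng or random.Random()

    def __call__(self, request):
        if self.rng.random() < self.slow_rate:   # random latency spike
            time.sleep(self.delay_s)
        if self.rng.random() < self.error_rate:  # random server error
            return 500, "chaos: injected failure"
        return self.handler(request)
```

Setting `error_rate=1.0` during a game day forces every request through the failure path.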
Scenarios
Network
- Full outage (offline simulation)
- High latency (2s+ per request)
- Packet loss (20% drop)
- Slow bandwidth (3G speeds)
- Flapping (online / offline toggle every 30s)
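On Linux, several of these network scenarios can be reproduced with tc's netem qdisc. A sketch that builds and applies the command; it requires root, and the interface name and rates below are placeholders:

```python
import subprocess

def netem_command(interface: str, *, delay_ms: int = 0,
                  loss_pct: float = 0.0, rate_kbit: int = 0) -> list[str]:
    """Build a `tc` command applying netem impairments to `interface`."""
    cmd = ["tc", "qdisc", "add", "dev", interface, "root", "netem"]
    if delay_ms:
        cmd += ["delay", f"{delay_ms}ms"]      # high latency
    if loss_pct:
        cmd += ["loss", f"{loss_pct}%"]        # packet loss
    if rate_kbit:
        cmd += ["rate", f"{rate_kbit}kbit"]    # slow bandwidth
    return cmd

def apply_netem(interface: str, **impairments) -> None:
    """Apply impairments (needs root).

    Undo with: tc qdisc del dev <interface> root
    """
    subprocess.run(netem_command(interface, **impairments), check=True)
```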
Dependencies
- DB down
- Auth service down
- Cache down (all requests go to DB)
- CDN down
- One partner API down
State
- Full disk
- Corrupted database
- Clock skew (30 minutes off)
- Memory exhaustion
Application
- Random 500 errors from API (5% of requests)
- Random 500ms latency on API
- Specific user account in locked state
- Feature flag flipped unexpectedly
What to observe
- Did the app crash?
- Did the app freeze?
- Did user see a clear error?
- Did user lose data?
- Did the app recover when conditions normalized?
- Did monitoring detect and alert?
- Did automated remediation trigger?
How SUSA does chaos
SUSA's network_tester runs scripted chaos per exploration:
- 2G slowness
- High latency
- Packet loss
- Offline
- Recovery (offline → online)
Each scenario verifies the app's degradation and recovery. Reports flag flows that failed under chaos.
```
susatest-agent test myapp.apk --network packet_loss --steps 100
susatest-agent test myapp.apk --network network_recovery --steps 100
```
For backend chaos, pair SUSA with infrastructure chaos tools. SUSA drives the client; chaos tools inject failures on the server side.
Starting chaos testing
- Define blast radius. Which users are affected by the run?
- Define stop conditions. At what error rate do we abort?
- Choose initial scenario. Simple: one dependency slow for 5 min.
- Run in staging first. Build confidence.
- Run in production with on-call. Real test.
- Review findings. What surprised?
- Fix / mitigate. Repeat quarterly.
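The stop-conditions step above can be encoded directly as a rolling error-rate check. A sketch with illustrative numbers; the class name is hypothetical:

```python
from collections import deque

class StopCondition:
    """Abort a chaos run when the rolling error rate exceeds a threshold."""

    def __init__(self, threshold: float = 0.25, window: int = 100):
        self.threshold = threshold
        self.results = deque(maxlen=window)  # recent request outcomes

    def record(self, success: bool) -> None:
        self.results.append(success)

    def should_abort(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return False  # not enough samples to judge yet
        errors = self.results.count(False)
        return errors / len(self.results) > self.threshold
```

Wiring `should_abort()` to the kill switch makes the abort automatic instead of a judgment call mid-incident.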
Common findings
- Retries exacerbate outages. One retry per client × 10k clients = 10k extra requests while a downstream is slow.
- Timeouts stacked. Client timeout 30s; app-level 60s; network 120s. Requests pile up during degradation.
- No circuit breaker. Every request hits failing dependency; no backoff.
- Partial failure not handled. Some data loads, some 500s; UI shows blank for failed data.
- Cascading failures. One slow service backs up upstream service's thread pool.
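The first and third findings share a mitigation: stop hammering a failing dependency. A minimal circuit breaker sketch (the threshold and cooldown values are illustrative):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; reject calls until
    `cooldown_s` passes, then allow one trial call (half-open)."""

    def __init__(self, threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock       # injectable for deterministic tests
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True          # closed: normal operation
        if self.clock() - self.opened_at >= self.cooldown_s:
            return True          # half-open: let one trial through
        return False             # open: shed load instead of piling on

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
```

Callers check `allow()` before each request and `record()` the outcome; while the breaker is open, they fail fast with a cached or degraded response.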
Anti-patterns
Chaos as an unprepared fire drill
Running chaos without observability in place, alerts configured, or a game plan.
Chaos without hypothesis
"Let's see what breaks" without a specific scenario or expected behavior.
Chaos without stop criteria
Runs keep going after system is clearly broken. Should bail fast.
Chaos without follow-through
Findings noted, never fixed. Waste.
Chaos testing is about building confidence through deliberate failure. Practice under controlled conditions; handle real incidents better.
Test Your App Autonomously
Upload your APK or URL. SUSA explores like 10 real users — finds bugs, accessibility violations, and security issues. No scripts.
Try SUSA Free