How to Test a Chatbot (Mobile, Web, and Messaging Channels)
Chatbots look simple — text in, text out — and fail in ways that no other UI fails. They misunderstand, loop, hallucinate, leak context, get jailbroken. Testing a chatbot well means covering linguisti
Chatbots look simple — text in, text out — and fail in ways that no other UI fails. They misunderstand, loop, hallucinate, leak context, get jailbroken. Testing a chatbot well means covering linguistic correctness, conversational state, safety, and the specific things users do that you did not predict. This guide covers the test matrix.
What to test
Intent recognition
- Common intents recognized with typical phrasings
- Synonyms and slang recognized
- Typos tolerated ("resrvation", "book hotl")
- Multiple languages if supported
- Intent confidence threshold appropriate — low confidence → clarify, not guess
Entity extraction
- Dates parsed in multiple formats ("next Friday", "2026-01-15", "tomorrow")
- Times parsed with timezone context
- Numbers with units ("100 dollars", "2 nights")
- Locations (city, address, landmarks)
- Names (user, product, etc.)
Conversation state
- Context retained across turns ("Book a flight to Paris." "From where?" "NYC.")
- Clarifying questions work when entity is missing
- User can correct previous input ("Actually make it Tuesday")
- Conversation can be restarted
- State timeouts (context expires after N minutes)
- Multiple parallel conversations isolated (different users)
Fallback and escalation
- Unrecognized intent → helpful fallback, not "I don't understand"
- Repeated failure → escalate to human
- Emergency / high-risk intent → always escalate or show safety resources
Response quality
- Answers factually correct (hallucination detection)
- Responses concise — no wall of text for a simple question
- Markdown / formatting renders correctly in chat UI
- Links valid and open correctly
- No stale information — dates, prices, availability up-to-date
Safety
- Harmful prompts refused (self-harm, violence, illegal advice)
- PII in user messages handled appropriately (not echoed publicly, not logged)
- Financial advice disclaimed
- Medical advice disclaimed or rejected
- Minor detection triggers appropriate safeguards
- Prompt injection attempts rejected ("Ignore previous instructions...")
Multi-turn coherence
- Bot does not contradict itself across turns
- Bot remembers what it promised ("I'll check that" actually follows up)
- Bot does not lose thread on long conversations
Voice / speech (if supported)
- Speech-to-text accurate
- Voice responses natural
- Interruption handled (user speaks over bot)
- Silence timeout appropriate
Channels
- Web widget responsive
- Mobile app chat renders correctly
- SMS / WhatsApp / Messenger formatting correct per channel
- Channel-specific features (buttons, cards, carousels) render
Performance
- First response within 2 seconds
- Streaming responses start within 500ms
- No disconnect on long conversations
- Concurrent sessions scale
Accessibility
- Screen reader announces incoming messages
- Text input labeled
- Timestamps readable
- High-contrast mode respected
- Audio responses have text alternative
How to build test cases
Scripted happy paths
Write down 20-30 common user goals. Execute each, end-to-end. Expected paths.
Adversarial scripts
Invalid inputs, rude language, off-topic, prompt injection attempts. Expected: graceful handling.
Golden sets
Maintain a set of (input, expected ideal response) pairs. Run on every deploy. Measure semantic similarity of actual vs ideal.
A/B against humans
Periodically have human experts rate bot responses on quality. Track quality score over time.
How SUSA handles chatbot testing
SUSA drives chat UIs with scripted conversation sequences, persona-appropriate language (novice is careful, impatient is terse), and detects:
- Failed turns (user message → bot error or "I don't understand")
- Looping responses
- Broken formatting
- Accessibility issues in chat components
For deeper conversational evaluation, pair SUSA with a dedicated conversational QA tool (or dedicated voice/chat QA platform).
susatest-agent test chatapp.apk --persona novice --steps 100
Common production bugs
- Bot repeats same response to any input — intent recognition broken
- Context leaks across users — one user's name mentioned to another
- Bot confidently gives wrong information — hallucination not caught
- Handoff to human never fires — escalation logic broken
- Bot loops when it does not know — "I'll check... checking... let me check..."
- Emoji / unicode breaks formatting — special chars unescaped
Chatbot quality is a moving target. Build test suites that grow with your intent catalog, evaluate continuously, and involve humans for quality ratings.
Test Your App Autonomously
Upload your APK or URL. SUSA explores like 10 real users — finds bugs, accessibility violations, and security issues. No scripts.
Try SUSA Free