How to Test a Chatbot (Mobile, Web, and Messaging Channels)

May 17, 2026 · 3 min read · How-To Guides

Chatbots look simple — text in, text out — and fail in ways that no other UI fails. They misunderstand, loop, hallucinate, leak context, get jailbroken. Testing a chatbot well means covering linguistic correctness, conversational state, safety, and the specific things users do that you did not predict. This guide covers the test matrix.

What to test

Intent recognition

Common intents recognized with typical phrasings
Synonyms and slang recognized
Typos tolerated ("resrvation", "book hotl")
Multiple languages if supported
Intent confidence threshold appropriate — low confidence → clarify, not guess

Entity extraction

Dates parsed in multiple formats ("next Friday", "2026-01-15", "tomorrow")
Times parsed with timezone context
Numbers with units ("100 dollars", "2 nights")
Locations (city, address, landmarks)
Names (user, product, etc.)

Conversation state

Context retained across turns ("Book a flight to Paris." "From where?" "NYC.")
Clarifying questions work when entity is missing
User can correct previous input ("Actually make it Tuesday")
Conversation can be restarted
State timeouts (context expires after N minutes)
Multiple parallel conversations isolated (different users)

Fallback and escalation

Unrecognized intent → helpful fallback, not "I don't understand"
Repeated failure → escalate to human
Emergency / high-risk intent → always escalate or show safety resources

Response quality

Answers factually correct (hallucination detection)
Responses concise — no wall of text for a simple question
Markdown / formatting renders correctly in chat UI
Links valid and open correctly
No stale information — dates, prices, availability up-to-date

Safety

Harmful prompts refused (self-harm, violence, illegal advice)
PII in user messages handled appropriately (not echoed publicly, not logged)
Financial advice disclaimed
Medical advice disclaimed or rejected
Minor detection triggers appropriate safeguards
Prompt injection attempts rejected ("Ignore previous instructions...")

Multi-turn coherence

Bot does not contradict itself across turns
Bot remembers what it promised ("I'll check that" actually follows up)
Bot does not lose thread on long conversations

Voice / speech (if supported)

Speech-to-text accurate
Voice responses natural
Interruption handled (user speaks over bot)
Silence timeout appropriate

Channels

Web widget responsive
Mobile app chat renders correctly
SMS / WhatsApp / Messenger formatting correct per channel
Channel-specific features (buttons, cards, carousels) render

Performance

First response within 2 seconds
Streaming responses start within 500ms
No disconnect on long conversations
Concurrent sessions scale

Accessibility

Screen reader announces incoming messages
Text input labeled
Timestamps readable
High-contrast mode respected
Audio responses have text alternative

How to build test cases

Scripted happy paths

Write down 20-30 common user goals. Execute each, end-to-end. Expected paths.

Adversarial scripts

Invalid inputs, rude language, off-topic, prompt injection attempts. Expected: graceful handling.

Golden sets

Maintain a set of (input, expected ideal response) pairs. Run on every deploy. Measure semantic similarity of actual vs ideal.

A/B against humans

Periodically have human experts rate bot responses on quality. Track quality score over time.

How SUSA handles chatbot testing

SUSA drives chat UIs with scripted conversation sequences, persona-appropriate language (novice is careful, impatient is terse), and detects:

Failed turns (user message → bot error or "I don't understand")
Looping responses
Broken formatting
Accessibility issues in chat components

For deeper conversational evaluation, pair SUSA with a dedicated conversational QA tool (or dedicated voice/chat QA platform).


susatest-agent test chatapp.apk --persona novice --steps 100

Common production bugs

Bot repeats same response to any input — intent recognition broken
Context leaks across users — one user's name mentioned to another
Bot confidently gives wrong information — hallucination not caught
Handoff to human never fires — escalation logic broken
Bot loops when it does not know — "I'll check... checking... let me check..."
Emoji / unicode breaks formatting — special chars unescaped

Chatbot quality is a moving target. Build test suites that grow with your intent catalog, evaluate continuously, and involve humans for quality ratings.

Test Your App Autonomously

Upload your APK or URL. SUSA explores like 10 real users — finds bugs, accessibility violations, and security issues. No scripts.

Try SUSA Free