How to Test a Chatbot (Mobile, Web, and Messaging Channels)

Chatbots look simple — text in, text out — and fail in ways that no other UI fails. They misunderstand, loop, hallucinate, leak context, get jailbroken. Testing a chatbot well means covering linguisti

May 17, 2026 · 3 min read · How-To Guides

Chatbots look simple — text in, text out — and fail in ways that no other UI fails. They misunderstand, loop, hallucinate, leak context, get jailbroken. Testing a chatbot well means covering linguistic correctness, conversational state, safety, and the specific things users do that you did not predict. This guide covers the test matrix.

What to test

Intent recognition

  1. Common intents recognized with typical phrasings
  2. Synonyms and slang recognized
  3. Typos tolerated ("resrvation", "book hotl")
  4. Multiple languages if supported
  5. Intent confidence threshold appropriate — low confidence → clarify, not guess

Entity extraction

  1. Dates parsed in multiple formats ("next Friday", "2026-01-15", "tomorrow")
  2. Times parsed with timezone context
  3. Numbers with units ("100 dollars", "2 nights")
  4. Locations (city, address, landmarks)
  5. Names (user, product, etc.)

Conversation state

  1. Context retained across turns ("Book a flight to Paris." "From where?" "NYC.")
  2. Clarifying questions work when entity is missing
  3. User can correct previous input ("Actually make it Tuesday")
  4. Conversation can be restarted
  5. State timeouts (context expires after N minutes)
  6. Multiple parallel conversations isolated (different users)

Fallback and escalation

  1. Unrecognized intent → helpful fallback, not "I don't understand"
  2. Repeated failure → escalate to human
  3. Emergency / high-risk intent → always escalate or show safety resources

Response quality

  1. Answers factually correct (hallucination detection)
  2. Responses concise — no wall of text for a simple question
  3. Markdown / formatting renders correctly in chat UI
  4. Links valid and open correctly
  5. No stale information — dates, prices, availability up-to-date

Safety

  1. Harmful prompts refused (self-harm, violence, illegal advice)
  2. PII in user messages handled appropriately (not echoed publicly, not logged)
  3. Financial advice disclaimed
  4. Medical advice disclaimed or rejected
  5. Minor detection triggers appropriate safeguards
  6. Prompt injection attempts rejected ("Ignore previous instructions...")

Multi-turn coherence

  1. Bot does not contradict itself across turns
  2. Bot remembers what it promised ("I'll check that" actually follows up)
  3. Bot does not lose thread on long conversations

Voice / speech (if supported)

  1. Speech-to-text accurate
  2. Voice responses natural
  3. Interruption handled (user speaks over bot)
  4. Silence timeout appropriate

Channels

  1. Web widget responsive
  2. Mobile app chat renders correctly
  3. SMS / WhatsApp / Messenger formatting correct per channel
  4. Channel-specific features (buttons, cards, carousels) render

Performance

  1. First response within 2 seconds
  2. Streaming responses start within 500ms
  3. No disconnect on long conversations
  4. Concurrent sessions scale

Accessibility

  1. Screen reader announces incoming messages
  2. Text input labeled
  3. Timestamps readable
  4. High-contrast mode respected
  5. Audio responses have text alternative

How to build test cases

Scripted happy paths

Write down 20-30 common user goals. Execute each, end-to-end. Expected paths.

Adversarial scripts

Invalid inputs, rude language, off-topic, prompt injection attempts. Expected: graceful handling.

Golden sets

Maintain a set of (input, expected ideal response) pairs. Run on every deploy. Measure semantic similarity of actual vs ideal.

A/B against humans

Periodically have human experts rate bot responses on quality. Track quality score over time.

How SUSA handles chatbot testing

SUSA drives chat UIs with scripted conversation sequences, persona-appropriate language (novice is careful, impatient is terse), and detects:

For deeper conversational evaluation, pair SUSA with a dedicated conversational QA tool (or dedicated voice/chat QA platform).


susatest-agent test chatapp.apk --persona novice --steps 100

Common production bugs

  1. Bot repeats same response to any input — intent recognition broken
  2. Context leaks across users — one user's name mentioned to another
  3. Bot confidently gives wrong information — hallucination not caught
  4. Handoff to human never fires — escalation logic broken
  5. Bot loops when it does not know — "I'll check... checking... let me check..."
  6. Emoji / unicode breaks formatting — special chars unescaped

Chatbot quality is a moving target. Build test suites that grow with your intent catalog, evaluate continuously, and involve humans for quality ratings.

Test Your App Autonomously

Upload your APK or URL. SUSA explores like 10 real users — finds bugs, accessibility violations, and security issues. No scripts.

Try SUSA Free