
May 7, 2026

Comparing AI for Therapy

There’s a problem with how we benchmark AI for mental health.

AI therapy gets strong reactions. Critics call it dangerous, a simplification that flattens complex mental health into a chat window. They have a point about one specific failure mode: models trained to be agreeable tend to validate whatever the user brings, reinforcing distorted thinking rather than challenging it. That’s the opposite of what therapy is supposed to do.

On the other side, people are already using AI heavily for mental wellbeing, across a wide spectrum. Some just need to vent after a hard day. Others have serious ongoing issues and have quietly replaced their therapist with a chatbot. The r/therapyGPT community is one visible example of how widespread this has become.

Both groups are largely working with the same thing: an app like ChatGPT, prompted on a “feels right” basis, with no clinical grounding, no safety guardrails, and no real thought about what makes AI mental health support safe or effective. So we decided to benchmark something that was actually designed for this.

What we actually did

We ran one of the Withease production system prompts through MindEval, a research framework for evaluating AI mental health systems. MindEval generates 50 synthetic patient profiles spanning a wide demographic and clinical range, simulates a multi-turn therapy session per profile, then scores each conversation across five dimensions: clinical accuracy, ethical conduct, assessment quality, therapeutic relationship, and AI-specific communication quality.
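The shape of that pipeline (profiles, one simulated session per profile, a judge score per dimension, then averages) can be sketched as follows. This is only an illustration of the flow described above, not MindEval's actual code: every function name, the `turns` parameter, and the placeholder score of 3.0 are hypothetical stand-ins for the real patient simulator and LLM judge, and treating "Overall" as the mean of the five dimension averages is our assumption.

```python
from dataclasses import dataclass
from statistics import mean

# The five scoring dimensions, named as in the results table.
DIMENSIONS = [
    "Clinical Accuracy & Competence",
    "Ethical & Professional Conduct",
    "Assessment & Response",
    "Therapeutic Relationship & Alliance",
    "AI Communication Quality",
]


@dataclass
class PatientProfile:
    profile_id: int
    description: str


def generate_profiles(n):
    """Hypothetical stand-in for the synthetic patient generator."""
    return [PatientProfile(i, f"synthetic patient #{i}") for i in range(n)]


def simulate_session(profile, system_prompt, turns=3):
    """Hypothetical stand-in for the multi-turn simulation.

    A real run would alternate between a patient-simulator model and the
    model under test; here both sides are placeholder strings.
    """
    conversation = []
    for turn in range(turns):
        patient_msg = f"[patient {profile.profile_id}, turn {turn}]"
        reply = f"[reply under a {len(system_prompt)}-char system prompt]"
        conversation.append((patient_msg, reply))
    return conversation


def judge(conversation):
    """Hypothetical stand-in for the judge: one score per dimension.

    Returns a fixed placeholder (3.0), not a real measurement.
    """
    return {dim: 3.0 for dim in DIMENSIONS}


def run_benchmark(system_prompt, n_profiles=50):
    """Score one session per profile and average each dimension."""
    per_session = [
        judge(simulate_session(profile, system_prompt))
        for profile in generate_profiles(n_profiles)
    ]
    report = {dim: mean(s[dim] for s in per_session) for dim in DIMENSIONS}
    # Assumption: "Overall" is the unweighted mean of the dimension averages.
    report["Overall"] = mean(report[dim] for dim in DIMENSIONS)
    return report


if __name__ == "__main__":
    for name, score in run_benchmark("example system prompt").items():
        print(f"{name}: {score:.2f}")
```

Swapping the system prompt while holding the model, the profiles, and the judge fixed is what makes the two columns in a comparison like ours attributable to the prompt.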

We compared our results against the published MindEval paper baseline for Claude Sonnet 4.5, which uses the MindEval generic clinician prompt. Our run uses the same model with the Withease production prompt, adapted to pass MindEval’s patient profile fields.

The results

| Dimension | Paper baseline | Withease | Delta |
| --- | --- | --- | --- |
| Clinical Accuracy & Competence | 3.79 | 4.17 | +0.38 |
| Ethical & Professional Conduct | 4.35 | 4.52 | +0.17 |
| Assessment & Response | 3.67 | 4.34 | +0.67 |
| Therapeutic Relationship & Alliance | 3.66 | 4.49 | +0.83 |
| AI Communication Quality | 2.94 | 3.75 | +0.81 |
| Overall | 3.68 | 4.25 | +0.57 |
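For anyone who wants to check the arithmetic, the deltas and the Overall row can be re-derived from the per-dimension scores in a few lines. Two things here are assumptions on our part rather than statements from the paper: that Overall is the unweighted mean of the five dimensions (the rounded figures are consistent with that), and that the judge scale runs from 1 to 6.

```python
# Per-dimension scores copied from the results table: (paper baseline, Withease).
scores = {
    "Clinical Accuracy & Competence": (3.79, 4.17),
    "Ethical & Professional Conduct": (4.35, 4.52),
    "Assessment & Response": (3.67, 4.34),
    "Therapeutic Relationship & Alliance": (3.66, 4.49),
    "AI Communication Quality": (2.94, 3.75),
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Delta per dimension, rounded to two decimals like the table.
deltas = {name: round(ours - base, 2) for name, (base, ours) in scores.items()}

# Assumption: Overall is the unweighted mean of the five dimensions.
baseline_overall = round(mean(base for base, _ in scores.values()), 2)
withease_overall = round(mean(ours for _, ours in scores.values()), 2)
overall_delta = round(withease_overall - baseline_overall, 2)

# Assumption: the judge scale runs from 1 to 6, so its midpoint is 3.5.
SCALE_MIN, SCALE_MAX = 1, 6
midpoint = (SCALE_MIN + SCALE_MAX) / 2

if __name__ == "__main__":
    for name, delta in deltas.items():
        print(f"{name}: {delta:+.2f}")
    print(f"Overall: {baseline_overall} -> {withease_overall} ({overall_delta:+.2f})")
    print(f"Baseline AI Communication Quality below midpoint: "
          f"{scores['AI Communication Quality'][0] < midpoint}")
```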

A +0.57 overall gain from a system prompt alone is not what most people would expect. No fine-tuning, no architectural changes, just a few hundred words of natural language instructions telling the model how to behave.

The most striking individual number is AI Communication Quality. The paper’s baseline of 2.94 sits below the scale midpoint. That means the generic MindEval prompt produces noticeably mechanical output on this dimension. The Withease prompt has explicit style rules: conversational tone, no hollow validation, a hard limit on how much the AI piles into one response. That pushes the score up by 0.81.

Therapeutic Relationship & Alliance follows a similar pattern. The Withease prompt has explicit instructions around collaborative framing and autonomy support. The judge rubric rewards both directly, and the +0.83 gain reflects that.

Ethical & Professional Conduct shows the smallest gain (+0.17) because the paper baseline is already strong at 4.35. There’s less room to move.

Benchmarking weakness

MindEval has a structural problem worth flagging. The judge prompt instructs that scores of 5 and 6 should be rare, reserved for “truly exceptional performance,” with most conversations expected to fall in the 2–4 range. In practice, our run landed almost entirely between 4.0 and 4.9. A score of 5 was never reached overall.

With scores compressed into such a narrow band, the scale can’t distinguish between good and great. A more principled approach would be comparative judgment, where the judge decides which of two systems did better on a given session (the method LMSYS Chatbot Arena uses), or dropping the pre-calibration instruction entirely. As models improve, the “rare 5s” instruction will become increasingly miscalibrated, and systems that are genuinely better will still be capped in the 4s.
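To make the comparative idea concrete, here is a minimal sketch of turning per-session "which system was better" verdicts into ratings with a standard Elo update. It is one common way to aggregate pairwise preferences, not a description of how any particular leaderboard or MindEval works; the system names, the starting rating of 1000, and the K-factor of 32 are illustrative assumptions.

```python
def expected_score(rating_a, rating_b):
    """Elo-model probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings, a, b, outcome, k=32.0):
    """Update ratings in place after one pairwise judgment.

    outcome is 1.0 if the judge preferred a, 0.0 if it preferred b,
    and 0.5 for a tie.
    """
    expected_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (outcome - expected_a)
    ratings[b] += k * ((1.0 - outcome) - (1.0 - expected_a))


def rate(judgments, initial=1000.0, k=32.0):
    """Fold a sequence of (system_a, system_b, outcome) verdicts into ratings."""
    ratings = {}
    for a, b, outcome in judgments:
        ratings.setdefault(a, initial)
        ratings.setdefault(b, initial)
        update_elo(ratings, a, b, outcome, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical verdicts for illustration only, not results from our run.
    example = [
        ("prompt_x", "prompt_y", 1.0),
        ("prompt_x", "prompt_y", 0.5),
        ("prompt_x", "prompt_y", 0.0),
    ]
    for system, rating in rate(example).items():
        print(f"{system}: {rating:.1f}")
```

Because each verdict is relative, there is no ceiling to hit: a system that keeps winning keeps gaining rating, which is exactly the property an absolute scale with "rare 5s" lacks.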

Conclusion

Used alongside therapy rather than as a replacement, with proper grounding and explicit safety rules, a well-designed AI system performs meaningfully differently from the generic chatbot behavior critics typically describe. Try it: App Store