EMPATH Benchmark 2026 · 19 metrics

MindSurf leads where it matters in mental health AI

On EMPATH, MindSurf's context-engineered gpt-5.4-mini outperforms the GPT-5.4-mini baseline on the safety-critical and clinical metrics that define a trustworthy mental health assistant.

🚨 +17.5 pts on Risk Trajectory Monitoring, the metric that catches worsening crises
🚨 Risk Trajectory Monitoring
Tracking worsening risk across a conversation (92.5% vs 75%)
🛟 Dependency Avoidance
Perfect score: avoids unhealthy emotional dependency (100% vs 90%)
🤝 Professional Referral
Refers to human professionals when appropriate (84% vs 76%)
🏆 Metrics won
MindSurf leads on 15 of 19 EMPATH metrics

Where MindSurf Wins

The overall scores are close (89.7% vs 86.0%), but averages hide what matters. MindSurf's lead concentrates exactly on the safety and clinical metrics a mental health product cannot get wrong.

MindSurf's advantage (percentage points) on the product-relevant metrics where it leads.

Overall Performance

Evaluated on all 102 mental-health scenarios in EMPATH, in Spanish (es-MX), across 19 metrics in 5 dimensions, scored by an LLM-as-judge on a 0–10 scale.

MindSurf (context-engineered gpt-5.4-mini)
GPT-5.4-mini (OpenAI baseline)

19 Metrics, 5 Dimensions

The full EMPATH breakdown: every metric, head to head. MindSurf leads its strongest dimensions, Crisis and Emotional Safety, where the stakes are highest.

Crisis

Safety-critical · MindSurf wins all 4 · +6.5 pts avg
Risk Trajectory Monitoring ▲ +17.5
MindSurf
92.5%
GPT-5.4-mini
75.0%

Detects whether risk is escalating across the conversation.

Harmful Response Avoidance ▲ +2.0
MindSurf
96.0%
GPT-5.4-mini
94.0%

Avoids responses that could cause harm. MindSurf's highest score in the Crisis dimension.

Crisis Detection ▲ +4.5
MindSurf
95.6%
GPT-5.4-mini
91.1%

Identifies crisis situations requiring immediate attention.

Resource Provision ▲ +2.0
MindSurf
88.0%
GPT-5.4-mini
86.0%

Provides appropriate helplines and crisis resources.

Emotional Safety

Safety · MindSurf leads · +3.3 pts avg
Dependency Avoidance ▲ +10.0
MindSurf
100.0%
GPT-5.4-mini
90.0%

Avoids fostering unhealthy emotional dependency on the bot.

Delusion Resistance ▲ +3.3
MindSurf
88.3%
GPT-5.4-mini
85.0%

Does not reinforce delusional or distorted beliefs.

Over-refusal Avoidance ▼ -2.0
MindSurf
88.0%
GPT-5.4-mini
90.0%

Stays helpful without excessive refusals.

Sycophancy Resistance ▲ +2.0
MindSurf
90.0%
GPT-5.4-mini
88.0%

Resists simply telling the user what they want to hear.

Cultural

Context · MindSurf leads · +3.3 pts avg
Boundary Maintenance ▲ +4.0
MindSurf
90.0%
GPT-5.4-mini
86.0%

Maintains professional boundaries appropriate to the role.

Cultural Sensitivity ▼ -2.0
MindSurf
90.0%
GPT-5.4-mini
92.0%

Responds appropriately to cultural context (es-MX).

Professional Referral ▲ +8.0
MindSurf
84.0%
GPT-5.4-mini
76.0%

Refers to human professionals when appropriate.

Therapeutic

Quality · MindSurf leads · 3 wins, 1 loss
Clinical Appropriateness ▲ +6.0
MindSurf
90.0%
GPT-5.4-mini
84.0%

Clinically sound, guideline-aligned responses.

Therapeutic Actions ▲ +4.0
MindSurf
88.0%
GPT-5.4-mini
84.0%

Applies appropriate therapeutic techniques.

Therapeutic Specialization ▼ -1.6
MindSurf
81.7%
GPT-5.4-mini
83.3%

Depth of specialized therapeutic knowledge.

Empathy vs Manipulation ▲ +4.3
MindSurf
84.3%
GPT-5.4-mini
80.0%

Supportive without manipulative framing.

Conversational

Quality · MindSurf leads · 3 wins, 1 loss
Language Consistency ▲ +5.0
MindSurf
91.7%
GPT-5.4-mini
86.7%

Stays consistent in the user's language and register.

Role Adherence ▲ +4.0
MindSurf
88.0%
GPT-5.4-mini
84.0%

Stays within its assistant role.

Sensitive Context Reintro. ▲ +1.6
MindSurf
88.3%
GPT-5.4-mini
86.7%

Carefully reintroduces sensitive prior context.

Context Retention ▼ -2.0
MindSurf
90.0%
GPT-5.4-mini
92.0%

Maintains conversation context across turns.
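
The headline figures can be re-derived from the per-metric scores listed above. The Python sketch below is only an illustration: it assumes the overall score is an unweighted mean of the 19 metric percentages (the page does not state the weighting, although this assumption reproduces the published 89.7% and 86.0%), and the names `SCORES` and `summarize` are hypothetical, not part of EMPATH.

```python
from statistics import mean

# (MindSurf %, GPT-5.4-mini baseline %) per metric, copied from the breakdown.
SCORES = {
    "Crisis": {
        "Risk Trajectory Monitoring": (92.5, 75.0),
        "Harmful Response Avoidance": (96.0, 94.0),
        "Crisis Detection": (95.6, 91.1),
        "Resource Provision": (88.0, 86.0),
    },
    "Emotional Safety": {
        "Dependency Avoidance": (100.0, 90.0),
        "Delusion Resistance": (88.3, 85.0),
        "Over-refusal Avoidance": (88.0, 90.0),
        "Sycophancy Resistance": (90.0, 88.0),
    },
    "Cultural": {
        "Boundary Maintenance": (90.0, 86.0),
        "Cultural Sensitivity": (90.0, 92.0),
        "Professional Referral": (84.0, 76.0),
    },
    "Therapeutic": {
        "Clinical Appropriateness": (90.0, 84.0),
        "Therapeutic Actions": (88.0, 84.0),
        "Therapeutic Specialization": (81.7, 83.3),
        "Empathy vs Manipulation": (84.3, 80.0),
    },
    "Conversational": {
        "Language Consistency": (91.7, 86.7),
        "Role Adherence": (88.0, 84.0),
        "Sensitive Context Reintro.": (88.3, 86.7),
        "Context Retention": (90.0, 92.0),
    },
}


def summarize(scores):
    """Aggregate per-metric scores, assuming every metric is weighted equally."""
    pairs = [pair for metrics in scores.values() for pair in metrics.values()]
    return {
        "metrics": len(pairs),
        # A "win" is any metric where MindSurf scores strictly higher.
        "wins": sum(1 for ours, base in pairs if ours > base),
        "overall_mindsurf": round(mean(ours for ours, _ in pairs), 1),
        "overall_baseline": round(mean(base for _, base in pairs), 1),
        # Mean point gap (MindSurf minus baseline) within each dimension.
        "avg_gap": {
            dim: round(mean(ours - base for ours, base in metrics.values()), 1)
            for dim, metrics in scores.items()
        },
    }


if __name__ == "__main__":
    summary = summarize(SCORES)
    print(summary["wins"], "of", summary["metrics"])  # 15 of 19
    print(summary["overall_mindsurf"], summary["overall_baseline"])  # 89.7 86.0
    print(summary["avg_gap"]["Crisis"])  # 6.5
```

Under the equal-weighting assumption, the per-dimension gaps also match the "+6.5 pts avg" and "+3.3 pts avg" labels shown for Crisis, Emotional Safety, and Cultural.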

Safety: The Crisis Dimension

In mental health, the Crisis dimension is non-negotiable. It is also where MindSurf's context engineering shows the clearest, most consistent lead over the baseline.

Catching the trajectory, not just the moment

Crisis detection at a single moment is necessary but not sufficient. MindSurf's biggest gain is Risk Trajectory Monitoring (recognising when a conversation is getting worse over time), where it scores 92.5% against the baseline's 75%.

The standout finding

On Risk Trajectory Monitoring, MindSurf leads by 17.5 points (92.5% vs 75%): the difference between noticing a user is deteriorating and missing it.

Risk Trajectory · MindSurf 92.5%
Risk Trajectory · GPT-5.4-mini 75%
Harmful Response · MindSurf 96%
Harmful Response · GPT-5.4-mini 94%

Crisis Dimension: all 4 metrics

86.5%
GPT-5.4-mini: Crisis dimension avg
Trails by 6.5 points
+17.5
Biggest single-metric gap
Risk Trajectory Monitoring

Benchmark Methodology

EMPATH is an evaluation framework built specifically for emotional-support and mental health AI, in Mexican Spanish.

📊 Benchmark Overview

  • 102 Scenarios
    Full EMPATH set: mental-health scenarios across all 5 dimensions
  • Spanish (es-MX) Locale
    Culturally appropriate testing for Mexican users
  • 19 Metrics · 5 Dimensions
    Crisis, Therapeutic, Conversational, Emotional Safety, Cultural
  • LLM-as-Judge Scoring
    Each metric scored 0–10 by the EMPATH judge

The 5 Dimensions

  • 🚨 Crisis (4)
    Detection, resources, risk trajectory, harmful-response avoidance
  • 🩺 Therapeutic (4)
    Clinical appropriateness, actions, empathy, specialization
  • 💬 Conversational (4)
    Context, role, language, sensitive reintroduction
  • 🛟 Emotional Safety (4)
    Sycophancy, delusion, over-refusal, dependency
  • 🌎 Cultural (3)
    Sensitivity, boundaries, professional referral

The Bottom Line

Overall scores are close, but in mental health, averages are not the point. MindSurf's context engineering concentrates its advantage on the safety-critical and clinical metrics that a real product cannot afford to get wrong, leading 15 of 19 EMPATH metrics and every metric in the Crisis dimension.

89.7%
Overall EMPATH Score
15/19
Metrics Won
+17.5
pts on Risk Trajectory