On EMPATH, MindSurf's context-engineered GPT-5.4-mini outperforms the GPT-5.4-mini baseline on the safety-critical and clinical metrics that define a trustworthy mental health assistant.
The overall scores are close (89.7% vs 86.0%), but averages hide what matters. MindSurf's lead concentrates exactly on the safety and clinical metrics a mental health product cannot get wrong.
MindSurf's advantage (percentage points) on the product-relevant metrics where it leads.
Evaluated on all 102 mental-health scenarios in EMPATH, in Spanish (es-MX), across 19 metrics in 5 dimensions, scored by an LLM-as-judge on a 0–10 scale.
The full EMPATH breakdown: every metric, head to head. MindSurf leads in its strongest dimensions, Crisis and Emotional Safety, where the stakes are highest.
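As a rough illustration of how 0–10 judge scores could relate to the percentages reported on this page, the sketch below averages per-scenario scores for one metric and rescales the mean to 0–100. This mapping is an assumption, not something the page states; the function name `metric_percentage` and the example scores are hypothetical.

```python
from statistics import mean


def metric_percentage(judge_scores: list[float]) -> float:
    """Rescale the mean of 0-10 judge scores to a 0-100 percentage.

    Assumption: percentage = mean score * 10. The actual EMPATH
    aggregation may differ.
    """
    if not judge_scores:
        raise ValueError("at least one judge score is required")
    if any(not 0 <= score <= 10 for score in judge_scores):
        raise ValueError("judge scores must lie on the 0-10 scale")
    return round(float(mean(judge_scores)) * 10, 1)


# Made-up scores for four scenarios on a single metric (illustrative only).
example_scores = [9, 10, 8, 9]
print(metric_percentage(example_scores))
```

Under this assumption, a metric whose scenarios average 9 out of 10 would be reported as 90%.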
Detects whether risk is escalating across the conversation.
Avoids responses that could cause harm. MindSurf's top safety score.
Identifies crisis situations requiring immediate attention.
Provides appropriate helplines and crisis resources.
Avoids fostering unhealthy emotional dependency on the bot.
Does not reinforce delusional or distorted beliefs.
Stays helpful without excessive refusals.
Resists simply telling the user what they want to hear.
Maintains professional boundaries appropriate to the role.
Responds appropriately to cultural context (es-MX).
Refers to human professionals when appropriate.
Clinically sound, guideline-aligned responses.
Applies appropriate therapeutic techniques.
Depth of specialized therapeutic knowledge.
Supportive without manipulative framing.
Stays consistent in the user's language and register.
Stays within its assistant role.
Carefully reintroduces sensitive prior context.
Maintains conversation context across turns.
In mental health, the Crisis dimension is non-negotiable. It is also where MindSurf's context engineering shows the clearest, most consistent lead over the baseline.
Crisis detection at a single moment is necessary but not sufficient. MindSurf's biggest gain is in Risk Trajectory Monitoring (recognising when a conversation is getting worse over time), where it scores 92.5% against the baseline's 75%.
On Risk Trajectory Monitoring, MindSurf leads by +17.5 points (92.5% vs 75%), the difference between noticing a user is deteriorating and missing it.
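The lead quoted here is a difference in percentage points, which is not the same as a relative percentage gain. A minimal sketch of the arithmetic, using only the two scores stated above (the helper names are hypothetical):

```python
def lead_in_points(candidate_pct: float, baseline_pct: float) -> float:
    """Absolute lead in percentage points (candidate minus baseline)."""
    return round(candidate_pct - baseline_pct, 1)


def relative_gain_pct(candidate_pct: float, baseline_pct: float) -> float:
    """Lead expressed as a percentage of the baseline score."""
    return round((candidate_pct - baseline_pct) / baseline_pct * 100, 1)


# Risk Trajectory Monitoring: MindSurf 92.5% vs baseline 75%.
print(lead_in_points(92.5, 75.0))     # 17.5 percentage points
print(relative_gain_pct(92.5, 75.0))  # 23.3 percent relative to the baseline
```

The page reports the first figure (+17.5 points); the relative figure is shown only to make the distinction clear.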
EMPATH: an evaluation framework built specifically for emotional-support and mental health AI, in Mexican Spanish.
Overall scores are close, but in mental health, averages are not the point. MindSurf's context engineering concentrates its advantage on the safety-critical and clinical metrics that a real product cannot afford to get wrong, leading 15 of 19 EMPATH metrics and every metric in the Crisis dimension.