Benchmark Report 2026

MindSurf Outperforms ChatGPT-5.1 in Mental Health AI

Our fine-tuned model achieves a 90.5% overall score with perfect safety metrics, demonstrating the power of specialized training for mental health applications.

GPT-4o-mini (OpenAI Direct)
GPT-5.1 (OpenAI Direct)

📈 Over ChatGPT-5.1 (OpenAI): higher overall benchmark score
🚀 Over ChatGPT-4o-mini (OpenAI): significant performance improvement
🛡️ Safety Score: 100%, perfect crisis detection rate
⚠️ Fewer Problems: compared to ChatGPT-5.1 (OpenAI) (3 vs 7)

Overall Performance Comparison

Comprehensive benchmark testing across 300 test cases in Spanish (es-MX) evaluating safety, therapeutic value, and conversational quality.

MindSurf (fine-tuning:gpt-4.1-mini)
ChatGPT-5.1 (OpenAI)
ChatGPT-4o-mini (OpenAI)

Metric-by-Metric Breakdown

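Detailed analysis of 6 key metrics across safety and quality dimensions. MindSurf leads outright on three of them (Response Length, BERTScore F1, and Diversity), ties for the top score on both safety metrics, and trails ChatGPT-4o-mini only on Context Retention (97.2% vs. 100%).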

Crisis Detection (CDR) [Safety]
MindSurf: 100%
GPT-5.1: 100%
GPT-4o-mini: 25%
Measures ability to detect crisis situations requiring immediate intervention.
Target achieved: 100%

Resource Provision (RPR) [Safety]
MindSurf: 100%
GPT-5.1: 100%
GPT-4o-mini: 100%
Evaluates provision of appropriate crisis resources and helpline information.
Target achieved: 100%

Response Length [Quality]
MindSurf: 93.3%
GPT-5.1: 80%
GPT-4o-mini: 56.7%
Measures appropriate response length within therapeutic guidelines.
Exceeds target: 80%

Context Retention [Quality]
MindSurf: 97.2%
GPT-5.1: 97.2%
GPT-4o-mini: 100%
Evaluates ability to maintain conversation context across multiple turns.
Exceeds target: 90%

BERTScore F1 [Quality]
MindSurf: 67.3%
GPT-5.1: 62.7%
GPT-4o-mini: 63.6%
Semantic similarity measure using BERT embeddings for response quality.
Exceeds target: 65%

Diversity [Quality]
MindSurf: 66.3%
GPT-5.1: 48%
GPT-4o-mini: 42.1%
Measures vocabulary diversity and avoidance of repetitive patterns.
Exceeds target: 50%
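
The per-metric comparison can be checked directly from the reported numbers. The sketch below is only an illustration: the scores and targets are copied from the cards above (third column is ChatGPT-4o-mini), while the helper names `leaders` and `meets_target` are hypothetical, and the rule that a metric passes when the score is at or above its target is an assumption, since the report does not state how targets are applied.

```python
# Scores (%) as reported in the metric breakdown.
# Column order: MindSurf, GPT-5.1, GPT-4o-mini.
MODELS = ("MindSurf", "GPT-5.1", "GPT-4o-mini")

# metric -> ((scores per model), target)
REPORTED = {
    "Crisis Detection (CDR)": ((100.0, 100.0, 25.0), 100.0),
    "Resource Provision (RPR)": ((100.0, 100.0, 100.0), 100.0),
    "Response Length": ((93.3, 80.0, 56.7), 80.0),
    "Context Retention": ((97.2, 97.2, 100.0), 90.0),
    "BERTScore F1": ((67.3, 62.7, 63.6), 65.0),
    "Diversity": ((66.3, 48.0, 42.1), 50.0),
}


def leaders(scores):
    """Return every model that shares the top score (ties included)."""
    best = max(scores)
    return [model for model, score in zip(MODELS, scores) if score == best]


def meets_target(score, target):
    """Assumption: a metric passes when the score is at or above the target."""
    return score >= target


for metric, (scores, target) in REPORTED.items():
    status = "meets" if meets_target(scores[0], target) else "misses"
    print(f"{metric}: top = {', '.join(leaders(scores))}; "
          f"MindSurf {status} the {target:g}% target")
```

Run as-is, this lists MindSurf as sole leader on Response Length, BERTScore F1, and Diversity, as a joint leader on the two safety metrics, and ChatGPT-4o-mini as the leader on Context Retention.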

Safety: The Critical Differentiator

In mental health applications, safety is non-negotiable. MindSurf achieves perfect scores where ChatGPT-4o-mini critically fails.

Perfect Crisis Detection

MindSurf correctly identifies 100% of crisis situations, including suicidal ideation, self-harm indicators, and acute distress signals. This is critical for mental health applications where missing a crisis could have severe consequences.

Critical Finding

ChatGPT-4o-mini failed to detect 75% of crisis situations in our benchmark, including 3 critical failures that could have resulted in harm in a real-world scenario.

Crisis detection rate:
MindSurf: 100%
ChatGPT-5.1 (OpenAI): 100%
ChatGPT-4o-mini (OpenAI): 25%
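
The report does not spell out how the crisis detection rate is computed. A minimal sketch, assuming it is the share of crisis-labelled test cases that the model flags (recall on the crisis class), might look like this. The function name and the toy labels are hypothetical and are not taken from the benchmark data.

```python
def crisis_detection_rate(labels, flagged):
    """Share of true crisis cases that the model flagged.

    labels[i]  -- True if test case i is a crisis case (ground truth)
    flagged[i] -- True if the model's response treated case i as a crisis
    """
    crisis_idx = [i for i, is_crisis in enumerate(labels) if is_crisis]
    if not crisis_idx:
        raise ValueError("no crisis cases in the evaluation set")
    detected = sum(1 for i in crisis_idx if flagged[i])
    return detected / len(crisis_idx)


# Hypothetical toy data: four crisis cases, only the first one is detected.
labels = [True, True, True, True, False, False]
flagged = [True, False, False, False, False, True]

cdr = crisis_detection_rate(labels, flagged)
print(f"CDR = {cdr:.0%}, missed = {1 - cdr:.0%}")
```

Under this definition a 25% detection rate and a 75% miss rate are the same measurement seen from two sides, and a false alarm on a non-crisis case (the last toy entry) does not affect the rate.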

Safety Metrics Comparison

ChatGPT-5.1 (OpenAI): 7 problems, 0 critical failures
ChatGPT-4o-mini (OpenAI): 16 problems, 3 critical failures

Quality Analysis

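Beyond safety, MindSurf's fine-tuning delivers the highest scores on three of the four quality metrics (Response Length, BERTScore F1, and Diversity). On Context Retention it matches ChatGPT-5.1 at 97.2% and trails ChatGPT-4o-mini at 100%.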
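
For readers unfamiliar with BERTScore F1, one of the quality metrics reported here, the sketch below shows the greedy-matching idea behind it. It is only an illustration under stated assumptions: real BERTScore uses contextual token embeddings from a BERT-family model (with optional idf weighting and baseline rescaling), whereas this example uses hand-made 2-D vectors, and `greedy_match_f1` is a hypothetical helper, not the benchmark's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_f1(cand_vecs, ref_vecs):
    """F1 from greedy token matching over embedding similarities.

    Precision: each candidate token is matched to its most similar reference token.
    Recall: each reference token is matched to its most similar candidate token.
    """
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy "token embeddings" (hypothetical values, one vector per token).
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0]]

print(round(greedy_match_f1(candidate, reference), 4))
```

A response identical to its reference scores 1.0 under this scheme, and partial semantic overlap yields a value in between, which is why the reported figures sit well below 100% even for good responses.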

Benchmark Methodology

Rigorous evaluation framework designed specifically for mental health AI applications.

📊 Benchmark Overview

  1. 300 Test Cases
     Comprehensive coverage of mental health scenarios
  2. Spanish (es-MX) Locale
     Culturally appropriate testing for Latin American users
  3. 3 Categories
     Safety Critical, Therapeutic Value, Conversational Quality
  4. Dual Validation
     Algorithmic metrics + LLM-as-Judge evaluation

Evaluated Metrics

  • CDR (Crisis Detection Rate): identifies crisis situations requiring intervention
  • RPR (Resource Provision Rate): provides appropriate helpline and resource information
  • BS (BERTScore F1): semantic similarity using BERT embeddings
  • DIV (Response Diversity): vocabulary richness and pattern variation
  • CTX (Context Retention): maintains conversation context across turns
  • LEN (Response Length): appropriate length within therapeutic guidelines
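
The report does not give formulas for the DIV and LEN metrics. As a rough sketch of how such algorithmic metrics are often computed, the example below uses a distinct-n ratio for diversity and a word-count window for length. Both choices, the function names, and the word limits are assumptions for illustration only, not the benchmark's actual definitions.

```python
def distinct_n(responses, n=2):
    """Unique n-grams divided by total n-grams across all responses.

    Higher values mean less repetition. Tokenization is a plain
    lowercase whitespace split, which is an assumption.
    """
    total = 0
    unique = set()
    for text in responses:
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


def length_compliance(responses, min_words, max_words):
    """Share of responses whose word count falls inside [min_words, max_words]."""
    if not responses:
        raise ValueError("no responses to evaluate")
    ok = sum(1 for text in responses if min_words <= len(text.split()) <= max_words)
    return ok / len(responses)


# Hypothetical responses, not taken from the benchmark.
responses = [
    "take a deep breath and count",
    "take a deep breath and call a friend",
]

print(f"distinct-2 = {distinct_n(responses):.3f}")
print(f"length compliance = {length_compliance(responses, 5, 6):.0%}")
```

The two toy responses share their opening phrase, so the repeated bigrams pull the diversity ratio below 1, and only the first response fits the illustrative 5 to 6 word window.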

The Bottom Line

MindSurf's specialized fine-tuning demonstrates that domain-specific optimization is essential for mental health AI applications. Our model not only outperforms larger general-purpose models but does so with perfect safety metrics.

Overall Score: 90.5%
Safety Score: 100%
Critical Failures: 0