PLATFORM // EVALUATE

COMPREHENSIVE VLM
EVALUATION AT EVERY CHECKPOINT.

Track loss curves, measure generation quality with 10+ metrics, scrub through checkpoints to watch predictions improve, and compare training runs side by side. Know exactly when your model is ready.

[Dashboard preview] Datature / Cosmos-Reason2 8B / Evaluation — Run #47, Training Complete. Epoch 18/20 on 4× NVIDIA A100. Total loss chart: train 0.42, val 0.68.

METRICS SUITE

10+ METRICS FOR VLM GENERATION QUALITY.

Every metric is computed automatically at each checkpoint. Track text generation quality, spatial accuracy, and semantic alignment in a single dashboard.

BERTScore F1

0.89

Contextual embedding similarity between generated and reference text. Captures semantic meaning beyond exact word overlap. Computed using a frozen BERT encoder.

BLEU-4

0.71

N-gram precision with brevity penalty. Measures how many 4-gram sequences in the generated output match the reference. Standard metric for machine translation and captioning.
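As an illustration of the mechanics, a minimal sentence-level BLEU can be sketched in pure Python. This is a simplified sketch, not the platform's implementation: production scorers add smoothing and compute corpus-level statistics, and the `bleu` and `ngrams` names here are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * geo_mean
```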

CIDEr

1.42

Consensus-based image description evaluation. Weights n-grams by TF-IDF to reward informative, discriminative descriptions over generic ones.

ROUGE-L

0.82

Longest common subsequence between prediction and reference. Measures recall-oriented fluency. Captures sentence-level structure without requiring consecutive matches.
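The LCS-based F1 described above fits in a few lines of pure Python; a hedged sketch (the platform's scorer may apply tokenization and stemming this example omits):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(prediction, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```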

CLIPScore

0.78

Vision-language alignment score. Measures cosine similarity between CLIP embeddings of the generated text and the input image. Evaluates grounding quality.

VQA Accuracy

0.91

Exact-match and soft-match accuracy on visual question answering tasks. Uses the VQA 2.0 evaluation protocol with 10-way human agreement normalization.
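The soft-match rule reduces to min(matches / 3, 1): an answer given by at least 3 of the 10 annotators scores full credit. A simplified sketch, omitting the official answer normalization and leave-one-annotator-out averaging of the VQA 2.0 protocol:

```python
def vqa_soft_accuracy(predicted, human_answers):
    """Simplified VQA 2.0 soft accuracy: a prediction is fully correct
    if at least 3 of the (typically 10) human annotators gave it."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)
```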

Bounding Box F1

0.94

Harmonic mean of precision and recall for spatial coordinate predictions. Matches predictions to ground truth using an IoU threshold of 0.5.

IoU (Intersection over Union)

0.86

Spatial overlap between predicted and ground truth bounding boxes. Averaged across all matched pairs. Threshold-free measure of localization accuracy.
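For reference, IoU for axis-aligned boxes in (x1, y1, x2, y2) format is a direct computation; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```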

Precision / Recall

0.96 / 0.92

Precision measures the fraction of predicted elements that are correct, penalizing false positives, for both text tokens and spatial predictions. Recall measures coverage of ground truth elements. Tracked independently per task type.

mAP (Mean Average Precision)

0.88

Area under the precision-recall curve, averaged across all classes. Evaluated at IoU thresholds from 0.5 to 0.95 in 0.05 increments (COCO-style mAP).

Perplexity

12.4

Exponentiated cross-entropy loss. Measures how confidently the model predicts the next token. Lower is better. Tracked per epoch alongside loss curves.
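The relationship is exactly the one stated: perplexity is the exponential of the mean per-token negative log-likelihood. A two-line sketch, assuming the per-token NLLs are already available:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token).
    A model that assigns each token probability 1/k has perplexity k."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```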

METEOR

0.76

Harmonic mean of unigram precision and recall with stemming, synonymy, and paraphrase matching. More linguistically aware than BLEU for caption evaluation.

CHECKPOINT SCRUBBER

WATCH PREDICTIONS IMPROVE.

Scrub through every saved checkpoint and see side-by-side ground truth versus model predictions. Observe how output quality evolves across training epochs. Identify the exact checkpoint where your model reaches production quality.

  • Side-by-Side View

    Ground truth on the left, model prediction on the right. Both text and spatial outputs are rendered in the same coordinate frame for direct comparison.

  • Epoch Scrubber

    Drag the epoch slider to jump between checkpoints. Predictions update in real time. Each checkpoint stores the full output for every validation sample.

  • Per-Sample Drill-Down

    Click any validation sample to inspect the complete prediction history across all checkpoints. Identify failure modes at the individual sample level.

  • Diff Highlighting

    Automatic text diff between prediction and ground truth. Insertions, deletions, and substitutions are color-coded for rapid visual inspection.
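The diff highlighting described above can be approximated with the standard library's `difflib`; a sketch of the word-level tagging a UI could color-code (not the platform's actual renderer):

```python
import difflib

def word_diff(ground_truth, prediction):
    """Word-level diff between ground truth and prediction, tagging each
    span as equal, insert, delete, or replace."""
    gt, pred = ground_truth.split(), prediction.split()
    matcher = difflib.SequenceMatcher(a=gt, b=pred)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ops.append((tag, " ".join(gt[i1:i2]), " ".join(pred[j1:j2])))
    return ops
```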

Ground Truth
fracture_site
displacement

Transverse fracture, mid-diaphysis ulna, 2mm displacement. AO: 22-A3.1.

Prediction (Epoch 18)
fracture_site
displacement

Transverse fracture through mid-diaphysis of ulna with 2mm lateral displacement. AO: 22-A3.1.

LOSS VISUALIZATION

TRAINING VS VALIDATION LOSS. IN REAL TIME.

Monitor convergence behavior as it happens. Both training and validation loss curves are plotted on the same axis. Overfitting detection and early stopping recommendations are computed automatically.

1.82 → 0.42

Training Loss

Starting loss of 1.82 converges to 0.42 over 20 epochs. Smooth monotonic decrease indicates stable learning rate schedule and proper data shuffling.

1.85 → 0.68

Validation Loss

Validation loss converges to 0.68 and plateaus around epoch 14. The gap between train and val loss is monitored for overfitting signals.

Auto

Overfitting Detection

Automatic divergence alert when validation loss increases for 3 consecutive epochs while training loss continues to decrease. Configurable patience parameter.

Epoch 14

Early Stopping

Recommended early stop point based on validation loss plateau detection. Saves the best checkpoint automatically and notifies when further training yields diminishing returns.
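The patience-based divergence rule described above can be sketched in a few lines; an illustrative version (the platform's detector and its smoothing may differ):

```python
def detect_divergence(train_losses, val_losses, patience=3):
    """Return the epoch index at which validation loss has risen for
    `patience` consecutive epochs while training loss kept falling;
    None if no such divergence occurs."""
    streak = 0
    for t in range(1, len(val_losses)):
        rising_val = val_losses[t] > val_losses[t - 1]
        falling_train = train_losses[t] < train_losses[t - 1]
        if rising_val and falling_train:
            streak += 1
            if streak >= patience:
                return t
        else:
            streak = 0
    return None
```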

RUN COMPARISON

A/B TEST TRAINING CONFIGURATIONS.

Compare two training runs side by side. See how different hyperparameters, data mixes, or model architectures affect every metric. Make data-driven decisions about which configuration to deploy.

Metric              Run A: LoRA r=16, lr=2e-4   Run B: Full FT, lr=5e-5
BERTScore F1        0.89                        0.91  BEST
BLEU-4              0.71  BEST                  0.68
Bounding Box F1     0.94  BEST                  0.92
ROUGE-L             0.82                        0.84  BEST
Val Loss (Final)    0.68                        0.61  BEST
Training Time       2h 14m  BEST                8h 47m
GPU Memory          24 GB  BEST                 78 GB
Trainable Params    18.4M  BEST                 7.6B

Metric Overlay

Plot the same metric from both runs on a single chart. Spot divergence points and identify which configuration converges faster.

Confusion Matrix Diff

Compare per-class confusion matrices between runs. See which classes improved and which regressed when changing hyperparameters.

Statistical Significance

Bootstrap confidence intervals on metric differences. Know whether Run B is genuinely better or if the improvement is within noise.
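A paired bootstrap over per-sample scores is one standard way to get such an interval; a minimal sketch (the platform's procedure and resample count are assumptions here):

```python
import random

def bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for mean(B) - mean(A) over paired
    per-sample metric scores. If the interval excludes 0, Run B's
    improvement is unlikely to be noise."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample sample indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_b[i] - scores_a[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```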

EXPORT & REPORTS

EXPORTABLE EVALUATION ARTIFACTS.

Download complete evaluation reports in JSON or CSV. Share results with stakeholders, feed metrics into CI/CD pipelines, or archive for audit trails. Every metric, every checkpoint, fully reproducible.

  • JSON Export

    Complete metric payload per checkpoint: all scores, per-sample predictions, confusion matrices, and loss history. Machine-readable for pipeline integration.

  • CSV Export

    Flat table format with one row per checkpoint per metric. Import directly into spreadsheets, Jupyter notebooks, or BI dashboards for custom analysis.

  • Automated Reports

    Scheduled HTML report generation at training completion. Includes summary statistics, top failure cases, metric trends, and deployment readiness score.

  • Best-Checkpoint Selection

    Configurable metric thresholds determine the recommended checkpoint. Set minimum F1, maximum val loss, or custom composite scoring functions.
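Threshold-based selection amounts to filtering then ranking; an illustrative sketch using the F1 and val-loss thresholds from the sample schema (field names are assumptions, not the SDK's API):

```python
def select_best_checkpoint(checkpoints, min_f1=0.85, max_val_loss=0.75):
    """Pick the checkpoint with the highest bounding-box F1 among those
    that clear the thresholds; None if no checkpoint qualifies.
    `checkpoints` is a list of dicts with 'epoch', 'bbox_f1', 'val_loss'."""
    eligible = [c for c in checkpoints
                if c["bbox_f1"] >= min_f1 and c["val_loss"] <= max_val_loss]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["bbox_f1"])
```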

Export Schema

{
  "run_id": "run_47_cosmos_reason_7b",
  "best_checkpoint": 18,
  "selection_metric": "bertscore_f1",
  "metrics": {
    "bertscore_f1": 0.89,
    "bleu_4": 0.71,
    "bbox_f1": 0.94,
    "val_loss": 0.68
  },
  "thresholds": {
    "min_f1": 0.85,
    "max_val_loss": 0.75
  }
}

CI/CD Integration

Evaluation reports are accessible via the Vi API. Integrate metric checks into your deployment pipeline. Gate production releases on minimum metric thresholds.
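A minimal gate consuming the JSON export above might look like the following. The field names follow the sample schema; how the report is fetched (Vi API versus a downloaded file) is left out, and the thresholds are the sample values, not defaults of any real CLI:

```python
import json
import sys

def gate(report, min_f1=0.85, max_val_loss=0.75):
    """Return 0 (release allowed) or 1 (release blocked) based on
    whether the exported metrics clear the configured thresholds."""
    m = report["metrics"]
    ok = m["bbox_f1"] >= min_f1 and m["val_loss"] <= max_val_loss
    return 0 if ok else 1

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python gate.py report.json  (exit code gates the pipeline)
    with open(sys.argv[1]) as f:
        sys.exit(gate(json.load(f)))
```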

GitHub Actions · GitLab CI · Jenkins · REST API · Python SDK · Webhooks

Confusion Matrix

Per-class confusion matrix computed at every checkpoint. Identify which object categories or answer types the model struggles with. Exportable as both visual heatmap and raw CSV.

KNOW WHEN YOUR MODEL
IS PRODUCTION READY.

Comprehensive evaluation suite included on every plan. Start free today.