PLATFORM // EVALUATE

COMPREHENSIVE VLM
EVALUATION AT EVERY CHECKPOINT.

Track loss curves, measure generation quality with 10+ metrics, scrub through checkpoints to watch predictions improve, and compare training runs side by side. Know exactly when your model is ready.

[Dashboard preview] Datature / Cosmos-Reason2 8B / Evaluation — Run #47, Training Complete. Epoch 18/20 on 4× NVIDIA A100. Total loss chart: train 0.42, val 0.68.

METRICS SUITE

10+ METRICS FOR VLM GENERATION QUALITY.

Every metric is computed automatically at each checkpoint. Track text generation quality, spatial accuracy, and semantic alignment in a single dashboard.

BERTScore F1

0.89

Contextual embedding similarity between generated and reference text. Captures semantic meaning beyond exact word overlap. Computed using a frozen BERT encoder.

BLEU-4

0.71

N-gram precision with brevity penalty. Measures how many 4-gram sequences in the generated output match the reference. Standard metric for machine translation and captioning.
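As an illustration of the mechanics, a minimal sentence-level BLEU can be sketched in pure Python. This is a simplified sketch, not the platform's implementation: production scorers add smoothing and compute corpus-level statistics, and the `bleu` and `ngrams` names here are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * geo_mean
```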

CIDEr

1.42

Consensus-based image description evaluation. Weights n-grams by TF-IDF to reward informative, discriminative descriptions over generic ones.

ROUGE-L

0.82

Longest common subsequence between prediction and reference. Measures recall-oriented fluency. Captures sentence-level structure without requiring consecutive matches.
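The LCS-based F1 described above fits in a few lines of pure Python; a hedged sketch (the platform's scorer may apply tokenization and stemming this example omits):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(prediction, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```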

CLIPScore

0.78

Vision-language alignment score. Measures cosine similarity between CLIP embeddings of the generated text and the input image. Evaluates grounding quality.

VQA Accuracy

0.91

Exact-match and soft-match accuracy on visual question answering tasks. Uses the VQA 2.0 evaluation protocol with 10-way human agreement normalization.
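The soft-match rule reduces to min(matches / 3, 1): an answer given by at least 3 of the 10 annotators scores full credit. A simplified sketch, omitting the official answer normalization and leave-one-annotator-out averaging of the VQA 2.0 protocol:

```python
def vqa_soft_accuracy(predicted, human_answers):
    """Simplified VQA 2.0 soft accuracy: a prediction is fully correct
    if at least 3 of the (typically 10) human annotators gave it."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)
```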

Bounding Box F1

0.94

Harmonic mean of precision and recall for spatial coordinate predictions. Matches predictions to ground truth using an IoU threshold of 0.5.

IoU (Intersection over Union)

0.86

Spatial overlap between predicted and ground truth bounding boxes. Averaged across all matched pairs. Threshold-free measure of localization accuracy.
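For reference, IoU for axis-aligned boxes in (x1, y1, x2, y2) format is a direct computation; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```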

Precision / Recall

0.96 / 0.92

Precision measures the fraction of predicted elements that are correct, penalizing false positives, for both text tokens and spatial predictions. Recall measures coverage of ground truth elements. Tracked independently per task type.

mAP (Mean Average Precision)

0.88

Area under the precision-recall curve, averaged across all classes. Evaluated at IoU thresholds from 0.5 to 0.95 in 0.05 increments (COCO-style mAP).

Perplexity

12.4

Exponentiated cross-entropy loss. Measures how confidently the model predicts the next token. Lower is better. Tracked per epoch alongside loss curves.
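The relationship is exactly the one stated: perplexity is the exponential of the mean per-token negative log-likelihood. A two-line sketch, assuming the per-token NLLs are already available:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token).
    A model that assigns each token probability 1/k has perplexity k."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```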

METEOR

0.76

Harmonic mean of unigram precision and recall with stemming, synonymy, and paraphrase matching. More linguistically aware than BLEU for caption evaluation.

CHECKPOINT SCRUBBER

WATCH PREDICTIONS IMPROVE.

Scrub through every saved checkpoint and see side-by-side ground truth versus model predictions. Observe how output quality evolves across training epochs. Identify the exact checkpoint where your model reaches production quality.

  • Side-by-Side View

    Ground truth on the left, model prediction on the right. Both text and spatial outputs are rendered in the same coordinate frame for direct comparison.

  • Epoch Scrubber

    Drag the epoch slider to jump between checkpoints. Predictions update in real time. Each checkpoint stores the full output for every validation sample.

  • Per-Sample Drill-Down

    Click any validation sample to inspect the complete prediction history across all checkpoints. Identify failure modes at the individual sample level.

  • Diff Highlighting

    Automatic text diff between prediction and ground truth. Insertions, deletions, and substitutions are color-coded for rapid visual inspection.
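The diff highlighting described above can be approximated with the standard library's `difflib`; a sketch of the word-level tagging a UI could color-code (not the platform's actual renderer):

```python
import difflib

def word_diff(ground_truth, prediction):
    """Word-level diff between ground truth and prediction, tagging each
    span as equal, insert, delete, or replace."""
    gt, pred = ground_truth.split(), prediction.split()
    matcher = difflib.SequenceMatcher(a=gt, b=pred)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ops.append((tag, " ".join(gt[i1:i2]), " ".join(pred[j1:j2])))
    return ops
```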

Ground Truth
fracture_site
displacement

Transverse fracture, mid-diaphysis ulna, 2mm displacement. AO: 22-A3.1.

Prediction (Epoch 18)
fracture_site
displacement

Transverse fracture through mid-diaphysis of ulna with 2mm lateral displacement. AO: 22-A3.1.

LOSS VISUALIZATION

TRAINING VS VALIDATION LOSS. IN REAL TIME.

Monitor convergence behavior as it happens. Both training and validation loss curves are plotted on the same axis. Overfitting detection and early stopping recommendations are computed automatically.

1.82 → 0.42

Training Loss

Starting loss of 1.82 converges to 0.42 over 20 epochs. Smooth monotonic decrease indicates stable learning rate schedule and proper data shuffling.

1.85 → 0.68

Validation Loss

Validation loss converges to 0.68 and plateaus around epoch 14. The gap between train and val loss is monitored for overfitting signals.

Auto

Overfitting Detection

Automatic divergence alert when validation loss increases for 3 consecutive epochs while training loss continues to decrease. Configurable patience parameter.

Epoch 14

Early Stopping

Recommended early stop point based on validation loss plateau detection. Saves the best checkpoint automatically and notifies when further training yields diminishing returns.
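The patience-based divergence rule described above can be sketched in a few lines; an illustrative version (the platform's detector and its smoothing may differ):

```python
def detect_divergence(train_losses, val_losses, patience=3):
    """Return the epoch index at which validation loss has risen for
    `patience` consecutive epochs while training loss kept falling;
    None if no such divergence occurs."""
    streak = 0
    for t in range(1, len(val_losses)):
        rising_val = val_losses[t] > val_losses[t - 1]
        falling_train = train_losses[t] < train_losses[t - 1]
        if rising_val and falling_train:
            streak += 1
            if streak >= patience:
                return t
        else:
            streak = 0
    return None
```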

RUN COMPARISON

A/B TEST TRAINING CONFIGURATIONS.

Compare two training runs side by side. See how different hyperparameters, data mixes, or model architectures affect every metric. Make data-driven decisions about which configuration to deploy.

Metric              Run A: LoRA r=16, lr=2e-4   Run B: Full FT, lr=5e-5
BERTScore F1        0.89                        0.91  BEST
BLEU-4              0.71  BEST                  0.68
Bounding Box F1     0.94  BEST                  0.92
ROUGE-L             0.82                        0.84  BEST
Val Loss (Final)    0.68                        0.61  BEST
Training Time       2h 14m  BEST                8h 47m
GPU Memory          24 GB  BEST                 78 GB
Trainable Params    18.4M  BEST                 7.6B

Metric Overlay

Plot the same metric from both runs on a single chart. Spot divergence points and identify which configuration converges faster.

Confusion Matrix Diff

Compare per-class confusion matrices between runs. See which classes improved and which regressed when changing hyperparameters.

Statistical Significance

Bootstrap confidence intervals on metric differences. Know whether Run B is genuinely better or if the improvement is within noise.
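A paired bootstrap over per-sample scores is one standard way to get such an interval; a minimal sketch (the platform's procedure and resample count are assumptions here):

```python
import random

def bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for mean(B) - mean(A) over paired
    per-sample metric scores. If the interval excludes 0, Run B's
    improvement is unlikely to be noise."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample sample indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_b[i] - scores_a[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```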

EXPORT & REPORTS

EXPORTABLE EVALUATION ARTIFACTS.

Download complete evaluation reports in JSON or CSV. Share results with stakeholders, feed metrics into CI/CD pipelines, or archive for audit trails. Every metric, every checkpoint, fully reproducible.

  • JSON Export

    Complete metric payload per checkpoint: all scores, per-sample predictions, confusion matrices, and loss history. Machine-readable for pipeline integration.

  • CSV Export

    Flat table format with one row per checkpoint per metric. Import directly into spreadsheets, Jupyter notebooks, or BI dashboards for custom analysis.

  • Automated Reports

    Scheduled HTML report generation at training completion. Includes summary statistics, top failure cases, metric trends, and deployment readiness score.

  • Best-Checkpoint Selection

    Configurable metric thresholds determine the recommended checkpoint. Set minimum F1, maximum val loss, or custom composite scoring functions.
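Threshold-based selection amounts to filtering then ranking; an illustrative sketch using the F1 and val-loss thresholds from the sample schema (field names are assumptions, not the SDK's API):

```python
def select_best_checkpoint(checkpoints, min_f1=0.85, max_val_loss=0.75):
    """Pick the checkpoint with the highest bounding-box F1 among those
    that clear the thresholds; None if no checkpoint qualifies.
    `checkpoints` is a list of dicts with 'epoch', 'bbox_f1', 'val_loss'."""
    eligible = [c for c in checkpoints
                if c["bbox_f1"] >= min_f1 and c["val_loss"] <= max_val_loss]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["bbox_f1"])
```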

Export Schema

{
  "run_id": "run_47_cosmos_reason_7b",
  "best_checkpoint": 18,
  "selection_metric": "bertscore_f1",
  "metrics": {
    "bertscore_f1": 0.89,
    "bleu_4": 0.71,
    "bbox_f1": 0.94,
    "val_loss": 0.68
  },
  "thresholds": {
    "min_f1": 0.85,
    "max_val_loss": 0.75
  }
}

CI/CD Integration

Evaluation reports are accessible via the Vi API. Integrate metric checks into your deployment pipeline. Gate production releases on minimum metric thresholds.
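A minimal gate consuming the JSON export above might look like the following. The field names follow the sample schema; how the report is fetched (Vi API versus a downloaded file) is left out, and the thresholds are the sample values, not defaults of any real CLI:

```python
import json
import sys

def gate(report, min_f1=0.85, max_val_loss=0.75):
    """Return 0 (release allowed) or 1 (release blocked) based on
    whether the exported metrics clear the configured thresholds."""
    m = report["metrics"]
    ok = m["bbox_f1"] >= min_f1 and m["val_loss"] <= max_val_loss
    return 0 if ok else 1

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python gate.py report.json  (exit code gates the pipeline)
    with open(sys.argv[1]) as f:
        sys.exit(gate(json.load(f)))
```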

GitHub Actions · GitLab CI · Jenkins · REST API · Python SDK · Webhooks

Confusion Matrix

Per-class confusion matrix computed at every checkpoint. Identify which object categories or answer types the model struggles with. Exportable as both visual heatmap and raw CSV.

KNOW WHEN YOUR MODEL
IS PRODUCTION READY.

Comprehensive evaluation suite included on every plan. Start free today.