PLATFORM // EVALUATE
Track loss curves, measure generation quality with 10+ metrics, scrub through checkpoints to watch predictions improve, and compare training runs side by side. Know exactly when your model is ready.
METRICS SUITE
Every metric is computed automatically at each checkpoint. Track text generation quality, spatial accuracy, and semantic alignment in a single dashboard.
BERTScore
Contextual embedding similarity between generated and reference text. Captures semantic meaning beyond exact word overlap. Computed using a frozen BERT encoder.
BLEU-4
N-gram precision with brevity penalty. Measures how well n-grams up to length 4 in the generated output match the reference. Standard metric for machine translation and captioning.
CIDEr
Consensus-based image description evaluation. Weights n-grams by TF-IDF to reward informative, discriminative descriptions over generic ones.
ROUGE-L
Longest common subsequence between prediction and reference. Measures recall-oriented fluency. Captures sentence-level structure without requiring consecutive matches.
CLIPScore
Vision-language alignment score. Measures cosine similarity between CLIP embeddings of the generated text and the input image. Evaluates grounding quality.
VQA Accuracy
Exact-match and soft-match accuracy on visual question answering tasks. Uses the VQA 2.0 evaluation protocol with 10-way human agreement normalization.
Bounding-Box F1
Harmonic mean of precision and recall for spatial coordinate predictions. Matches predictions to ground truth using an IoU threshold of 0.5.
Mean IoU
Spatial overlap between predicted and ground truth bounding boxes. Averaged across all matched pairs. Threshold-free measure of localization accuracy.
Precision & Recall
Precision measures the fraction of predicted text tokens and spatial outputs that are correct. Recall measures coverage of ground truth elements. Tracked independently per task type.
mAP
Area under the precision-recall curve, averaged across all classes. Evaluated at IoU thresholds from 0.5 to 0.95 in 0.05 increments (COCO-style mAP).
Perplexity
Exponentiated cross-entropy loss. Measures how confidently the model predicts the next token. Lower is better. Tracked per epoch alongside loss curves.
METEOR
Harmonic mean of unigram precision and recall with stemming, synonymy, and paraphrase matching. More linguistically aware than BLEU for caption evaluation.
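To make these definitions concrete, here is a minimal Python sketch of three of them: BLEU-4 via the sacrebleu library, perplexity from mean cross-entropy loss, and bounding-box F1 with greedy IoU matching at a 0.5 threshold. The function names and the sacrebleu dependency are illustrative assumptions, not the platform's internal implementation.

# Illustrative sketch only; not the platform's internal implementation.
import math
import sacrebleu  # assumed dependency: pip install sacrebleu

def bleu_4(hypotheses, references):
    """Corpus BLEU over n-grams up to length 4, with brevity penalty."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

def perplexity(mean_cross_entropy_loss):
    """Exponentiated cross-entropy: lower is better."""
    return math.exp(mean_cross_entropy_loss)

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def bbox_f1(predictions, ground_truths, threshold=0.5):
    """Greedily match predictions to ground truth at IoU >= threshold,
    then return the harmonic mean of precision and recall."""
    matched, tp = set(), 0
    for pred in predictions:
        for i, gt in enumerate(ground_truths):
            if i not in matched and iou(pred, gt) >= threshold:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(ground_truths) if ground_truths else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0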
CHECKPOINT SCRUBBER
Scrub through every saved checkpoint and see side-by-side ground truth versus model predictions. Observe how output quality evolves across training epochs. Identify the exact checkpoint where your model reaches production quality.
Side-by-Side View
Ground truth on the left, model prediction on the right. Both text and spatial outputs are rendered in the same coordinate frame for direct comparison.
Epoch Scrubber
Drag the epoch slider to jump between checkpoints. Predictions update in real time. Each checkpoint stores the full output for every validation sample.
Per-Sample Drill-Down
Click any validation sample to inspect the complete prediction history across all checkpoints. Identify failure modes at the individual sample level.
Diff Highlighting
Automatic text diff between prediction and ground truth. Insertions, deletions, and substitutions are color-coded for rapid visual inspection.
Example: ground truth versus prediction for a radiology caption.
Transverse fracture, mid-diaphysis ulna, 2mm displacement. AO: 22-A3.1.
Transverse fracture through mid-diaphysis of ulna with 2mm lateral displacement. AO: 22-A3.1.
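Treating the first line as ground truth and the second as the prediction, a word-level diff of this pair can be computed with Python's standard difflib. This is an illustrative sketch of the idea, not the platform's renderer.

# Illustrative word-level diff, using only the standard library.
from difflib import SequenceMatcher

ground_truth = "Transverse fracture, mid-diaphysis ulna, 2mm displacement. AO: 22-A3.1."
prediction = "Transverse fracture through mid-diaphysis of ulna with 2mm lateral displacement. AO: 22-A3.1."

gt_tokens, pred_tokens = ground_truth.split(), prediction.split()
matcher = SequenceMatcher(a=gt_tokens, b=pred_tokens)

for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op == "equal":
        continue  # only report insertions, deletions, substitutions
    print(op, "gt:", gt_tokens[i1:i2], "-> pred:", pred_tokens[j1:j2])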
LOSS VISUALIZATION
Monitor convergence behavior as it happens. Both training and validation loss curves are plotted on the same axis. Overfitting detection and early stopping recommendations are computed automatically.
Training Loss
Example run: loss falls from 1.82 to 0.42 over 20 epochs. A smooth monotonic decrease indicates a stable learning rate schedule and proper data shuffling.
Validation Loss
Validation loss converges to 0.68 and plateaus around epoch 14. The gap between train and val loss is monitored for overfitting signals.
Overfitting Detection
Automatic divergence alert when validation loss increases for 3 consecutive epochs while training loss continues to decrease. Configurable patience parameter.
Early Stopping
Recommended early stop point based on validation loss plateau detection. Saves the best checkpoint automatically and notifies when further training yields diminishing returns.
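Both checks reduce to simple comparisons over the loss history. Here is a minimal sketch, with illustrative function names and the configurable patience parameter described above; it is not the platform's detection code.

# Illustrative sketch of the two checks described above.

def is_diverging(train_loss, val_loss, patience=3):
    """Alert when validation loss has risen for `patience` consecutive
    epochs while training loss kept falling over the same window."""
    if len(val_loss) < patience + 1:
        return False
    window = range(-patience, 0)
    val_rising = all(val_loss[i] > val_loss[i - 1] for i in window)
    train_falling = all(train_loss[i] < train_loss[i - 1] for i in window)
    return val_rising and train_falling

def has_plateaued(val_loss, patience=3, min_delta=1e-3):
    """Recommend early stopping when validation loss has not improved
    by at least `min_delta` for `patience` consecutive epochs."""
    if len(val_loss) < patience + 1:
        return False
    best_before = min(val_loss[:-patience])
    return min(val_loss[-patience:]) > best_before - min_delta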
RUN COMPARISON
Compare two training runs side by side. See how different hyperparameters, data mixes, or model architectures affect every metric. Make data-driven decisions about which configuration to deploy.
Plot the same metric from both runs on a single chart. Spot divergence points and identify which configuration converges faster.
Compare per-class confusion matrices between runs. See which classes improved and which regressed when changing hyperparameters.
Bootstrap confidence intervals on metric differences. Know whether Run B is genuinely better or if the improvement is within noise.
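One common approach, shown here as an illustrative sketch rather than the platform's exact method, is a paired bootstrap over per-sample scores: resample the validation set with replacement and examine the distribution of the metric difference. It assumes both runs were scored on the same samples and uses numpy.

# Illustrative paired bootstrap; assumes aligned per-sample scores
# from the same validation set for both runs.
import numpy as np

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """95% CI (by default) on mean(scores_b) - mean(scores_a)."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    assert len(a) == len(b), "paired bootstrap needs aligned samples"
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(a), size=(n_resamples, len(a)))
    diffs = b[idx].mean(axis=1) - a[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# If the interval excludes zero, Run B's improvement is unlikely to be noise.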
EXPORT & REPORTS
Download complete evaluation reports in JSON or CSV. Share results with stakeholders, feed metrics into CI/CD pipelines, or archive for audit trails. Every metric, every checkpoint, fully reproducible.
JSON Export
Complete metric payload per checkpoint: all scores, per-sample predictions, confusion matrices, and loss history. Machine-readable for pipeline integration.
CSV Export
Flat table format with one row per checkpoint per metric. Import directly into spreadsheets, Jupyter notebooks, or BI dashboards for custom analysis.
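Assuming the flat layout described above, with illustrative column names run_id, checkpoint, metric, and value, the export loads straight into pandas and pivots into a per-checkpoint table:

# Illustrative loading of the flat CSV export; file and column names assumed.
import pandas as pd

df = pd.read_csv("run_47_metrics.csv")  # columns: run_id, checkpoint, metric, value
wide = df.pivot_table(index="checkpoint", columns="metric", values="value")
print(wide[["bertscore_f1", "val_loss"]].tail())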
Automated Reports
Scheduled HTML report generation at training completion. Includes summary statistics, top failure cases, metric trends, and deployment readiness score.
Best-Checkpoint Selection
Configurable metric thresholds determine the recommended checkpoint. Set minimum F1, maximum val loss, or custom composite scoring functions.
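A composite selector can be expressed as hard thresholds plus a scoring function. A minimal sketch over per-checkpoint metric dicts, using field names from the export schema below; the weights and the interpretation of min_f1 as applying to bertscore_f1 are illustrative assumptions:

# Illustrative best-checkpoint selection: hard thresholds plus a
# custom composite score. Metric names follow the export schema.

def select_best(checkpoints, min_f1=0.85, max_val_loss=0.75):
    """checkpoints: list of dicts like
    {"epoch": 18, "bertscore_f1": 0.89, "bbox_f1": 0.94, "val_loss": 0.68}"""
    def composite(m):
        return 0.5 * m["bertscore_f1"] + 0.5 * m["bbox_f1"]  # example weights

    eligible = [
        m for m in checkpoints
        if m["bertscore_f1"] >= min_f1 and m["val_loss"] <= max_val_loss
    ]
    return max(eligible, key=composite) if eligible else None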
Export Schema
{
  "run_id": "run_47_cosmos_reason_7b",
  "best_checkpoint": 18,
  "selection_metric": "bertscore_f1",
  "metrics": {
    "bertscore_f1": 0.89,
    "bleu_4": 0.71,
    "bbox_f1": 0.94,
    "val_loss": 0.68
  },
  "thresholds": {
    "min_f1": 0.85,
    "max_val_loss": 0.75
  }
}
Evaluation reports are accessible via the Vi API. Integrate metric checks into your deployment pipeline. Gate production releases on minimum metric thresholds.
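For example, a CI step can fail the build when thresholds are not met. The sketch below reads a downloaded JSON report (the file name is hypothetical) rather than calling the Vi API directly, since endpoint details are not shown here.

# Illustrative CI gate over a downloaded evaluation report (schema above).
import json
import sys

with open("run_47_report.json") as f:  # hypothetical downloaded report
    report = json.load(f)

metrics, thresholds = report["metrics"], report["thresholds"]

if metrics["bertscore_f1"] < thresholds["min_f1"]:
    sys.exit("release blocked: bertscore_f1 below minimum threshold")
if metrics["val_loss"] > thresholds["max_val_loss"]:
    sys.exit("release blocked: validation loss above maximum threshold")

print(f"checkpoint {report['best_checkpoint']} cleared for release")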
Confusion Matrix
Per-class confusion matrix computed at every checkpoint. Identify which object categories or answer types the model struggles with. Exportable as both visual heatmap and raw CSV.
Comprehensive evaluation suite included on every plan. Start free today.