PLATFORM // ANNOTATE
Purpose-built annotation tooling for VLM supervised fine-tuning. Generate phrase-grounded captions, visual Q&A pairs, chain-of-thought reasoning labels, and vision-language-action sequences at scale.
ANNOTATION SCHEMA
Each mode maps directly to a VLM training objective. The output schema is compatible with Qwen-VL, InternVL, Cosmos-Reason, and PaliGemma instruction-tuning formats.
Map natural language spans to spatial coordinates. Each grounded phrase produces a bounding box with normalized [x1, y1, x2, y2] coordinates linked to a substring in the caption.
{"caption": "a red valve on the upper pipe", "grounding": [{"phrase": "red valve", "bbox": [0.15, 0.12, 0.45, 0.47]}]}
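A record in this shape can be linted before export; a minimal sketch (the function name and rules are illustrative, not Vi's API) that checks each phrase is a substring of the caption and each box is a normalized corner pair:

```python
def validate_grounding(record):
    """Check a phrase-grounding record: every phrase must occur in the
    caption, and every bbox must be normalized [x1, y1, x2, y2]."""
    errors = []
    caption = record["caption"]
    for g in record["grounding"]:
        if g["phrase"] not in caption:
            errors.append(f"ungrounded phrase: {g['phrase']!r}")
        x1, y1, x2, y2 = g["bbox"]
        if not (0 <= x1 < x2 <= 1 and 0 <= y1 < y2 <= 1):
            errors.append(f"bad bbox for {g['phrase']!r}: {g['bbox']}")
    return errors

record = {"caption": "a red valve on the upper pipe",
          "grounding": [{"phrase": "red valve",
                         "bbox": [0.15, 0.12, 0.45, 0.47]}]}
print(validate_grounding(record))  # → []
```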
Structured question-answer pairs per image. Supports open-ended, multiple-choice, and extractive answer types. Vi validates answer grounding against visible regions.
{"question": "What fracture classification applies?", "answer": "AO 22-A3.1", "evidence_bbox": [0.25, 0.38, 0.55, 0.58]}
Multi-step reasoning annotations with <reasoning> tags. Each step is a discrete observation leading to a conclusion. Critical for training models that explain their predictions.
<reasoning>cortical discontinuity at mid-shaft → transverse pattern → 2mm displacement</reasoning>
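A reasoning annotation like the one above decomposes into discrete steps; a small sketch, assuming `→` is the step delimiter inside the `<reasoning>` tag:

```python
import re

def parse_reasoning(text):
    """Split a <reasoning> block into discrete observation steps.
    The arrow delimiter is an assumption based on the sample above."""
    m = re.search(r"<reasoning>(.*?)</reasoning>", text, re.S)
    if not m:
        return []
    return [step.strip() for step in m.group(1).split("→")]

steps = parse_reasoning("<reasoning>cortical discontinuity at mid-shaft → "
                        "transverse pattern → 2mm displacement</reasoning>")
print(steps)
```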
Dense image descriptions without spatial constraints. Produces paragraph-level captions covering all visible elements. Used for training general visual understanding.
"Lateral radiograph of the left forearm showing intact cortical outlines of the radius with..."
For embodied AI and robotics pipelines. Annotate grasp affordances, spatial waypoints, and action primitives. Output schema is compatible with VLA model architectures like RT-2 and Octo.
{"action": "PICK", "target": "tube_A", "grasp_point": [0.42, 0.65], "approach_vector": [0, 0, -1]}
AI-assisted pre-annotation using the base VLM. Generates draft captions, phrase highlights, and bounding box proposals. Human annotators review and correct rather than create from scratch.
Reduces annotation time from ~4 min/image to ~45 sec/image. Configurable confidence threshold for auto-accept.
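The auto-accept threshold amounts to a simple triage over model proposals; a sketch (field and function names are assumptions, not Vi's API):

```python
def triage(proposals, auto_accept=0.9):
    """Split pre-annotation proposals into auto-accepted drafts and
    ones routed to human review, by confidence score."""
    accepted = [p for p in proposals if p["confidence"] >= auto_accept]
    review = [p for p in proposals if p["confidence"] < auto_accept]
    return accepted, review
```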
DATA PIPELINE
Vi connects directly to your Amazon S3, Azure Blob Storage, or Google Cloud Storage buckets. Images stay in your infrastructure with zero duplication. The built-in explorer gives you filtering, progress tracking, configurable splits, and version control.
Amazon S3
IAM role-based access. S3-compatible endpoints including MinIO.
Azure Blob
SAS token auth. Hot and cool storage tiers.
Google Cloud
Service account or workload identity federation.
Formats
JPEG, PNG, TIFF, DICOM, NIfTI, MP4, AVI.
Progress Tracking
Per-asset status with color-coded indicators.
Smart Filtering
Filter by status, label count, annotator, or tags.
Train/Val/Test Split
Configurable ratios with stratified sampling.
Export
COCO JSON, Pascal VOC, YOLO TXT, custom schema.
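The stratified train/val/test split from the list above can be sketched as follows (the function is illustrative, not Vi's implementation; the test split takes whatever remains after train and val):

```python
import random
from collections import defaultdict

def stratified_split(items, key, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split items into train/val/test, preserving per-class proportions.
    `key` extracts the stratification label from each item."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item in items:
        by_class[key(item)].append(item)
    splits = {"train": [], "val": [], "test": []}
    for group in by_class.values():
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        splits["train"] += group[:n_train]
        splits["val"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]  # remainder
    return splits
```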
OUTPUT FORMAT
Every annotation produces a structured JSON object compatible with standard VLM instruction-tuning formats. No post-processing scripts needed between annotation and training.
Annotation Output Schema
{
  "image_id": "xray_014.dcm",
  "annotation_type": "phrase_grounding",
  "caption": "Transverse fracture through...",
  "grounding": [{
    "phrase": "fracture site",
    "bbox": [0.25, 0.38, 0.73, 0.58],
    "confidence": 0.94
  }],
  "reasoning": "Cortical discontinuity...",
  "metadata": {
    "annotator": "dr.chen",
    "reviewed": true,
    "model_assisted": true
  }
}
Output schema maps directly to instruction-tuning conversation format. No transformation scripts required between Vi export and training launch.
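One way such a mapping can look — the exact conversation schema differs per model family, and the `<ref>`/`<box>` markup here follows the Qwen-VL convention only as an illustration:

```python
def to_conversation(record):
    """Turn an annotation record (fields as in the sample schema above)
    into a generic instruction-tuning message pair. The message layout
    and grounding markup are illustrative assumptions."""
    grounded = " ".join(
        f"<ref>{g['phrase']}</ref><box>{g['bbox']}</box>"
        for g in record.get("grounding", []))
    return [
        {"role": "user",
         "content": f"<image>{record['image_id']}</image> Describe the image."},
        {"role": "assistant",
         "content": record["caption"] + (" " + grounded if grounded else "")},
    ]
```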
Normalized [0,1] coordinates by default. Export supports absolute pixel coordinates, COCO-format [x, y, width, height], or Pascal VOC [xmin, ymin, xmax, ymax]. Coordinate system is configurable per-project.
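The conversions between these coordinate conventions are mechanical; a sketch with illustrative helper names:

```python
def to_pixels(bbox, width, height):
    """Normalized [x1, y1, x2, y2] -> absolute pixel corners (Pascal VOC order)."""
    x1, y1, x2, y2 = bbox
    return [x1 * width, y1 * height, x2 * width, y2 * height]

def to_coco(bbox, width, height):
    """Normalized corners -> COCO [x, y, width, height] in pixels."""
    x1, y1, x2, y2 = to_pixels(bbox, width, height)
    return [x1, y1, x2 - x1, y2 - y1]

print(to_coco([0.25, 0.38, 0.73, 0.58], 1024, 1024))
```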
Built-in validation checks: overlapping bounding boxes, empty captions, ungrounded phrases, orphaned coordinates. Annotation review workflow with approve/reject/edit states.
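The overlapping-box check reduces to pairwise IoU; a sketch (the 0.5 threshold is an assumed default, not Vi's):

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def overlapping_pairs(boxes, threshold=0.5):
    """Flag index pairs of boxes whose IoU exceeds the threshold."""
    return [(i, j) for i in range(len(boxes))
            for j in range(i + 1, len(boxes))
            if iou(boxes[i], boxes[j]) > threshold]
```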
AI-ASSISTED LABELING
IntelliScribe uses the base VLM (or your previously fine-tuned checkpoint) to pre-annotate assets. Human annotators review, correct, and approve rather than creating from scratch. The corrected annotations then feed back into the next fine-tuning iteration.
Faster Annotation
Average time reduction from ~4 minutes per image to under 45 seconds with IntelliScribe pre-annotation enabled.
Pre-annotation Accuracy
Typical BERTScore F1 of IntelliScribe draft captions after 2 fine-tuning iterations. Configurable acceptance threshold.
Samples to Start
Minimum annotated samples needed to activate IntelliScribe for targeted, domain-specific tasks. Quality improves with each iteration.
Keyboard Shortcuts
Press C to auto-caption the current asset. Press P to auto-highlight phrase spans. Tab to cycle through suggestions.
TEAM WORKFLOWS
Production annotation pipelines with role-based access, consensus protocols, and automated quality gates. Every stage is configurable per project.
Project owners, annotators, reviewers, and read-only viewers. Granular permissions per project. SSO via SAML 2.0 for enterprise.
Distribute annotation batches to specific team members. Priority queues for urgent assets. Configurable daily limits per annotator.
Annotate, then review. Reviewers approve, reject with comments, or edit directly. Rejected assets route back to the annotator queue.
Automatic IAA across overlapping assignments: Cohen's kappa for classification, IoU for spatial annotations, BERTScore for captions.
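Of the agreement metrics above, Cohen's kappa is the simplest to compute from two annotators' labels on the same assets; a self-contained sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' classification labels:
    observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

print(cohens_kappa(["fracture", "normal", "fracture", "normal"],
                   ["fracture", "normal", "normal", "normal"]))  # → 0.5
```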
Full annotation history per asset. Track annotator, timestamp, diff, and approval status. Export audit logs for regulatory documentation.
Configure quality gates that automatically trigger fine-tuning when annotation targets are met. Zero manual handoff between teams.
3,000 data rows free. All annotation modes included. No credit card required.