PLATFORM // ANNOTATE
Purpose-built annotation tooling for VLM supervised fine-tuning. Generate phrase-grounded captions, visual Q&A pairs, chain-of-thought reasoning labels, and vision-language-action sequences at scale.
ANNOTATION SCHEMA
Each mode maps directly to a VLM training objective. The output schema is compatible with Qwen-VL, InternVL, Cosmos-Reason, and PaliGemma instruction-tuning formats.
Map natural language spans to spatial coordinates. Each grounded phrase produces a bounding box with normalized [x1, y1, x2, y2] coordinates linked to a substring in the caption.
{"caption": "a red valve on the upper pipe", "grounding": [{"phrase": "red valve", "bbox": [0.15, 0.12, 0.45, 0.47]}]}
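A record in this shape can be linted before export; a minimal sketch (the function name and rules are illustrative, not Vi's API) that checks each phrase is a substring of the caption and each box is a normalized corner pair:

```python
def validate_grounding(record):
    """Check a phrase-grounding record: every phrase must occur in the
    caption, and every bbox must be normalized [x1, y1, x2, y2]."""
    errors = []
    caption = record["caption"]
    for g in record["grounding"]:
        if g["phrase"] not in caption:
            errors.append(f"ungrounded phrase: {g['phrase']!r}")
        x1, y1, x2, y2 = g["bbox"]
        if not (0 <= x1 < x2 <= 1 and 0 <= y1 < y2 <= 1):
            errors.append(f"bad bbox for {g['phrase']!r}: {g['bbox']}")
    return errors

record = {"caption": "a red valve on the upper pipe",
          "grounding": [{"phrase": "red valve",
                         "bbox": [0.15, 0.12, 0.45, 0.47]}]}
print(validate_grounding(record))  # → []
```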
Structured question-answer pairs per image. Supports open-ended, multiple-choice, and extractive answer types. Vi validates answer grounding against visible regions.
{"question": "What fracture classification applies?", "answer": "AO 22-A3.1", "evidence_bbox": [0.25, 0.38, 0.55, 0.58]}
Multi-step reasoning annotations with <reasoning> tags. Each step is a discrete observation leading to a conclusion. Critical for training models that explain their predictions.
<reasoning>cortical discontinuity at mid-shaft → transverse pattern → 2mm displacement</reasoning>
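A reasoning annotation like the one above decomposes into discrete steps; a small sketch, assuming `→` is the step delimiter inside the `<reasoning>` tag:

```python
import re

def parse_reasoning(text):
    """Split a <reasoning> block into discrete observation steps.
    The arrow delimiter is an assumption based on the sample above."""
    m = re.search(r"<reasoning>(.*?)</reasoning>", text, re.S)
    if not m:
        return []
    return [step.strip() for step in m.group(1).split("→")]

steps = parse_reasoning("<reasoning>cortical discontinuity at mid-shaft → "
                        "transverse pattern → 2mm displacement</reasoning>")
print(steps)
```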
Dense image descriptions without spatial constraints. Produces paragraph-level captions covering all visible elements. Used for training general visual understanding.
"Lateral radiograph of the left forearm showing intact cortical outlines of the radius with..."
For embodied AI and robotics pipelines. Annotate grasp affordances, spatial waypoints, and action primitives. Output schema is compatible with VLA model architectures like RT-2 and Octo.
{"action": "PICK", "target": "tube_A", "grasp_point": [0.42, 0.65], "approach_vector": [0, 0, -1]}
AI-assisted pre-annotation using the base VLM. Generates draft captions, phrase highlights, and bounding box proposals. Human annotators review and correct rather than create from scratch.
Reduces annotation time from ~4 min/image to ~45 sec/image. Configurable confidence threshold for auto-accept.
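The auto-accept threshold amounts to a simple triage over model proposals; a sketch (field and function names are assumptions, not Vi's API):

```python
def triage(proposals, auto_accept=0.9):
    """Split pre-annotation proposals into auto-accepted drafts and
    ones routed to human review, by confidence score."""
    accepted = [p for p in proposals if p["confidence"] >= auto_accept]
    review = [p for p in proposals if p["confidence"] < auto_accept]
    return accepted, review
```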
DATA PIPELINE
Vi connects directly to your Amazon S3, Azure Blob Storage, or Google Cloud Storage buckets. Images stay in your infrastructure with zero duplication. The built-in explorer gives you filtering, progress tracking, configurable splits, and version control.
Amazon S3
IAM role-based access. S3-compatible endpoints including MinIO.
Azure Blob
SAS token auth. Hot and cool storage tiers.
Google Cloud
Service account or workload identity federation.
Formats
JPEG, PNG, TIFF, DICOM, NIfTI, MP4, AVI.
Progress Tracking
Per-asset status with color-coded indicators.
Smart Filtering
Filter by status, label count, annotator, or tags.
Train/Val/Test Split
Configurable ratios with stratified sampling.
Export
COCO JSON, Pascal VOC, YOLO TXT, custom schema.
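The stratified train/val/test split from the list above can be sketched as follows (the function is illustrative, not Vi's implementation; the test split takes whatever remains after train and val):

```python
import random
from collections import defaultdict

def stratified_split(items, key, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split items into train/val/test, preserving per-class proportions.
    `key` extracts the stratification label from each item."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item in items:
        by_class[key(item)].append(item)
    splits = {"train": [], "val": [], "test": []}
    for group in by_class.values():
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        splits["train"] += group[:n_train]
        splits["val"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]  # remainder
    return splits
```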
OUTPUT FORMAT
Every annotation produces a structured JSON object compatible with standard VLM instruction-tuning formats. No post-processing scripts needed between annotation and training.
Annotation Output Schema
{
  "image_id": "xray_014.dcm",
  "annotation_type": "phrase_grounding",
  "caption": "Transverse fracture through...",
  "grounding": [{
    "phrase": "fracture site",
    "bbox": [0.25, 0.38, 0.73, 0.58],
    "confidence": 0.94
  }],
  "reasoning": "Cortical discontinuity...",
  "metadata": {
    "annotator": "dr.chen",
    "reviewed": true,
    "model_assisted": true
  }
}
Output schema maps directly to instruction-tuning conversation format. No transformation scripts required between Vi export and training launch.
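One way such a mapping can look — the exact conversation schema differs per model family, and the `<ref>`/`<box>` markup here follows the Qwen-VL convention only as an illustration:

```python
def to_conversation(record):
    """Turn an annotation record (fields as in the sample schema above)
    into a generic instruction-tuning message pair. The message layout
    and grounding markup are illustrative assumptions."""
    grounded = " ".join(
        f"<ref>{g['phrase']}</ref><box>{g['bbox']}</box>"
        for g in record.get("grounding", []))
    return [
        {"role": "user",
         "content": f"<image>{record['image_id']}</image> Describe the image."},
        {"role": "assistant",
         "content": record["caption"] + (" " + grounded if grounded else "")},
    ]
```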
Normalized [0,1] coordinates by default. Export supports absolute pixel coordinates, COCO-format [x, y, width, height], or Pascal VOC [xmin, ymin, xmax, ymax]. Coordinate system is configurable per-project.
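The conversions between these coordinate conventions are mechanical; a sketch with illustrative helper names:

```python
def to_pixels(bbox, width, height):
    """Normalized [x1, y1, x2, y2] -> absolute pixel corners (Pascal VOC order)."""
    x1, y1, x2, y2 = bbox
    return [x1 * width, y1 * height, x2 * width, y2 * height]

def to_coco(bbox, width, height):
    """Normalized corners -> COCO [x, y, width, height] in pixels."""
    x1, y1, x2, y2 = to_pixels(bbox, width, height)
    return [x1, y1, x2 - x1, y2 - y1]

print(to_coco([0.25, 0.38, 0.73, 0.58], 1024, 1024))
```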
Built-in validation checks: overlapping bounding boxes, empty captions, ungrounded phrases, orphaned coordinates. Annotation review workflow with approve/reject/edit states.
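The overlapping-box check reduces to pairwise IoU; a sketch (the 0.5 threshold is an assumed default, not Vi's):

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def overlapping_pairs(boxes, threshold=0.5):
    """Flag index pairs of boxes whose IoU exceeds the threshold."""
    return [(i, j) for i in range(len(boxes))
            for j in range(i + 1, len(boxes))
            if iou(boxes[i], boxes[j]) > threshold]
```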
AI-ASSISTED LABELING
IntelliScribe uses the base VLM (or your previously fine-tuned checkpoint) to pre-annotate assets. Human annotators review, correct, and approve rather than creating from scratch. The corrected annotations then feed back into the next fine-tuning iteration.
Faster Annotation
Average time reduction from ~4 minutes per image to under 45 seconds with IntelliScribe pre-annotation enabled.
Pre-annotation Accuracy
Typical BERTScore F1 of IntelliScribe draft captions after 2 fine-tuning iterations. Configurable acceptance threshold.
Samples to Start
Minimum annotated samples needed to activate IntelliScribe for targeted, domain-specific tasks. Quality improves with each iteration.
Keyboard Shortcuts
Press C to auto-caption the current asset. Press P to auto-highlight phrase spans. Tab to cycle through suggestions.
TEAM WORKFLOWS
Production annotation pipelines with role-based access, consensus protocols, and automated quality gates. Every stage is configurable per project.
Project owners, annotators, reviewers, and read-only viewers. Granular permissions per project. SSO via SAML 2.0 for enterprise.
Distribute annotation batches to specific team members. Priority queues for urgent assets. Configurable daily limits per annotator.
Annotate, then review. Reviewers approve, reject with comments, or edit directly. Rejected assets route back to the annotator queue.
Automatic IAA across overlapping assignments: Cohen's kappa for classification, IoU for spatial annotations, BERTScore for captions.
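Of the agreement metrics above, Cohen's kappa is the simplest to compute from two annotators' labels on the same assets; a self-contained sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' classification labels:
    observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

print(cohens_kappa(["fracture", "normal", "fracture", "normal"],
                   ["fracture", "normal", "normal", "normal"]))  # → 0.5
```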
Full annotation history per asset. Track annotator, timestamp, diff, and approval status. Export audit logs for regulatory documentation.
Configure quality gates that automatically trigger fine-tuning when annotation targets are met. Zero manual handoff between teams.
3,000 data rows free. All annotation modes included. No credit card required.