PLATFORM // POST-TRAIN

ALIGN YOUR VLM
WITH EXPERT PREFERENCES.

Direct Preference Optimization trains your model to prefer grounded, accurate outputs over hallucinated ones. No separate reward model. No RL loop. Just human feedback mapped directly to gradient updates.

[Preference collection interface: Datature / Project / DPO, connected; pair 847 of 2,400, with the prompt shown above side-by-side Response A and Response B]

HOW IT WORKS

FROM PREFERENCES TO ALIGNMENT IN THREE STEPS.

DPO eliminates the multi-stage complexity of traditional RLHF. Collect human preferences, run a single optimization pass, and evaluate alignment. No separate reward model training needed.

Step 01

Collect Preferences

Domain experts review side-by-side model outputs for the same prompt. They mark which response is better: more grounded, more accurate, less hallucinated. Each pair becomes a (chosen, rejected) training signal.

1K-10K preference pairs per iteration
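Each expert decision described above can be stored as a simple record; a minimal sketch in Python (field names are illustrative, not the platform's actual export schema):

```python
# A minimal (chosen, rejected) preference record, as produced by one
# expert comparison. Field names are illustrative, not a fixed schema.
pair = {
    "prompt": "How many forklifts are visible in the image?",
    "image_id": "img_000847",
    "chosen": "Two forklifts are visible, both near the loading dock.",
    "rejected": "There are five forklifts scattered across the warehouse.",
    "annotator": "expert_03",
    "confidence": "high",  # or "too_close_to_call" -> secondary review
}

def to_dpo_triple(pair):
    """Reduce a record to the (prompt, chosen, rejected) triple DPO consumes."""
    return (pair["prompt"], pair["chosen"], pair["rejected"])
```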

Step 02

Train with DPO

The DPO loss function directly optimizes the policy to increase the likelihood of chosen responses relative to rejected ones. Controlled by a beta parameter that balances alignment strength against divergence from the reference model.

Single-stage optimization, no PPO
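The single-stage loss described above can be sketched in a few lines of pure Python, assuming the summed per-response log-probabilities under the policy and the frozen reference model have already been computed (a simplified scalar version, not a training implementation):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * reward margin).

    Each argument is the summed log-probability of a full response.
    beta scales alignment strength against drift from the reference model.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Logistic loss on the reward margin; minimized when the policy
    # assigns the chosen response a much higher relative likelihood.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A lower beta lets the margin grow before the loss saturates, which is why lower beta corresponds to stronger alignment pressure and more drift from the reference.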

Step 03

Evaluate Alignment

Run the updated model on held-out prompts and measure preference accuracy against expert labels. Track reward gap convergence and KL divergence to ensure the model improves without catastrophic drift.

Target: >65% preference agreement
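The preference-accuracy check in Step 03 reduces to a comparison over held-out pairs; a minimal sketch (the scoring inputs are illustrative):

```python
def preference_accuracy(pairs):
    """Fraction of held-out pairs where the updated model scores the
    expert-chosen response above the rejected one.

    Each pair is (chosen_logp, rejected_logp): summed log-probabilities
    the model assigns to each response. Values below are illustrative.
    """
    wins = sum(1 for chosen_lp, rejected_lp in pairs if chosen_lp > rejected_lp)
    return wins / len(pairs)

held_out = [(-12.1, -15.8), (-9.4, -9.9), (-14.0, -11.2), (-8.7, -13.5)]
# 3 of the 4 chosen responses score higher than their rejected counterpart.
```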

PREFERENCE INTERFACE

SIDE-BY-SIDE EXPERT REVIEW.

Domain experts see two model responses for the same image and prompt. They select the response that is more accurate, better grounded, and less hallucinated. Each decision produces a (chosen, rejected) pair for DPO training.

  • Blind Comparison

    Responses are randomized left/right to prevent position bias. Model checkpoint identifiers are hidden from reviewers.

  • Multi-Criteria Scoring

    Optional rubric mode: rate each response on accuracy, spatial grounding, reasoning depth, and hallucination severity before selecting a winner.

  • Confidence Flagging

    Reviewers can flag pairs as 'too close to call' for secondary review. Low-confidence pairs are prioritized for additional annotator overlap.

  • Keyboard Shortcuts

    Press A to choose left, B to choose right, S to skip, F to flag. Annotators process 60-80 pairs per hour at steady state.

DATA MANAGEMENT

PREFERENCE DATASET OPS.

Manage preference pairs like any other dataset artifact. Version, filter, and export preference data for reproducible DPO training runs.

  • Pair Explorer

    Browse all collected preference pairs with filtering by prompt category, confidence score, annotator, and date range.

  • Agreement Metrics

    Track inter-annotator agreement on overlapping pairs. Cohen's Kappa and percentage agreement displayed per batch and overall.

  • Export Formats

    Export as DPO-ready JSON with chosen/rejected fields, or as raw preference triples (prompt, chosen, rejected) for custom training scripts.

  • Version Control

    Each DPO iteration snapshots the preference dataset. Compare pair distributions across iterations to track coverage drift.
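The Cohen's Kappa figure surfaced in Agreement Metrics can be computed directly from two annotators' labels on the same overlapping pairs; a minimal sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same pairs
    (e.g. 'A' = left response preferred, 'B' = right)."""
    n = len(labels_a)
    # Observed agreement: fraction of pairs where both picked the same side.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently,
    # given each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 means perfect agreement; 0.0 means agreement no better than chance, which is the signal to add annotator overlap or clarify the rubric.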

REWARD SIGNALS

IMPLICIT REWARD MODELING.

DPO uses the language model itself as an implicit reward model. The difference in log-probabilities between chosen and rejected responses defines the reward signal. No separate reward model training required.
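In the standard DPO formulation, this implicit reward is the beta-scaled log-probability ratio between the policy and the frozen reference model:

```latex
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

The reward gap reported below is the mean of this quantity over chosen responses minus the mean over rejected ones.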

0.81

Reward Gap

Average log-probability difference between chosen and rejected responses after 3 DPO iterations. Higher means stronger alignment signal.

82%

Preference Accuracy

Fraction of held-out preference pairs where the DPO-trained model assigns higher probability to the expert-chosen response.

0.25

KL Divergence

Measures how far the aligned model has drifted from the reference. Controlled by beta parameter to prevent reward hacking.

0.1

Beta Parameter

Controls the trade-off between alignment strength and model drift. Lower beta = stronger alignment. Typical range: 0.05-0.5.

Reward Distribution (Post-DPO Iteration 3)

[Histogram panels: per-response rewards for chosen vs. rejected responses, x-axis from -2.0 to +2.0]

Reward gap = mean(chosen) - mean(rejected). Wider separation indicates stronger preference signal.

ITERATIVE ALIGNMENT

MULTIPLE DPO ROUNDS. COMPOUNDING GAINS.

Each DPO iteration narrows the gap between model behavior and expert expectations. After each round, Vi automatically surfaces edge cases where the model is least confident for the next preference collection batch.

Round 1
68% preference accuracy, 2,400 pairs

Broad coverage across all prompt categories. Initial alignment pass on the most obvious failure modes.

Round 2
76% preference accuracy, 1,800 pairs

Edge case mining surfaces 1,800 low-confidence samples. Focused on spatial grounding and multi-object scenes.

Round 3
82% preference accuracy, 1,200 pairs

Targeted collection on remaining failure modes: negation handling, counting errors, and fine-grained attribute distinction.

Round 4+
85%+ preference accuracy, 800 pairs

Diminishing returns signal convergence. Shift to domain-specific edge cases or new prompt categories as needed.

Confidence-Based Mining

Automatically rank all unlabeled prompts by model uncertainty. Surface the samples where chosen vs rejected probability is closest to 50/50.
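The ranking rule above amounts to sorting by distance from a 50/50 preference probability; a minimal sketch (data and names are illustrative):

```python
def mine_uncertain(prompts_with_probs, k=3):
    """Surface the k prompts where the model's preference probability is
    closest to 50/50, i.e. where it is least certain which response wins.

    prompts_with_probs: list of (prompt_id, p_chosen), where p_chosen is
    the model's probability of preferring its own top response.
    """
    return sorted(prompts_with_probs, key=lambda item: abs(item[1] - 0.5))[:k]

candidates = [("p1", 0.97), ("p2", 0.52), ("p3", 0.31), ("p4", 0.49), ("p5", 0.88)]
# p4 (0.49) and p2 (0.52) sit closest to 50/50, so they are mined first.
```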

Category Balancing

Ensure each DPO iteration covers all prompt categories proportionally. Prevent over-fitting to common scenarios at the expense of rare but critical ones.

Failure Mode Clustering

Group rejected responses by failure type: hallucination, spatial error, reasoning gap, or format violation. Target collection at the largest remaining cluster.

Convergence Tracking

Monitor preference accuracy and reward gap across iterations. Alert when gains plateau below threshold, signaling diminishing returns on additional pairs.
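The plateau alert can be as simple as thresholding the per-iteration gain; a minimal sketch (the threshold value is an assumption, not a platform default):

```python
def gains_plateaued(accuracy_history, threshold=0.02):
    """Return True when the latest iteration improved preference accuracy
    by less than `threshold`, signaling diminishing returns on new pairs."""
    if len(accuracy_history) < 2:
        return False
    return accuracy_history[-1] - accuracy_history[-2] < threshold

# Round-over-round preference accuracy, as in the iteration table above.
rounds = [0.68, 0.76, 0.82, 0.85]
```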

Cross-Iteration Diff

Compare model outputs on the same prompt across DPO iterations. Visual diff highlights which specific behaviors improved or regressed.

Auto-Trigger Collection

Set preference accuracy targets per category. When a category drops below threshold after new data, Vi automatically queues it for the next collection round.

TECHNICAL COMPARISON

DPO VS RLHF. SIMPLER. FASTER.

Classic RLHF requires training a separate reward model and then running PPO to optimize against it. DPO collapses both stages into a single supervised loss function over preference pairs.

RLHF: Multi-Stage Pipeline
1.

Collect preferences

Same as DPO

2.

Train reward model

Additional model to train and maintain

3.

Run PPO optimization

Requires value model, complex hyperparameters

4.

Evaluate and iterate

Reward model drift requires re-training

Models to Train: 3 (policy + reward + value)
GPU Hours (Typical): 48-120h
Hyperparameters: 15+ (PPO-specific)
Stability: Sensitive to reward hacking
DPO: Single-Stage Optimization
1.

Collect preferences

Same as RLHF

2.

Run DPO optimization

Direct loss on preference pairs, no reward model

3.

Evaluate and iterate

Straightforward, no model drift concerns

Models to Train: 1 (policy only)
GPU Hours (Typical): 8-24h
Hyperparameters: 3 (beta, LR, epochs)
Stability: Inherently stable (KL-constrained)

CONFIGURATION

FULL CONTROL OVER THE DPO LOOP.

Every parameter of the DPO training loop is exposed and configurable. Set beta, learning rate, batch size, and convergence criteria through the dashboard or API.

DPO Training Config

{
  "dpo_config": {
    "beta": 0.1,
    "learning_rate": 5e-7,
    "epochs": 5,
    "batch_size": 4,
    "max_pairs": 2400,
    "kl_threshold": 0.5,
    "early_stop": true,
    "reference_model": "base"
  },
  "edge_case_mining": {
    "enabled": true,
    "confidence_threshold": 0.6,
    "max_samples": 2000
  }
}

Managed Compute

DPO runs on the same managed GPU infrastructure as SFT training. A single A100 processes 2,400 preference pairs in under 4 hours. Multi-GPU support for larger datasets with automatic gradient accumulation.

A100 80GB · H100 80GB · Multi-GPU · Auto Gradient Accumulation

Checkpoint Management

Every DPO iteration saves a new model checkpoint with full metadata: preference dataset version, training config, evaluation metrics, and parent checkpoint lineage. Roll back to any previous iteration with one click.

API Access

Trigger DPO training runs, monitor progress, and retrieve metrics programmatically. Integrate DPO into your CI/CD pipeline with the Vi Python SDK or REST API.

ALIGN YOUR MODEL.
SHIP WITH CONFIDENCE.

DPO preference collection and training included. Start free today.