PLATFORM // GPU COMPUTE
From T4 to B200, select the right GPU tier for your workload. Multi-GPU scaling with NVLink, VRAM-aware scheduling, and background job queues. No DevOps required.
| GPU Type | VRAM | CUDA Cores | Best For | Multi-GPU | Tier |
|---|---|---|---|---|---|
| T4 | 16 GB | 2,560 | Inference, Small LoRA | Up to 4 | Starter |
| L4 | 24 GB | 7,424 | LoRA Fine-Tuning | Up to 8 | Developer |
| A10 | 24 GB | 9,216 | General Purpose Training | Up to 8 | Developer |
| A100 (Recommended) | 80 GB | 6,912 | Production SFT Training | Up to 32 | Developer |
| H100 | 80 GB | 16,896 | Large-Scale Training, NVLink | Up to 64 | Professional |
| B200 | 192 GB | 18,000+ | Largest Models, Multi-Node | Up to 64 | Enterprise |
MULTI-GPU SCALING
Vi manages multi-GPU orchestration automatically. Select the number of GPUs in the hardware configuration modal and Vi handles data parallelism, gradient synchronization, and NVLink topology.
Multi-GPU runs on H100 and B200 tiers use NVLink for direct GPU-to-GPU communication. Vi configures the topology automatically.
Vi handles parallelism strategy selection based on your model size and GPU count. No manual configuration of sharding or gradient sync required.
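As a rough illustration of how a parallelism strategy can follow from model size and GPU count, here is a minimal sketch; the function name and thresholds are hypothetical and not Vi's actual selection logic:

```python
def pick_parallelism(model_vram_gb: float, gpu_vram_gb: float, gpu_count: int) -> str:
    """Pick a parallelism strategy from memory needs and GPU count (illustrative)."""
    if gpu_count == 1:
        return "single"
    if model_vram_gb <= gpu_vram_gb:
        # Model fits on one card: replicate weights, split the batch (data parallel).
        return "data_parallel"
    # Model exceeds a single card: shard weights and optimizer states across GPUs.
    return "sharded"

print(pick_parallelism(14, 80, 4))   # a 7B FP16 model on 4x A100 -> data_parallel
```

The key decision point is whether the full model fits in a single GPU's VRAM; only when it does not is weight sharding required.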
Model checkpoints are saved periodically during training. Resume from any checkpoint if a run is interrupted, or branch from an earlier state.
Training runs server-side on managed clusters. Close your browser, shut your laptop. Vi notifies you when the run completes.
Vi estimates VRAM requirements before provisioning hardware. If your config exceeds the selected GPU tier, Vi warns you before launch.
Track loss curves, epoch progress, and GPU utilization in real time from the dashboard. Email and in-app notifications on completion.
VRAM ESTIMATOR
Vi estimates VRAM requirements based on your model size, training method, and batch configuration. Select a model and method to see the estimated VRAM and the recommended GPU tier.
Model Parameters
The base model parameter count is the primary driver of VRAM usage. A 7B model requires roughly 14 GB in FP16 for the weights alone.
Training Method
LoRA adapters add minimal overhead (typically 1-5% of base parameters). Full SFT requires 2-3x the model weight size for optimizer states and gradients.
Batch Size
Each sample in the batch requires activation memory. Gradient checkpointing trades compute for memory, reducing activation overhead by 60-80%.
Sequence Length
Longer input sequences require quadratically more attention memory. Flash Attention 2 reduces this to near-linear scaling.
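The four drivers above can be combined into a back-of-envelope estimate. The sketch below is illustrative only: the activation constant, LoRA overhead ratio, and checkpointing factor are assumed values, not Vi's actual estimator coefficients.

```python
def estimate_vram_gb(params_b: float, bytes_per_param: float, method: str,
                     batch_size: int, grad_checkpointing: bool = True) -> float:
    """Back-of-envelope training VRAM estimate in GB (assumed coefficients)."""
    weights = params_b * bytes_per_param        # e.g. 7B x 2 bytes (FP16) = 14 GB
    if method == "full_sft":
        overhead = 1.0 * weights                # optimizer states + gradients
    else:                                       # "lora": adapters add a few percent
        overhead = 0.03 * weights
    activations = 0.5 * batch_size              # assumed GB of activations per sample
    if grad_checkpointing:
        activations *= 0.3                      # ~70% reduction (within the 60-80% range)
    return weights + overhead + activations
```

For example, `estimate_vram_gb(7, 2.0, "full_sft", 1)` lands near the 28 GB figure quoted below for Qwen 7B full SFT in FP16.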
VRAM Estimation Table

| Model | Training Method |
|---|---|
| Qwen 7B | LoRA (FP16) |
| Qwen 7B | Full SFT (FP16) |
| Qwen 7B | LoRA (NF4) |
| Qwen 32B | LoRA (FP16) |
| Qwen 32B | Full SFT (FP16) |
| Qwen 72B | LoRA (NF4) |
| InternVL 8B | LoRA (FP16) |
| InternVL 38B | Full SFT (FP16) |
Estimates include model weights, optimizer states, gradients, and activation memory with gradient checkpointing enabled. Actual usage may vary by 10-15% depending on sequence length and batch size.
INFRASTRUCTURE
Use Vi Cloud for zero-setup managed training, or connect your own GPU cluster with custom runners. Your data stays in your infrastructure. Vi orchestrates the training pipeline either way.
- Your Data: S3 / Azure / GCS
- Base Model: Qwen / InternVL / Cosmos
- Your GPUs: Custom Runners (BYOG)
- Datature Vi: Orchestration Layer
- Vi Cloud: T4 to B200 Managed
- Deployment: SDK / NIM / API
- Monitoring: Metrics / Alerts / Logs
Zero-setup GPU access from T4 to B200. Vi provisions, configures, and deallocates hardware automatically. Pay per training run.
Connect your existing GPU cluster via custom runners. Vi orchestrates training on your hardware while keeping data in your infrastructure.
Use Vi Cloud for development and prototyping. Switch to your own cluster for production runs. Same training config, different compute backend.
MEMORY OPTIMIZATION
Train larger models on smaller GPUs with 4-bit NormalFloat quantization. A 7B model drops from 28 GB to 7 GB VRAM. Combined with LoRA, you can fine-tune models that would otherwise require multi-GPU setups on a single card.
Standard half-precision floating point. No quality loss. Recommended for final production runs and full supervised fine-tuning where maximum model quality is required.
8-bit integer quantization with dynamic range calibration. Virtually no quality degradation on most benchmarks. Good balance of memory savings and training stability.
NormalFloat 4-bit quantization with double quantization. Enables 7B LoRA fine-tuning on a T4 (16 GB). Recommended for rapid iteration and fitting larger models into limited VRAM budgets.
Example: Qwen 7B VRAM Breakdown

| Configuration | Total VRAM | Breakdown |
|---|---|---|
| FP16 Full SFT | 28 GB | 14 GB weights + 14 GB optimizer |
| FP16 LoRA | 12 GB | 14 GB weights + minimal adapter overhead |
| NF4 LoRA | 7 GB | 3.5 GB weights + minimal adapter overhead |
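The weight figures above follow directly from bytes per parameter. A minimal sketch of that arithmetic (the byte counts are the standard storage sizes for each format):

```python
# Weight memory = parameter count (billions) x bytes per parameter.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "nf4": 0.5}

def weight_gb(params_billions: float, quant: str) -> float:
    """Approximate GB needed to hold the model weights at a given precision."""
    return params_billions * BYTES_PER_PARAM[quant]

print(weight_gb(7, "fp16"))  # 14.0
print(weight_gb(7, "nf4"))   # 3.5
```

This is why NF4 cuts weight memory to a quarter of FP16: 4 bits per parameter instead of 16.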
COST MANAGEMENT
Vi keeps compute costs down automatically: background job queues batch runs for off-peak scheduling, the VRAM estimator prevents over-provisioning, and instance routing selects the most cost-effective hardware for each workload.
Instance Routing
Vi routes training jobs to the most cost-effective instance type for your workload. Spot instances when available, on-demand when needed. Automatic failover between instance types.
Background Job Queue
Launch training runs and close your browser. Vi executes jobs in the background and sends email notifications when runs complete or encounter errors.
Right-Size GPU Selection
The VRAM estimator recommends the cheapest GPU tier that fits your workload. Stop paying for A100s when a T4 is sufficient for your LoRA fine-tune.
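The tier recommendation reduces to picking the first tier whose VRAM covers the estimate. A sketch using the VRAM figures from the GPU table above, assuming tiers are listed cheapest first (the price ordering is an assumption based on the tier labels):

```python
# Single-GPU VRAM per tier, ordered cheapest first (assumed price ordering).
TIERS = [("T4", 16), ("L4", 24), ("A10", 24), ("A100", 80), ("H100", 80), ("B200", 192)]

def cheapest_fit(required_vram_gb: float) -> str:
    """Return the cheapest GPU tier whose VRAM covers the estimated requirement."""
    for name, vram in TIERS:
        if vram >= required_vram_gb:
            return name
    raise ValueError("No single-GPU tier fits; consider multi-GPU or quantization.")

print(cheapest_fit(12))  # T4  -- a 7B NF4 LoRA run fits on the cheapest tier
print(cheapest_fit(28))  # A100 -- 7B FP16 full SFT needs an 80 GB card
```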
Academic Pricing
Discounted compute rates for verified academic and research institutions. Contact the Vi team with your institutional email for eligibility verification.
Fully managed GPU infrastructure. No cloud account required. Vi provisions, schedules, and deallocates GPUs automatically. Pay per GPU-hour with per-second billing.
Connect your existing AWS, GCP, or Azure account. Vi orchestrates training jobs on your cloud infrastructure while you retain full control over billing and data residency.
WORKFLOW
Configuring and launching a GPU training run takes under two minutes. No Dockerfiles, no CUDA drivers, no Kubernetes manifests. Select your hardware, configure the run, and monitor results.
Open the Hardware Configuration modal. Choose your GPU type and quantity from the available tiers. The VRAM estimator shows whether your selection fits the model and training method.
Set your training hyperparameters, dataset split, and output checkpoint location. Vi validates the full configuration against the selected hardware before launch.
Launch the run and track progress in real time. Loss curves, VRAM usage, and throughput metrics stream to the Neural Monitor dashboard. Email notification on completion.
{
"gpu_type": "A100",
"gpu_count": 4,
"model": "Qwen2.5-VL-7B",
"method": "lora",
"quantization": "fp16",
"lora_rank": 16,
"epochs": 3,
"batch_size": 8,
"learning_rate": 2e-4,
"spot_instance": true,
"checkpoint_interval": 500,
"notify_on_complete": true
}
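A run configuration like the one above is plain JSON, so it can be parsed and sanity-checked client-side before launch. The checks below are illustrative assumptions, not Vi's validation rules (Vi performs its own validation server-side):

```python
import json

# The run configuration shown above, as a JSON string.
raw = '''{
  "gpu_type": "A100", "gpu_count": 4,
  "model": "Qwen2.5-VL-7B", "method": "lora",
  "quantization": "fp16", "lora_rank": 16,
  "epochs": 3, "batch_size": 8, "learning_rate": 2e-4,
  "spot_instance": true, "checkpoint_interval": 500,
  "notify_on_complete": true
}'''

config = json.loads(raw)

# Illustrative client-side sanity checks before submitting the run.
assert config["gpu_count"] >= 1
assert config["method"] in {"lora", "full_sft"}
assert 0 < config["learning_rate"] < 1
print("config OK:", config["model"], "on", config["gpu_count"], "x", config["gpu_type"])
```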
Managed GPU infrastructure from T4 to B200. Start free today.