Configuration

KV Cache CPU Offloading: Reduces VRAM usage by offloading the KV cache to CPU RAM. Slows inference by 20-50%.
Batch Size (default: 1; recommended: match concurrent users): Number of users processed simultaneously. Higher values give better throughput but require more VRAM.
Chunked Prefill: Splits large prompts into chunks to avoid VRAM spikes. Essential for high batch sizes with large contexts.
Chunk Size: Maximum tokens processed per chunk with chunked prefill. Lower values mean a smaller VRAM spike but slightly slower prefill.
Base TPS: Estimated single-stream tokens per second for this GPU configuration. Suggested: 300-600 for an RTX 6000 Blackwell.
VRAM Utilization Target (default: 60%)
Payback Period: Time to recover hardware costs through rental. 36 months is standard.
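The KV-cache offload trade-off above can be sketched numerically. This is an illustrative model, not the calculator's internal formula: it assumes the slowdown grows linearly with the fraction of KV cache offloaded, capped at the 50% upper bound quoted above.

```python
def offloaded_tps(base_tps: float, offload_fraction: float,
                  max_slowdown: float = 0.5) -> float:
    """Effective single-stream TPS when a fraction of the KV cache
    lives in CPU RAM. Assumes slowdown scales linearly with the
    offloaded fraction, reaching max_slowdown (50%) when fully offloaded."""
    return base_tps * (1 - offload_fraction * max_slowdown)

# Example: a 400-TPS GPU with half the KV cache offloaded
print(offloaded_tps(400, 0.5))  # 300.0
```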

Results

Hardware Cost

€0
One-time purchase

Monthly OpEx

€0
Electricity + Maintenance

Monthly Rental

€0
36-month payback + profit (15% margin)

VRAM Requirements

Model Weights: 0 GB
KV Cache (Generation): 0 GB
Activations (Prefill): 0 GB
Prefill Spike (Naive): 0 GB
Total Required: 0 GB
Per-GPU VRAM: 0 GB
Chunked Prefill Savings: 0 GB
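The VRAM components above can be approximated with a back-of-the-envelope model. A hedged sketch: the default shapes (layers, KV heads, head dimension) are illustrative assumptions roughly matching a 70B-class model in FP16 with grouped-query attention, not the calculator's internals.

```python
def estimate_vram_gb(params_b: float, batch_size: int, context_len: int,
                     bytes_per_param: int = 2,    # FP16 weights (assumption)
                     num_layers: int = 80, num_kv_heads: int = 8,
                     head_dim: int = 128, kv_bytes: int = 2) -> dict:
    """Rough per-component VRAM estimate in GB. params_b is in billions."""
    weights = params_b * bytes_per_param
    # K and V tensors per layer, per token:
    kv_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_bytes
    kv_cache = kv_per_token * context_len * batch_size / 1e9
    return {"weights": weights, "kv_cache": kv_cache,
            "total": weights + kv_cache}

# Example: 70B model, batch of 8, 32k context
est = estimate_vram_gb(70, 8, 32768)
```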

Prefill Analysis

Total Tokens to Prefill: 0
Prefill Chunks Needed: 0
Est. Prefill Time (Chunked): 0 s
Est. Prefill Time (Naive): 0 s
Prefill VRAM Spike Risk: Low
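The chunk count and chunked-prefill time follow from simple ceiling division over the chunk size; the per-chunk overhead below is an assumed constant for illustration, not a measured value.

```python
import math

def prefill_plan(total_tokens: int, chunk_size: int, prefill_tps: float,
                 per_chunk_overhead_s: float = 0.01):
    """Chunks needed, plus estimated chunked-prefill time in seconds."""
    chunks = math.ceil(total_tokens / chunk_size)
    time_s = total_tokens / prefill_tps + chunks * per_chunk_overhead_s
    return chunks, time_s

# Example: 32k tokens of prompt in 4k-token chunks at 10k prefill tok/s
chunks, t = prefill_plan(32768, 4096, 10_000)
```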

Hardware Requirements

GPUs Needed: 0
Nodes Required: 0
System RAM: 0 GB
Storage: 0 GB
Total Power: 0 kW

Performance Metrics

Total VRAM Capacity: 0 GB
VRAM Utilization: 0%
Speed Impact: 0%
KV Cache Offloaded: 0 GB
Max Batch Size: 0

Throughput Analysis

Base TPS (Single Stream): 0
Batch Size: 0
Total System Throughput: 0 tok/s
Per-User Speed: 0 tok/s
PCIe Scaling Factor: 0
Effective Throughput: 0 tok/s
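System throughput under continuous batching grows sublinearly with batch size. A minimal sketch of the relationship between the rows above; the 0.7 batching-efficiency factor is an assumed illustrative value, not the calculator's coefficient.

```python
def throughput(base_tps: float, batch_size: int,
               batch_efficiency: float = 0.7, pcie_scaling: float = 1.0):
    """Total system throughput and per-user speed in tok/s.
    Assumes each extra concurrent stream adds batch_efficiency times
    the single-stream rate, scaled by a PCIe interconnect factor."""
    total = base_tps * (1 + (batch_size - 1) * batch_efficiency) * pcie_scaling
    return total, total / batch_size

# Example: 400 TPS single-stream, batch of 8
total, per_user = throughput(400, 8)
```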

PagedAttention Comparison

Without PagedAttention: 0 GPUs
With PagedAttention: 0 GPUs
GPUs Saved: 0
Cost Savings: €0
Performance Trade-off: None
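The comparison above rests on one idea: without PagedAttention, each request must reserve KV cache for its full maximum context, while paged allocation only consumes blocks for tokens actually generated. A hedged sketch under assumed numbers (the 327,680 bytes/token figure matches the 70B-class shapes used earlier; average-use and GPU sizes are illustrative):

```python
import math

def kv_gb(tokens: int, batch: int, kv_bytes_per_token: int = 327_680) -> float:
    """KV cache size in GB (bytes/token is an assumed 70B-class figure)."""
    return tokens * batch * kv_bytes_per_token / 1e9

def gpus_with_and_without_paging(max_context: int, avg_tokens_used: int,
                                 batch: int, weights_gb: float,
                                 per_gpu_gb: float):
    """Without paging, each request reserves KV for the full context;
    with paging, only actually-used tokens consume memory blocks."""
    naive = weights_gb + kv_gb(max_context, batch)
    paged = weights_gb + kv_gb(avg_tokens_used, batch)
    need = lambda vram: math.ceil(vram / per_gpu_gb)
    return need(naive), need(paged)

# Example: 32k max context, ~8k average use, batch 8, 140 GB weights, 96 GB GPUs
without_paging, with_paging = gpus_with_and_without_paging(32768, 8192, 8, 140, 96)
```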

Cost Breakdown

GPU Cost: €0
Base System Cost: €0
Monthly Electricity: €0
Monthly Maintenance: €0

Rental Price Breakdown

Hardware Depreciation: €0
Electricity (with cooling): €0
Bandwidth/Network: €0
Support & Maintenance: €0
Profit Margin (15%): €0
Total Rental Price: €0
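The rental price follows from depreciating hardware straight-line over the payback period, adding monthly operating costs, and applying the 15% margin. A minimal sketch; the bandwidth and support line items above are folded into a single monthly_opex parameter here.

```python
def monthly_rental(hardware_cost: float, monthly_opex: float,
                   payback_months: int = 36, margin: float = 0.15) -> float:
    """Monthly rental price: straight-line hardware depreciation over
    the payback period, plus operating costs, plus the profit margin."""
    base = hardware_cost / payback_months + monthly_opex
    return base * (1 + margin)

# Example: €36,000 of hardware, €500/month opex, 36-month payback
price = monthly_rental(36_000, 500)
```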

Recommendations