On-Premise LLM: GPU Calculations and Infrastructure Risks

Budgeting for in-house neural network hardware with public calculators is the shortest path to a cash flow crisis. The LLMStart.ru team found this out when deploying the GPT-OSS-120B model (MoE architecture): popular services such as apxml.com promise the world, and those promises shatter during the first load test. On paper, a system powered by two RTX Pro 6000 Blackwell cards should have delivered 4,696 tokens per second. The reality was much harsher: live hardware benchmarks peaked at just 880 tokens per second. A discrepancy of more than fivefold is not a statistical margin of error; it is a fatal flaw in resource planning.
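The size of the gap is easy to check from the two figures quoted above. The snippet below is only a back-of-the-envelope sketch; the variable names are illustrative and not taken from the original benchmark.

```python
# Figures quoted in the article; variable names are illustrative.
predicted_tps = 4696   # calculator estimate, tokens per second
measured_tps = 880     # peak from the live benchmark, tokens per second

# How many times the calculator overstated real throughput.
overestimate_factor = predicted_tps / measured_tps

# Share of the promised capacity that never materialized.
shortfall = 1 - measured_tps / predicted_tps

print(f"Calculator overestimates by {overestimate_factor:.2f}x")  # 5.34x
print(f"Capacity shortfall: {shortfall:.1%}")                     # 81.3%
```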

Anatomy of a Miscalculation: When Math Fails Architecture

The mechanics of this failure lie in the specifics of Mixture of Experts (MoE) models and in how reasoning models work. Although only about 5 billion of the 120 billion parameters are active at any given moment, neither video memory (VRAM) load nor inference latency follows linear textbook formulas. A business running in a closed, on-premise environment also lacks the "safety net" of cloud autoscaling. If the purchased hardware cannot handle user traffic, the project becomes a multi-million dollar monument to inefficient investment, because on-premise capacity cannot be scaled up on the fly.
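One reason the active-parameter count misleads is that an MoE model normally has to keep every expert resident in memory, even though each token touches only a few of them. The sketch below illustrates that with the article's parameter counts. The bit widths are assumptions chosen for illustration, not the precision used in the original test, and the estimate ignores KV cache, activations and framework overhead.

```python
def weight_memory_gb(params: float, bits_per_param: float) -> float:
    """Memory occupied by `params` weights, in GB (10**9 bytes)."""
    return params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 120e9   # from the article
ACTIVE_PARAMS = 5e9    # from the article, approximate

# Assumed precisions, for illustration only.
for bits in (16, 8, 4):
    resident = weight_memory_gb(TOTAL_PARAMS, bits)   # must fit in VRAM
    active = weight_memory_gb(ACTIVE_PARAMS, bits)    # touched for one token
    print(f"{bits:>2}-bit weights: {resident:6.1f} GB resident, "
          f"~{active:4.1f} GB active per token")
```

At 16 bits the weights alone come to 240 GB, and at 4 bits to 60 GB, so the memory budget is set by the total parameter count, not the active one. Under batching, different requests route to different experts, which means the "active" figure per step also grows with concurrency.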

"In theory, it's a breakthrough; in practice, it's an 80% capacity shortfall. Marketing calculators consistently ignore how vLLM actually behaves under real-world stress."

Interestingly, when the load increased to eight parallel dialogues, the Time to First Token (TTFT) actually dropped by 17%. This is a counter-intuitive effect of optimization libraries that utilize GPU cores more efficiently during heavy batching than with single requests. However, this same test revealed a hidden "intelligence tax": reasoning models generate up to three "invisible" tokens for every one visible token. The GPU runs at full throttle, processing internal chains of thought, while the user waits for a concise answer.
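The "intelligence tax" can be put into numbers. The sketch below assumes that the measured 880 tokens per second counts all generated tokens, reasoning included, and applies the worst-case ratio of three hidden tokens per visible one; the article states neither assumption explicitly, and the function name is illustrative.

```python
def visible_tps(total_tps: float, hidden_per_visible: float) -> float:
    """Tokens per second that actually reach the user.

    For every visible token the model also produces `hidden_per_visible`
    reasoning tokens, so only 1 / (1 + hidden_per_visible) of the
    generated stream is useful output.
    """
    return total_tps / (1 + hidden_per_visible)

measured_tps = 880       # peak throughput from the article
worst_case_ratio = 3     # "up to three invisible tokens" per visible one

print(visible_tps(measured_tps, worst_case_ratio))  # 220.0
```

Under these assumptions, the user-visible output rate is a quarter of what the GPU actually generates.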

The Infrastructure Trap: Why Specifications Lie

Public calculators promised support for eight concurrent users with a throughput of nearly 5,000 tokens per second. Prefix caching reduced p95 latency by 67%, but even that could not rescue overall throughput: the measured peak stayed at roughly one fifth of the calculator's figure (880 versus 4,696 tokens per second). The takeaway for CEOs: purchasing GPUs based on standard formulas without load-testing a prototype is a direct risk to your budget.
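A prototype load test ultimately reduces to a handful of numbers per request. The following is a minimal sketch of how p95 TTFT and aggregate throughput could be derived from recorded timings. The `RequestTiming` record, the helper names and the sample values are all hypothetical; they are not the LLMStart.ru harness or data.

```python
import math
from dataclasses import dataclass

@dataclass
class RequestTiming:
    start: float         # seconds, when the request was sent
    first_token: float   # seconds, when the first token arrived
    end: float           # seconds, when the last token arrived
    output_tokens: int   # generated tokens, reasoning tokens included

def percentile(values, pct):
    """Nearest-rank percentile, pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

def summarize(timings):
    """Return p95 time-to-first-token and aggregate tokens per second."""
    ttfts = [t.first_token - t.start for t in timings]
    wall_clock = max(t.end for t in timings) - min(t.start for t in timings)
    total_tokens = sum(t.output_tokens for t in timings)
    return {
        "ttft_p95_s": percentile(ttfts, 95),
        "aggregate_tps": total_tokens / wall_clock,
    }

# Made-up timings for three overlapping requests.
sample = [
    RequestTiming(0.0, 0.5, 10.0, 1000),
    RequestTiming(0.0, 0.75, 10.0, 1200),
    RequestTiming(1.0, 1.25, 8.0, 800),
]
report = summarize(sample)
print(report)
```

Aggregate throughput is computed over the wall-clock span of the whole run, not by summing per-request speeds, because overlapping requests share the same GPU time. Comparing this figure against the calculator's estimate before purchase is the point of the prototype test.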

Executives must face the facts: the era of "buying by the spec sheet" is over. Sovereign infrastructure offers independence, but it leaves no room for error. Without a thorough audit of real-world performance on your specific tasks, buying servers remains an expensive lottery in which the house always wins, leaving you with a broken service and an empty bank account.

Source: Хабр ML

