Zhipu.AI is trading cloud exclusivity for raw local speed. On April 15, 2025, the Beijing-based company released its next generation of models: the GLM-4 series and the GLM-Z1 reasoning models. While Western competitors guard their proprietary weights, Zhipu.AI is publishing its portfolio under the permissive MIT license and launching a dedicated international domain, Z.ai. This is not merely academic goodwill; it is an aggressive land grab in the global inference market. The goal is simple: cement the company's position before it heads for its IPO.
200 Tokens Per Second: The Architecture of Speed
The technological core of this expansion is GLM-Z1-32B-0414. According to its developers, it runs eight times faster than DeepSeek-R1. The claim is not limited to theoretical benchmarks on industrial clusters: the model is said to deliver 200 tokens per second on standard consumer-grade hardware. By combining Grouped-Query Attention (GQA), aggressive quantization, and speculative sampling, Zhipu.AI engineers report generation roughly 50 times faster than the average human reading pace.
GLM-Z1 delivers 200 tokens per second on standard GPUs, 50 times faster than you can read.
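The "50 times faster" comparison is easier to judge once it is turned into a reading pace. The sketch below works backward from the two figures quoted above; the words-per-token ratio is an assumed rule of thumb for English text, not a number from Zhipu.AI.

```python
# Back-of-the-envelope check of the "50 times faster than reading" claim.
# WORDS_PER_TOKEN is an assumed rule of thumb, not a figure from the article.

TOKENS_PER_SECOND = 200      # generation speed quoted in the article
SPEEDUP_OVER_READING = 50    # ratio quoted in the article
WORDS_PER_TOKEN = 0.75       # assumption: rough average for English text


def implied_reading_pace(tokens_per_second: float, speedup: float) -> float:
    """Return the reading pace (tokens/s) that the quoted ratio implies."""
    return tokens_per_second / speedup


def tokens_to_words_per_minute(tokens_per_second: float, words_per_token: float) -> float:
    """Convert a token rate into words per minute."""
    return tokens_per_second * words_per_token * 60


reader_tps = implied_reading_pace(TOKENS_PER_SECOND, SPEEDUP_OVER_READING)
reader_wpm = tokens_to_words_per_minute(reader_tps, WORDS_PER_TOKEN)
print(f"Implied reading pace: {reader_tps:.1f} tokens/s, about {reader_wpm:.0f} words/min")
```

Under that assumption, the quoted ratio implies a reader moving at about 4 tokens per second, or roughly 180 words per minute.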
For CTOs, this is a game-changer: instead of relying on a cloud provider, they gain "sovereignty at the edge." The release also includes the base model GLM-4-32B-0414, tuned specifically for agentic workflows such as web search, tool use, and real-time code generation (HTML, CSS, JS, SVG). By releasing compact 9-billion-parameter versions, Zhipu.AI is commoditizing high-performance reasoning models for resource-constrained systems.
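To see why parameter count and quantization matter for local deployment, it helps to estimate how much memory the weights alone occupy. The sketch below is a rough calculation only: the bit widths are illustrative assumptions (the article does not say which quantization levels are used), and it ignores the KV cache, activations, and runtime overhead.

```python
# Rough estimate of weight memory for the 32B and 9B model sizes.
# Bit widths are illustrative assumptions, not published settings.


def weight_memory_gib(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


for label, params_billion in [("32B", 32), ("9B", 9)]:
    for bits in (16, 8, 4):
        gib = weight_memory_gib(params_billion, bits)
        print(f"{label} model at {bits}-bit weights: about {gib:.1f} GiB")
```

By this estimate, the 32B weights shrink from about 60 GiB at 16 bits to about 15 GiB at 4 bits, while the 9B weights at 4 bits come in under 5 GiB, which is the gap that makes the smaller versions plausible on constrained hardware.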
Autonomous Agents and the Rumination Model
Zhipu.AI is moving beyond reactive chatbots with its "thinking" model, GLM-Z1-Rumination-32B-0414. The company claims the model is capable of active searching, self-verification, and iterative problem-solving on complex, open-ended queries. The bet on autonomy suggests that the future of enterprise agents lies in self-correction rather than an endless loop of prompt and response. Monetization plans center on a Model-as-a-Service (MaaS) platform with tiered pricing, ranging from the ultra-fast GLM-Z1-AirX to the budget-friendly GLM-Z1-Air.
The Rumination model is an attempt to create an agent that can verify its own work before providing an answer.
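The general idea of an agent that checks its own work can be illustrated with a generic draft, verify, and revise loop. The sketch below is a toy illustration of that pattern, not Zhipu.AI's implementation: `solve_with_verification`, `toy_propose`, and `toy_verify` are hypothetical names, and a simple arithmetic search stands in for the model and its checker.

```python
from typing import Callable, Optional, Tuple

Feedback = Optional[str]


def solve_with_verification(
    propose: Callable[[Optional[int], Feedback], int],
    verify: Callable[[int], Tuple[bool, Feedback]],
    max_rounds: int = 10,
) -> Tuple[Optional[int], int]:
    """Draft, check, and revise until the check passes or rounds run out."""
    candidate: Optional[int] = None
    feedback: Feedback = None
    for round_number in range(1, max_rounds + 1):
        candidate = propose(candidate, feedback)
        passed, feedback = verify(candidate)
        if passed:
            return candidate, round_number
    return None, max_rounds


# Toy stand-ins for a model and a checker: find the integer whose square is 144.
def toy_propose(previous: Optional[int], feedback: Feedback) -> int:
    if previous is None:
        return 10  # first draft, made without any feedback
    return previous + 1 if feedback == "too low" else previous - 1


def toy_verify(candidate: int) -> Tuple[bool, Feedback]:
    square = candidate * candidate
    if square == 144:
        return True, None
    hint = "too low" if square < 144 else "too high"
    return False, hint


answer, rounds = solve_with_verification(toy_propose, toy_verify)
print(f"answer={answer} after {rounds} rounds")
```

The point of the pattern is that the answer is returned only after the verifier accepts it, and a capped number of rounds keeps the loop from running indefinitely when no draft passes.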
As Zhipu.AI expands its footprint via the Z.ai web interface and mobile app, the strategy is becoming clear. Combining fast open-source models with a robust API allows the company to simultaneously court the developer community and secure major enterprise contracts. This dual-track approach builds a massive user base that serves as a powerful valuation multiplier ahead of the company's public listing.
Zhipu.AI has masterfully packaged an eightfold speed increase as a gift to the open-source community precisely when it needed to prepare for the public markets. The Z.ai international domain appeared at the exact moment the company needed to distance itself from local narratives. Global accessibility is a convenient banner for a firm that must prove to investors that its growth isn't constrained by China's borders or a shortage of sanctioned chips.