LLM-based multi-agent systems are often marketed as a direct path to autonomous reasoning, but in practice, scaling them frequently becomes a Sisyphean task with diminishing returns. Research by Blaž Bertalanič and Carolina Fortuna of the Jožef Stefan Institute finds that silicon intelligence is just as prone to the "Ringelmann effect" as its human counterpart: each agent's individual efficiency drops as soon as the team begins to expand. While marketers debate whether "swarms" improve response quality, the researchers have derived a mathematical scaling law: R(N) = 1/(1 + c(N-1)N^(-β)), where N is the number of agents and c and β are fitted parameters. This law describes how many nominal agents actually provide independent evidence and how many merely generate redundant noise.

Three Modes of Agent Performance

The core of the matter lies in classifying performance into three regimes based on the exponent β. A configuration can hit a "hard ceiling" (β=0), grow sublinearly (0<β<1), or grow linearly (β≥1). In a hard-ceiling scenario, adding new agents provides exactly zero additional diversity: thirty agents arguing over MMLU-Hard tasks produce no more insight than one. For most models on closed tasks, this is a structural wall; further scaling simply burns budget with no chance of progress.
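To make the three regimes concrete, here is a minimal Python sketch that evaluates the scaling law for one example of each. It rests on two assumptions that are not stated in the article: that the effective number of independent agents can be read as N · R(N), and an illustrative value c = 0.5. The function names are hypothetical.

```python
def per_agent_efficiency(n: int, c: float, beta: float) -> float:
    """R(N) = 1 / (1 + c * (N - 1) * N**(-beta)), the law quoted in the article."""
    return 1.0 / (1.0 + c * (n - 1) * n ** (-beta))


def effective_agents(n: int, c: float, beta: float) -> float:
    """Assumed reading: N nominal agents count as N * R(N) independent ones."""
    return n * per_agent_efficiency(n, c, beta)


if __name__ == "__main__":
    c = 0.5  # illustrative value only, not taken from the study
    regimes = [("hard ceiling", 0.0), ("sublinear", 0.5), ("linear", 1.0)]
    for label, beta in regimes:
        row = ", ".join(
            f"N={n}: {effective_agents(n, c, beta):.2f}" for n in (1, 5, 30, 300)
        )
        print(f"{label:12s} beta={beta}: {row}")
```

Under this reading, the β=0 row flattens out toward 1/c however large N becomes, the 0<β<1 row keeps growing but ever more slowly, and the β=1 row grows roughly in proportion to N.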

"A pilot run on a sample of N≤5 allows for the prediction of the structural ceiling for N=30. In the tested configurations, only architectural diversity (heterogeneous teams) reduces the 'c' coefficient and allows for breaking through the ceiling; changing the communication mode is powerless here."

The study's methodology is impressive: an analysis of 44 configurations across Qwen, Llama, Ministral, and Gemini models showed that the mathematical model fits the data with R² > 0.99. On open-ended math problems, dense peer influence can actually push the system from sublinear growth down into the hard-ceiling regime. For tech leads, this is a wake-up call: logic often breaks because agent interaction fosters conformity rather than mutual error-checking.
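The pilot-then-extrapolate idea from the quote above can be sketched in a few lines of Python. The article does not describe the authors' fitting procedure, so the brute-force grid search below is a stand-in, and the pilot measurements are synthetic values generated from the law itself (with assumed c = 0.4 and β = 0.3) rather than data from the study. Real pilot data would be noisy and the recovered parameters correspondingly less exact.

```python
def efficiency(n, c, beta):
    """Scaling law from the article: R(N) = 1 / (1 + c * (N - 1) * N**(-beta))."""
    return 1.0 / (1.0 + c * (n - 1) * n ** (-beta))


def fit_pilot(ns, observed, c_grid, beta_grid):
    """Brute-force least squares over a (c, beta) grid; returns the best pair."""
    best = None
    for c in c_grid:
        for beta in beta_grid:
            err = sum((efficiency(n, c, beta) - r) ** 2 for n, r in zip(ns, observed))
            if best is None or err < best[0]:
                best = (err, c, beta)
    return best[1], best[2]


# Synthetic, noiseless pilot for N = 1..5 (assumed parameters, not study data).
TRUE_C, TRUE_BETA = 0.4, 0.3
pilot_ns = [1, 2, 3, 4, 5]
pilot_r = [efficiency(n, TRUE_C, TRUE_BETA) for n in pilot_ns]

c_grid = [i / 100 for i in range(1, 201)]     # c in 0.01 .. 2.00
beta_grid = [i / 100 for i in range(0, 151)]  # beta in 0.00 .. 1.50
c_hat, beta_hat = fit_pilot(pilot_ns, pilot_r, c_grid, beta_grid)

# Extrapolate the fitted curve to a team size never run in the pilot.
predicted_r30 = efficiency(30, c_hat, beta_hat)
print(f"c={c_hat:.2f}, beta={beta_hat:.2f}, predicted R(30)={predicted_r30:.3f}")
```

A fitted β close to 0 would flag a hard ceiling before any budget is spent on a thirty-agent run.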

Mean-Field Theorem and Design Collapse

One of the most pragmatic findings concerns the trade-off between the number of agents and the duration of the discussion. The mean-field theorem predicts that the number of participants (k) and the number of debate rounds (τ) influence system dynamics only through their product, kτ. Put simply, a crowd of agents is mathematically interchangeable with a long discussion. This "design space collapse" simplifies deployment planning: instead of a multidimensional grid search, a single short pilot is sufficient.
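As a rough illustration of what this collapse means for planning, the Python sketch below groups a grid of (k, τ) designs by their product. It assumes the mean-field prediction holds exactly, so that every design in a group behaves the same. The grid values and the function name are hypothetical, chosen only for the example.

```python
from collections import defaultdict


def collapse_by_product(agent_counts, round_counts):
    """Group (k, tau) designs by k * tau, the only quantity the theorem says matters."""
    groups = defaultdict(list)
    for k in agent_counts:
        for tau in round_counts:
            groups[k * tau].append((k, tau))
    return dict(sorted(groups.items()))


groups = collapse_by_product([1, 2, 3, 4, 6], [1, 2, 3, 4, 6])
total_designs = sum(len(designs) for designs in groups.values())
print(f"{total_designs} designs collapse to {len(groups)} distinct values of k*tau")
for product, designs in groups.items():
    print(product, designs)
```

If the prediction holds, a design with two agents and six rounds would not need to be tested separately from one with six agents and two rounds.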

"In homogeneous teams, the gains usually attributed to 'debate' actually stem from the model re-evaluating its own thoughts, rather than from content received from peers."

Interestingly, using a "noise placebo" to track self-correction confirmed that in groups of identical models, the benefit of discussion is an illusion: the model is merely double-checking itself. The current trend of building "swarms" from the same LLM version therefore appears blatantly inefficient. The only way to lower the 'c' coefficient and move beyond the hard ceiling is through architectural diversity. Heterogeneous teams drawn from different model families work better than any attempt to fine-tune communication protocols or to add another dozen identical agents.
