Modern large language models rely on Rotary Positional Embeddings (RoPE) to encode relative distances between tokens. In practice, however, they still end up depending on absolute positions. Researchers from Sapienza University of Rome and Intuition Machines have found that, despite RoPE's mathematical elegance, positional information "leaks" into the model through two specific architectural holes. This explains the paradox of why a model trained on relative offsets can easily distinguish a token at position 50 from an identical one at position 100, even though in theory it should perceive only their relative placement.
Mechanisms of positional data leakage
The first leakage point is the causal mask. The softmax denominator for each query sums over that query and every preceding token, so the number of terms grows with the query's absolute position and the normalization is inherently tied to it. The second leak lies in the residual stream, where the very first token of a sequence serves as an anchor.
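To make the first leakage point concrete, here is a minimal sketch in plain Python. It assumes a toy attention score that depends only on the relative offset between query and key (a stand-in for a purely relative scheme, not RoPE itself); the function name and the `decay` parameter are illustrative and do not come from the research described here. Even with purely relative logits, the causal softmax denominator changes with the query's absolute position.

```python
import math

def causal_attention_weights(position, decay=0.1):
    """Softmax weights for one query under a causal mask.

    Toy assumption: the logit depends only on the relative offset
    (query position minus key position), never on absolute position.
    """
    # Keys 0..position are visible; later keys are masked out.
    logits = [-decay * (position - key) for key in range(position + 1)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    denom = sum(exps)   # number of terms grows with the absolute position
    return [e / denom for e in exps], denom

weights_50, denom_50 = causal_attention_weights(50)
weights_100, denom_100 = causal_attention_weights(100)

# The same relative offset (0, the query attending to itself) receives a
# different weight at position 50 than at position 100.
print("denominator at 50 vs 100:", denom_50, denom_100)
print("self-weight at 50 vs 100:", weights_50[-1], weights_100[-1])
```

The difference shrinks as the logits decay faster with distance, but it never vanishes, which is enough for later layers to recover a position-dependent signal.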
Under causal attention, the token at position zero attends only to itself, which gives it a deterministic activation trajectory that subsequent attention heads read as a consistent landmark.
If the Beginning of Sequence (BOS) token is removed or replaced, this signal fades. This confirms that the model uses the start of the text as a fixed coordinate system rather than relying on pure relativity.
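The anchor effect at position zero can be sketched with a toy single-head causal attention over scalar "tokens". This is a simplified illustration under stated assumptions: queries, keys, and values are all the raw token value, and `toy_causal_attention` is a hypothetical helper, not code from the research. The point is only that the first output is the same whatever follows it.

```python
import math

def toy_causal_attention(tokens):
    """Single-head causal attention where query = key = value = token.

    Toy assumption: scalar tokens and a plain product as the logit.
    """
    outputs = []
    for i, query in enumerate(tokens):
        visible = tokens[: i + 1]              # causal mask: keys 0..i only
        logits = [query * key for key in visible]
        peak = max(logits)                     # numerical stability
        exps = [math.exp(x - peak) for x in logits]
        denom = sum(exps)
        outputs.append(sum(e * v for e, v in zip(exps, visible)) / denom)
    return outputs

# Two sequences that share only their first token.
out_a = toy_causal_attention([1.0, 0.3, -0.7, 2.0])
out_b = toy_causal_attention([1.0, -1.5, 0.9, 0.1])

# Position zero sees only itself, so its output does not depend on the
# continuation; later positions do.
print("position 0:", out_a[0], out_b[0])
print("position 1:", out_a[1], out_b[1])
```

Because the first position's output is fixed by the first token alone, a constant BOS token yields the same activation in every sequence, which is what lets later heads treat it as a reference point.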
Implications for AI architects
For those designing model architectures, this is a wake-up call: true invariance to context length remains an elusive dream.
Residual stream leakage can be partially suppressed via NTK-scaling, but errors tend to accumulate within sliding window attention mechanisms, so reliability when processing long-form text becomes something of a lottery.
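For readers unfamiliar with NTK-scaling, the sketch below shows the commonly cited "NTK-aware" variant, which rescales the RoPE base instead of the positions. The article does not say which variant the researchers used, so treat this as an assumption; `head_dim`, `base`, and `scale` are illustrative parameter names. The sketch only shows how the frequencies change, not how leakage is suppressed.

```python
def rope_inverse_frequencies(head_dim, base=10000.0, scale=1.0):
    """Per-pair RoPE rotation frequencies with NTK-aware base rescaling.

    Assumed formula: base' = base * scale ** (head_dim / (head_dim - 2)).
    With scale == 1.0 this reduces to the standard RoPE frequencies.
    """
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    return [ntk_base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

plain = rope_inverse_frequencies(head_dim=64)
scaled = rope_inverse_frequencies(head_dim=64, scale=4.0)

# The fastest-rotating pair is left untouched, while the slowest one is
# slowed down by the full scale factor.
print("highest frequency:", plain[0], scaled[0])
print("lowest frequency ratio:", plain[-1] / scaled[-1])
```

The design choice here is that local, high-frequency detail is preserved while only the long-range components are stretched, which is why this kind of rescaling is a popular first step when extending context.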
Engineering leads must understand that extending context is not merely a matter of tweaking RoPE parameters; it is a battle against implicit "anchors" at position zero. These anchors break attention predictability as soon as a sequence grows longer than those seen in training.
Research takeaways
Your LLM is nowhere near as position-independent as its specifications claim. The causal mask and the initial token act as permanent beacons, tethering the model to an absolute coordinate grid. Until these fundamental architectural leaks are plugged, stable long-context performance will remain an illusion supported by engineering workarounds rather than mathematical guarantees.