The economics of large language models are currently dominated by token-based pricing, but a new tool called pxpipe has found an elegant loophole in Anthropic’s billing system. Developers have discovered that it can be cheaper for a model to "look" at an image than to read the same content as text. The pxpipe utility converts long text strings into compact PNG files, exploiting the cost disparity between modalities: text is billed per token, while an image is billed by its pixel dimensions, regardless of how much information is packed into it. As a result, engineers can reportedly cram up to 3.1 characters of code or JSON into a single "visual" token.
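To make the pixel-based billing concrete, here is a minimal sketch of the arithmetic. It assumes the rough estimate published in Anthropic's vision documentation (image tokens ≈ width × height / 750) and ignores any resizing the API may apply to large images. The monospace glyph cell sizes are hypothetical and are not taken from pxpipe.

```python
# Assumption: approximate rule of thumb from Anthropic's vision docs, not an exact billing formula.
PIXELS_PER_IMAGE_TOKEN = 750


def image_tokens(width_px: int, height_px: int) -> float:
    """Estimated token cost of an image, based only on its pixel dimensions."""
    return width_px * height_px / PIXELS_PER_IMAGE_TOKEN


def page_capacity(width_px: int, height_px: int, cell_w_px: int, cell_h_px: int) -> int:
    """Characters that fit on a page when each glyph occupies a fixed cell."""
    return (width_px // cell_w_px) * (height_px // cell_h_px)


def chars_per_image_token(cell_w_px: int, cell_h_px: int) -> float:
    """Characters carried per image token for a given glyph cell size."""
    return PIXELS_PER_IMAGE_TOKEN / (cell_w_px * cell_h_px)


# Hypothetical cell sizes: smaller glyphs pack more characters into each token.
for cell in [(11, 22), (8, 16), (6, 10)]:
    print(cell, round(chars_per_image_token(*cell), 2))
```

Under these assumptions the density depends only on glyph size: the cost of the page stays the same whether it is blank or full, which is the disparity the tool exploits.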
Infrastructure Steganography vs. API Bills
The tool functions as a local proxy that intercepts requests to Claude Code or Fable 5. It identifies heavy, static blocks (system prompts, tool documentation, or deep chat history) and "photographs" them into a dense PNG. Developer Steven Chong demonstrated a case where a 48,000-character documentation block was rendered onto a single page. As plain text, this block would have consumed 25,000 tokens; as an image, it cost only 2,700 tokens.
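The rewriting step described above can be sketched roughly as follows. This is not pxpipe's actual code: the function name `offload_static_blocks`, the `min_chars` and `keep_recent` thresholds, and the `render_png` callback are all hypothetical. The sketch assumes the request carries Anthropic-style message content blocks and, as a further assumption, converts only blocks in user turns.

```python
import base64
from typing import Callable


def offload_static_blocks(
    messages: list[dict],
    render_png: Callable[[str], bytes],  # hypothetical renderer: text in, PNG bytes out
    min_chars: int = 4000,               # hypothetical threshold for a "heavy" block
    keep_recent: int = 2,                # newest messages are left as text
) -> list[dict]:
    """Return a copy of `messages` with long text blocks in older user turns replaced by images."""
    cutoff = len(messages) - keep_recent
    rewritten = []
    for index, message in enumerate(messages):
        content = message["content"]
        # A plain string is shorthand for a single text block.
        if isinstance(content, str):
            content = [{"type": "text", "text": content}]
        convertible = index < cutoff and message["role"] == "user"
        blocks = []
        for block in content:
            is_long_text = block.get("type") == "text" and len(block["text"]) >= min_chars
            if convertible and is_long_text:
                png_bytes = render_png(block["text"])
                blocks.append({
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64.b64encode(png_bytes).decode("ascii"),
                    },
                })
            else:
                blocks.append(block)
        rewritten.append({**message, "content": blocks})
    return rewritten
```

A real proxy would plug an actual text renderer into `render_png` and forward the rewritten payload upstream; keeping the most recent messages untouched reflects the idea that only the static part of the context is worth converting.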
Reported savings on heavy contexts are around 70%, and individual sessions can exceed that: in one Fable 5 test, the session cost fell from $42.21 to $6.06, a reduction of roughly 86%.
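The figures quoted above can be checked with a few lines of arithmetic. This sketch uses only the numbers given in the article; the helper name `reduction` is illustrative.

```python
def reduction(before: float, after: float) -> float:
    """Fractional saving when a cost drops from `before` to `after`."""
    return (before - after) / before


token_saving = reduction(25_000, 2_700)  # documentation block, in tokens
cost_saving = reduction(42.21, 6.06)     # Fable 5 session, in dollars

print(f"token reduction: {token_saving:.1%}")        # token reduction: 89.2%
print(f"session cost reduction: {cost_saving:.1%}")  # session cost reduction: 85.6%
```

Both worked examples come out above the 70% headline figure, which suggests that figure describes typical rather than best-case savings.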
This shift from prompt engineering to "infrastructure steganography" signals a maturing market: when API costs are sky-high, engineers start looking for workarounds at the protocol level. Fresh messages stay as text to preserve flexibility, while the dead weight of the context is shifted to the vision encoder. The approach mirrors recent findings from DeepSeek, whose technical report describes compressing documents roughly tenfold while recovering about 97% of the text on decoding.
The Cost of Visual Inference and Price Regulation
There is no such thing as a free lunch: swapping text tokens for pixels introduces latency and accuracy risks. Reading text back out of an image is inherently lossy. Chong admits that exact strings, such as hashes, occasionally turn into "pumpkins" when the model interprets them. Processing is also slower, because the input must pass through a visual encoder. While Fable 5 maintains 100% accuracy in math tests using this method, less capable models stumble: Claude Opus fails in 7% of cases when dealing with dense rendering.
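One way a proxy might limit the damage to exact strings is to keep any line containing a hash-like value as plain text and render only the rest. The sketch below is an illustration of that idea, not a documented pxpipe feature; the pattern and the function name `split_for_rendering` are assumptions, and the heuristic only catches long hexadecimal digests.

```python
import re

# Assumption: treat runs of 32 or more hex digits (MD5, SHA-1, SHA-256 style) as exact strings.
HEX_DIGEST = re.compile(r"\b[0-9a-fA-F]{32,}\b")


def split_for_rendering(text: str) -> tuple[list[str], list[str]]:
    """Split text into lines that may be rendered to an image and lines to keep as text."""
    to_image, keep_as_text = [], []
    for line in text.splitlines():
        if HEX_DIGEST.search(line):
            keep_as_text.append(line)  # must survive character for character
        else:
            to_image.append(line)      # tolerates small reading errors
    return to_image, keep_as_text
```

The trade-off is that every line kept as text is billed at the normal token rate, so the more exact strings a context contains, the smaller the saving.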
Once this trick goes mainstream, Anthropic and other providers will have a strong financial incentive to close the loophole. They are unlikely to tolerate a 70% revenue leak on high-context tasks for long. We should expect a revision of multimodal pricing in which costs depend on information entropy rather than image dimensions. For now, the advantage lies with those exploiting naive billing logic that assumes a picture is worth a thousand words, rather than 48,000 characters of system code.