Traditional data de-identification is often a blunt instrument that transforms valuable business intelligence into useless digital noise. As Anuj Sadani and Deepak Kumar of Infrrd.ai point out, when you replace names and addresses with static placeholders like [PERSON] or [LOCATION], the data loses its utility for training NER models or building search indices in RAG systems. Modern enterprises don't just need to redact sensitive information; they need to intelligently swap it with realistic surrogates that preserve document structure and context, all without sending data to the cloud.
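To see why static placeholders hurt, compare the two strategies on a toy line of text. This is a purely illustrative sketch, not Infrrd's pipeline; the entities and surrogates are made up:

```python
# Illustrative only: contrast static redaction with surrogate swapping.
text = "Invoice for John Doe, 42 Elm Street, due 2024-03-01."

# Static redaction: every entity collapses into the same placeholder,
# erasing the distributional signal that NER training and RAG
# indexing depend on.
redacted = (text.replace("John Doe", "[PERSON]")
                .replace("42 Elm Street", "[LOCATION]"))

# Surrogate replacement: entity types, formats, and document
# structure survive, so the text stays useful downstream.
surrogated = (text.replace("John Doe", "Mark Rivers")
                  .replace("42 Elm Street", "17 Oak Avenue"))

print(redacted)    # Invoice for [PERSON], [LOCATION], due 2024-03-01.
print(surrogated)  # Invoice for Mark Rivers, 17 Oak Avenue, due 2024-03-01.
```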
Infrrd’s proposed technical stack solves this through a local pipeline optimized for extreme efficiency. The architecture combines a 1.5B-parameter Mixture-of-Experts (MoE) token classifier, which identifies sensitive fragments, with a 1-bit Small Language Model (SLM) called Bonsai-1.7B, which generates replacements. This pairing lets ordinary local processors produce contextually appropriate fictitious names and dates. To maintain document integrity, the system employs a rule-based generator that keeps replacements stable: once a name is swapped, the same surrogate is used throughout the text. This is critical for report coherence; you can't have 'John Doe' morphing into 'Jane Smith' by the third paragraph.
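A minimal sketch of that consistency rule, assuming a simple cache keyed by the original entity (the SurrogateMap class and the toy name list below are hypothetical stand-ins, not Infrrd's actual code):

```python
import random

class SurrogateMap:
    """Keep replacements consistent: the first occurrence of an entity
    gets a surrogate, and every later occurrence reuses the same one."""

    def __init__(self, generate):
        self.generate = generate  # callable: (entity_type, original) -> surrogate
        self.mapping = {}

    def replace(self, entity_type, original):
        key = (entity_type, original)
        if key not in self.mapping:
            self.mapping[key] = self.generate(entity_type, original)
        return self.mapping[key]

# Stand-in for the SLM / rule-based generator described above.
NAMES = ["Mark Rivers", "Ana Costa", "Liam Patel"]
generate = lambda etype, orig: random.choice(NAMES) if etype == "PERSON" else orig

smap = SurrogateMap(generate)
first = smap.replace("PERSON", "John Doe")
later = smap.replace("PERSON", "John Doe")
assert first == later  # 'John Doe' never morphs into someone else
```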
The main hurdle with models compressed to 1-bit was 'regurgitation', where the SLM blindly copies prompt examples instead of generating new data. Infrrd's researchers demonstrated that this wasn't a quantization bug but a failure in prompt logic: both 1-bit and 1.58-bit ternary models produced identical errors when given static examples. The fix? A mechanism called Locale-Conditioned Few-Shot Prompting, in which the system uses an MD5 hash of the input to select examples from a regionally filtered pool. The result: 482 out of 482 model calls were successful, eliminating 'echoes' while maintaining linguistic precision.
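The selection mechanism is easy to sketch. The version below is an illustrative reconstruction that rotates through the pool based on the hash; the actual pool contents, locale keys, and index derivation used by Infrrd are not detailed in the source:

```python
import hashlib

# Hypothetical locale-filtered example pools.
EXAMPLE_POOLS = {
    "en-US": [
        "John Doe -> Mark Rivers",
        "Austin, TX -> Dayton, OH",
        "03/14/2023 -> 07/22/2021",
        "Acme Corp -> Birchwood LLC",
    ],
}

def pick_few_shot(text, locale, k=2):
    """Deterministically pick k few-shot examples by hashing the input:
    the same document always gets the same examples, while different
    documents rotate through the pool, so the model never sees one
    static example set it could parrot back."""
    pool = EXAMPLE_POOLS[locale]
    start = int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16) % len(pool)
    return [pool[(start + i) % len(pool)] for i in range(k)]

print(pick_few_shot("Invoice for John Doe, 42 Elm Street...", "en-US"))
```

Because the hash varies with the input, the few-shot slots stop being a fixed target to copy, which is exactly the failure mode the static examples triggered.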
However, these polished SLM outputs carry a trap for CTOs and tech leads. Infrrd’s tests provide a sobering reality check: an NER model trained on the original data reached an F1 score of 0.960, while training on data produced by the hybrid SLM approach dropped the score to 0.346, behind even a simple rule-based generator (0.506). The irony is that for training algorithms, the high variability of simple rules proved more useful than the 'natural' sound of a neural network. Today’s priority isn't just achieving human-like output; it’s expanding the diversity of synthetic distributions so that compliance filters don't accidentally narrow the horizons of the models they are meant to feed.
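One way to make that diversity argument measurable, offered here purely as an illustration rather than Infrrd's published evaluation, is a distinct-surrogate ratio over the generated replacements:

```python
from collections import Counter

def surrogate_diversity(replacements):
    """Distinct-surrogate ratio: 1.0 means every replacement is unique;
    values near 0 mean the generator keeps recycling a few favorites,
    narrowing the distribution the downstream model trains on."""
    return len(Counter(replacements)) / max(len(replacements), 1)

# A fluent SLM that favors a handful of 'natural' names can be far
# less diverse than a crude rule-based generator sampling uniformly.
slm_out  = ["Mark Rivers"] * 6 + ["Ana Costa"] * 4
rule_out = [f"Name-{i:03d}" for i in range(10)]
print(surrogate_diversity(slm_out))   # 0.2
print(surrogate_diversity(rule_out))  # 1.0
```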