LLM Noise Injection Pipeline

LemonAI was a small London-based AI startup working on transcription and speech NLP. I worked there as an applied NLP intern from January to May 2025. The core problem I worked on was distribution shift: NLP models trained on clean transcripts often degrade in production, because real speech is full of noise that doesn't appear in the training data, including disfluencies such as filled pauses and repetitions, as well as phonetic substitutions. To close that gap, I designed and built a data augmentation pipeline that injects realistic linguistic noise into clean transcription data, expanding the training distribution without requiring additional labelled examples.

The pipeline combined a rule-based module for deterministic, high-volume noise injection with an LLM component, built on the Claude API, for generative, contextually coherent variation: the kind of disfluency that rules alone can't replicate. The two modules were composable, which allowed fine-grained control over noise type and severity so that augmentation could target specific failure modes observed during model evaluation. Working at a small startup meant taking ownership end-to-end: scoping the problem, surveying prior work, designing the system architecture, and iterating quickly without much process overhead. The experience taught me how research translates into production-facing tooling, and gave me a much clearer sense of how LLMs can be used as components inside larger data pipelines rather than as standalone tools.
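
To make the composable design described above concrete, here is a minimal Python sketch of what such a pipeline could look like. It is an illustration under stated assumptions, not the original implementation: every name in it (`filled_pauses`, `repetitions`, `llm_stage`, `compose`, `fake_llm`) is hypothetical, the filler list is an illustrative choice, and the LLM stage is a local stand-in function rather than a real Claude API call.

```python
import random
from typing import Callable, List

# A noise stage maps (text, rng) to a noisier version of the text.
NoiseFn = Callable[[str, random.Random], str]

FILLERS = ["um", "uh", "er"]  # illustrative filler inventory (assumption)


def filled_pauses(severity: float) -> NoiseFn:
    """Rule-based stage: insert a filler before each word with probability `severity`."""
    def apply(text: str, rng: random.Random) -> str:
        out: List[str] = []
        for word in text.split():
            if rng.random() < severity:
                out.append(rng.choice(FILLERS))
            out.append(word)
        return " ".join(out)
    return apply


def repetitions(severity: float) -> NoiseFn:
    """Rule-based stage: repeat each word once with probability `severity`."""
    def apply(text: str, rng: random.Random) -> str:
        out: List[str] = []
        for word in text.split():
            out.append(word)
            if rng.random() < severity:
                out.append(word)
        return " ".join(out)
    return apply


def llm_stage(rewrite: Callable[[str], str]) -> NoiseFn:
    """Wrap any text-to-text callable (for example an LLM client) as a stage."""
    def apply(text: str, rng: random.Random) -> str:
        return rewrite(text)
    return apply


def compose(*stages: NoiseFn) -> NoiseFn:
    """Chain stages left to right into a single noise function."""
    def apply(text: str, rng: random.Random) -> str:
        for stage in stages:
            text = stage(text, rng)
        return text
    return apply


def fake_llm(text: str) -> str:
    """Stand-in for a real model call; prepends a simple restart phrase."""
    return "I mean, " + text


# Noise type is chosen by which stages are composed; severity is set per stage.
pipeline = compose(filled_pauses(0.2), repetitions(0.1), llm_stage(fake_llm))
noisy = pipeline("please schedule the meeting for tuesday", random.Random(0))
```

Passing a seeded `random.Random` keeps the rule-based stages reproducible, and treating the LLM as just another stage means it can be swapped, reordered, or dropped without touching the rest of the pipeline.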