Authors: Aboli Ashok Ugale, Associate Professor Vijay B. More
Abstract: Neural codec language models (NCLMs) have recently emerged as a powerful paradigm for unified speech generation and transformation. By modeling discrete acoustic tokens extracted from neural audio codecs, these systems enable scalable solutions for text-to-speech (TTS), voice conversion, speech enhancement, and editing within a single generative framework. This paper presents a comprehensive review of representative models including AudioLM, VALL-E, Voicebox, NaturalSpeech 2, and SpeechX, analyzing their architectural design, probabilistic modeling strategies, computational complexity, and task generalization capabilities. A comparative study highlights the tradeoff between perceptual quality and inference efficiency across autoregressive and diffusion-based approaches. Furthermore, existing research gaps in discrete representation fidelity, evaluation standardization, and multi-task optimization are identified. Finally, a conceptual extension termed SpeechX++ is discussed to address limitations through emotion conditioning, multilingual adaptation, and efficient inference strategies. The review demonstrates the ongoing transition toward general-purpose speech foundation models capable of robust, scalable, and ethically responsible deployment.
