Text-to-Speech
Definition
AI systems that convert written text into natural-sounding spoken audio; the task is also known as speech synthesis.
Text-to-speech (TTS) has evolved from robotic-sounding systems to voices that can be nearly indistinguishable from human speech. Modern TTS systems such as ElevenLabs, OpenAI TTS, and Google's WaveNet-based voices use neural networks to generate natural prosody, emotion, and intonation; well-known neural TTS architectures include Tacotron, FastSpeech, and VITS. Key advances include zero-shot voice cloning (mimicking a voice from only a few seconds of reference audio), multilingual synthesis, expressive speech with controllable emotion, and real-time generation for conversational AI. Applications span virtual assistants, audiobook narration, accessibility tools, content creation, and dubbing. The technology also raises ethical concerns: voice deepfakes, cloning a voice without the speaker's consent, and fraud through voice impersonation.
Related Terms
Generative AI
AI systems that can create new content such as text, images, audio, video, and code, rather than sim...
Natural Language Processing
A field of AI focused on enabling computers to understand, interpret, generate, and interact with hu...
Speech-to-Text
AI systems that convert spoken language into written text, also known as automatic speech recognitio...