Speech-to-Text
Definition
AI systems that convert spoken language into written text, also known as automatic speech recognition (ASR) or speech recognition.
Speech-to-text technology has progressed from command-based systems to highly accurate continuous speech recognition. Modern ASR systems such as OpenAI's Whisper, Google Speech-to-Text, and Deepgram use deep learning architectures (primarily transformers and conformers) trained on hundreds of thousands of hours of labeled audio. Whisper demonstrated that a single model can handle multiple languages, accents, and background noise conditions with high accuracy. Key challenges include diverse accents, background noise, multiple speakers, domain-specific terminology, and real-time processing. Applications include voice assistants, meeting transcription, closed captioning, medical dictation, and accessibility tools. Accuracy is usually reported as word error rate (WER), the share of reference words that were substituted, deleted, or inserted. For clear speech in supported languages, modern systems often exceed 95% word accuracy, which corresponds to a WER below 5%.
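To make the word accuracy figure concrete, here is a minimal sketch of how it is commonly computed. It assumes the usual definition WER = (S + D + I) / N, where S, D, and I are the substituted, deleted, and inserted words and N is the number of words in the reference transcript, with word accuracy taken as 1 - WER. The function name `word_error_rate` and the simple lowercase, whitespace-based tokenization are illustrative assumptions; production evaluations typically apply more careful text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Illustrative sketch: tokenization is lowercase + whitespace split only.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one previous row.
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over a lazy dog"
    wer = word_error_rate(reference, hypothesis)
    print(f"WER: {wer:.2%}")            # two substitutions out of nine words
    print(f"Word accuracy: {1 - wer:.2%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, so word accuracy derived this way can be negative for very poor transcripts.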
Related Terms
Natural Language Processing
A field of AI focused on enabling computers to understand, interpret, generate, and interact with hu...
Text-to-Speech
AI systems that convert written text into natural-sounding spoken audio, also known as speech synthe...
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequ...